Cell microprocessor implementations

{{context|date=January 2020}}
'''Cell''' microprocessors are multi-core processors that use cellular architecture for high performance distributed computing. The first commercial [[Cell microprocessor]], the Cell BE, was designed for the Sony PlayStation 3. IBM designed the PowerXCell 8i for use in the [[Roadrunner supercomputer]].<ref>
Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, Jose C. Sancho.
[https://www.academia.edu/4460100/Entering_the_petaflop_era_the_architecture_and_performance_of_Roadrunner "Entering the Petaflop Era: The Architecture and Performance of Roadrunner"].
</ref>
 
==Implementation==
{{Cell microprocessor segments}}
 
===First edition Cell on 90 nm CMOS===
{| class="wikitable" align="right" style="font-size:90%"
|+ Known Cell variants in 90&nbsp;nm process
! Designation !! Die area !! First disclosed !! Enhancement
|-
| DD1 || 221&nbsp;mm<sup>2</sup> || ISSCC 2005 ||
|-
| DD2 || 235&nbsp;mm<sup>2</sup> || Cool Chips April 2005 || Enhanced PPE core
|}
IBM has published information concerning two different versions of Cell in this process: an early engineering sample designated ''DD1'' and an enhanced version designated ''DD2'' intended for production.
 
The main enhancement in DD2 was a small lengthening of the die to accommodate a larger PPE core, which is reported to "contain more SIMD/vector execution resources".{{ref|dtwang3}}
Some preliminary information released by IBM references the DD1 variant. As a result, some early journalistic accounts of the Cell's capabilities now differ from production hardware.
 
{{clear}}
====Cell floorplan====
<!-- I wasn't able to square-up the image when measuring dimensions off screen, since I only had area, but not W*H so these are a little more approx. than they might have been; I had W*H for the included SPU but the top four SPU measure differently than the bottom four in the way they overdrew the boundaries, leaving me with the same problem -->
 
{| class="wikitable" align="right" style="font-size:90%"
|+ Cell function units and footprint
! Cell function unit !! Area (%) !! Description
|-
| XDR interface || {{0}}5.7% || Interface to Rambus system memory
|-
| memory controller || {{0}}4.4% || Manages external memory and L2 cache
|-
| 512 KiB L2 cache || 10.3% || Cache memory for the PPE
|-
| PPE core || 11.1% || PowerPC processor
|-
| test || {{0}}2.0% || Unspecified "test and decode logic"
|-
| EIB || {{0}}3.1% || Element interconnect bus linking processors
|-
| SPE (each) × 8 || {{0}}6.2% || Synergistic coprocessing element
|-
| I/O controller || {{0}}6.6% || External I/O logic
|-
| Rambus FlexIO || {{0}}5.7% || External signalling for I/O pins
|}
PowerPoint material accompanying an STI presentation given by Dr Peter Hofstee includes a photograph of the DD2 Cell die overdrawn with functional unit boundaries, each captioned by name. It reveals the breakdown of silicon area by function unit shown in the table.
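
As a rough cross-check, the percentages in the floorplan table can be converted to absolute areas. The sketch below assumes the percentages apply to the 235&nbsp;mm<sup>2</sup> DD2 die listed earlier; the names <code>FOOTPRINT_PCT</code>, <code>DD2_DIE_AREA_MM2</code> and <code>area_mm2</code> are illustrative only and do not come from the IBM material.

```python
# Area shares (% of die) as listed in the Cell floorplan table.
FOOTPRINT_PCT = {
    "XDR interface": 5.7,
    "memory controller": 4.4,
    "512 KiB L2 cache": 10.3,
    "PPE core": 11.1,
    "test": 2.0,
    "EIB": 3.1,
    "SPE (x8)": 8 * 6.2,  # eight SPEs at 6.2% each
    "I/O controller": 6.6,
    "Rambus FlexIO": 5.7,
}

# Assumption: the percentages refer to the 235 mm^2 DD2 die.
DD2_DIE_AREA_MM2 = 235.0


def area_mm2(percent, die_area=DD2_DIE_AREA_MM2):
    """Convert a percentage of the die into square millimetres."""
    return die_area * percent / 100.0


total_pct = sum(FOOTPRINT_PCT.values())
spe_total_mm2 = area_mm2(FOOTPRINT_PCT["SPE (x8)"])

if __name__ == "__main__":
    for unit, pct in FOOTPRINT_PCT.items():
        print(f"{unit:20s} {pct:5.1f}%  {area_mm2(pct):6.1f} mm^2")
    print(f"accounted for: {total_pct:.1f}% of the die")
```

Under this assumption the listed units account for about 98.5% of the die, with the eight SPEs together taking roughly half of it.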
 
{{clear}}
 
====SPE floorplan====
 
<!-- if I had found a way I would have made the table smaller than the surrounding text, esp. if the surrounding text is largish, but that's asking a lot -->
<!-- bad effects at certain Firefox sizings with style="text-align:left" -->
{| class="wikitable" align="right" style="font-size:90%"
|+ SPU function units and footprint
! SPU function unit !! Area (%) !! Description !! Pipe
|-
| single precision || 10.0% || single precision FP execution unit || even
|-
| double precision || {{0}}4.4% || double precision FP execution unit || even
|-
| simple fixed || {{0}}3.25% || fixed point execution unit || even
|-
| issue control || {{0}}2.5% || feeds execution units ||
|-
| forward macro || {{0}}3.75% || feeds execution units ||
|-
| GPR || {{0}}6.25% || general purpose register file ||
|-
| permute || {{0}}3.25% || permute execution unit || odd
|-
| branch || {{0}}2.5% || branch execution unit || odd
|-
| channel || {{0}}6.75% || channel interface (three discrete blocks) || odd
|-
| LS0–LS3 || 30.0% || four 64 KiB blocks of local store || odd
|-
| MMU || {{0}}4.75% || memory management unit ||
|-
| DMA || {{0}}7.5% || direct memory access unit ||
|-
| BIU || {{0}}9.0% || bus interface unit ||
|-
| RTB || {{0}}2.5% || array built-in test block (ABIST) ||
|-
| ATO || {{0}}1.6% || atomic unit for atomic DMA updates ||
|-
| HB || {{0}}0.5% || obscure ||
|}
<!-- OK, I see the method for aligning columns by decimal points in the table help. Not for this chicken. Some diehard can suffer or this can wait until MediaWiki is fixed properly. -->
 
Additional details concerning the internal SPE implementation have been disclosed by IBM engineers, including [[Peter Hofstee]], IBM's chief architect of the synergistic processing element, in a scholarly IEEE publication.{{ref|90nmsoi}}
 
This document includes a photograph of the 2.54&nbsp;mm × 5.81&nbsp;mm SPE, as implemented in 90&nbsp;nm [[Silicon on Insulator|SOI]]. In this technology, the SPE contains 21 million transistors, of which 14 million are contained in arrays (a term presumably designating register files and the local store) and 7 million are logic. The photograph is overdrawn with functional unit boundaries, each captioned by name, revealing the breakdown of silicon area by function unit shown in the table.
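
The quoted dimensions and transistor counts can be combined into a few derived figures. The following is only a back-of-the-envelope sketch using the numbers stated above; the variable names are illustrative, and the local store share of 30% is taken from the SPU table.

```python
# Figures quoted in the text for the 90 nm SPE.
SPE_WIDTH_MM = 2.54
SPE_HEIGHT_MM = 5.81
TRANSISTORS_TOTAL = 21e6
TRANSISTORS_ARRAYS = 14e6
LOCAL_STORE_SHARE = 0.30  # LS0-LS3 row of the SPU footprint table

# Area of one SPE in square millimetres.
spe_area_mm2 = SPE_WIDTH_MM * SPE_HEIGHT_MM

# Fraction of transistors that sit in arrays rather than logic.
array_fraction = TRANSISTORS_ARRAYS / TRANSISTORS_TOTAL

# Average transistor density over the whole SPE.
density_per_mm2 = TRANSISTORS_TOTAL / spe_area_mm2

# Approximate silicon area taken by the four local store blocks.
local_store_mm2 = spe_area_mm2 * LOCAL_STORE_SHARE

if __name__ == "__main__":
    print(f"SPE area:        {spe_area_mm2:.2f} mm^2")
    print(f"array fraction:  {array_fraction:.0%}")
    print(f"density:         {density_per_mm2 / 1e6:.2f} M transistors/mm^2")
    print(f"local store:     {local_store_mm2:.2f} mm^2")
```

The resulting SPE area of roughly 14.8&nbsp;mm<sup>2</sup> is close to 6.2% of the 235&nbsp;mm<sup>2</sup> DD2 die (about 14.6&nbsp;mm<sup>2</sup>), so the two sets of published figures appear broadly consistent.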
 
Understanding the dispatch pipes is important for writing efficient code. In the SPU architecture, two instructions can be dispatched (started) in each clock cycle using dispatch pipes designated ''even'' and ''odd''. The two pipes provide different execution units, as shown in the table above. As IBM partitioned this, most arithmetic instructions execute on the ''even'' pipe, while most memory instructions execute on the ''odd'' pipe. The permute unit is closely associated with memory instructions, as it serves to pack and unpack data structures located in memory into the SIMD multiple-operand format on which the SPU computes most efficiently.
 
Unlike other processor designs providing distinct execution pipes, each SPU instruction can only dispatch on one designated pipe. In competing designs, more than one pipe might be designed to handle extremely common instructions such as ''add'', permitting two or more of these instructions to be executed concurrently, which can increase efficiency on unbalanced workloads. In keeping with the SPU's extremely Spartan design philosophy, no execution units are multiply provisioned.
 
Understanding the limitations of the restrictive two pipeline design is one of the key concepts a programmer must grasp to write efficient SPU code at the lowest level of abstraction. For programmers working at higher levels of abstraction, a good compiler will automatically balance pipeline concurrency where possible.
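
The effect of pipe balance on throughput can be illustrated with a toy model. The sketch below is not IBM's issue logic: it assumes, as a simplification, that two adjacent instructions start in the same cycle only when the first uses the even pipe and the second uses the odd pipe, and it ignores data dependencies, latencies and instruction alignment. The function name <code>issue_cycles</code> is hypothetical.

```python
EVEN, ODD = "even", "odd"


def issue_cycles(pipes):
    """Count issue cycles for an in-order stream of instructions.

    `pipes` lists the pipe (EVEN or ODD) each instruction needs, in
    program order. Simplifying assumption: a pair issues together only
    when an even-pipe instruction is immediately followed by an
    odd-pipe instruction; everything else issues one per cycle.
    """
    cycles = 0
    i = 0
    while i < len(pipes):
        if i + 1 < len(pipes) and pipes[i] == EVEN and pipes[i + 1] == ODD:
            i += 2  # dual issue: both pipes used this cycle
        else:
            i += 1  # single issue: the other pipe stays idle
        cycles += 1
    return cycles


interleaved = [EVEN, ODD] * 4        # arithmetic and memory ops alternate
clustered = [EVEN] * 4 + [ODD] * 4   # same mix, but grouped by pipe

if __name__ == "__main__":
    for name, stream in (("interleaved", interleaved), ("clustered", clustered)):
        n = issue_cycles(stream)
        print(f"{name}: {n} cycles, IPC {len(stream) / n:.2f}")
```

In this model the same eight instructions take four cycles when interleaved and seven when grouped by pipe, which is the kind of difference a scheduling compiler or a careful assembly programmer tries to recover.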
{{clear}}
 
====SPE power and performance====
{| class="wikitable" align="right" style="font-size:90%"
|+ Relationship of speed to temperature
! Voltage (V) !! Frequency (GHz) !! Power (W) !! Die temp (°C)
|-
| 0.9 || 2.0 || {{0}}1 || 25
|-
| 0.9 || 3.0 || {{0}}2 || 27
|-
| 1.0 || 3.8 || {{0}}3 || 31
|-
| 1.1 || 4.0 || {{0}}4 || 38
|-
| 1.2 || 4.4 || {{0}}7 || 47
|-
| 1.3 || 5.0 || 11 || 63
|}
As tested by IBM under a heavy transformation and lighting workload (average IPC of 1.4), the performance profile of this implementation for a single SPU is characterized in the table.
 
The entry for 2.0&nbsp;GHz operation at 0.9&nbsp;V represents a low-power configuration. Other entries show the peak stable operating frequency achieved with each voltage increment. As a general rule in CMOS circuits, power dissipation rises roughly in proportion to V{{sup|2}}F, the square of the voltage times the operating frequency.
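
How rough this rule is can be seen by scaling it to the first table entry and comparing it with the remaining measurements. The sketch below uses only the values from the table; normalizing to the 0.9&nbsp;V, 2.0&nbsp;GHz entry is an arbitrary choice made for illustration, and the names are hypothetical.

```python
# (voltage in V, frequency in GHz, measured power in W) from the table.
MEASUREMENTS = [
    (0.9, 2.0, 1),
    (0.9, 3.0, 2),
    (1.0, 3.8, 3),
    (1.1, 4.0, 4),
    (1.2, 4.4, 7),
    (1.3, 5.0, 11),
]


def v2f(voltage, freq_ghz):
    """Dynamic-power proxy: voltage squared times frequency."""
    return voltage ** 2 * freq_ghz


# Scale the proxy so that it matches the first (low-power) entry.
base_v, base_f, base_p = MEASUREMENTS[0]
scale = base_p / v2f(base_v, base_f)
predicted = [scale * v2f(v, f) for v, f, _ in MEASUREMENTS]

if __name__ == "__main__":
    for (v, f, measured), estimate in zip(MEASUREMENTS, predicted):
        print(f"{v:.1f} V  {f:.1f} GHz  measured {measured:2d} W  "
              f"V^2*F estimate {estimate:4.1f} W")
```

With this normalization the estimate for the 1.3&nbsp;V, 5.0&nbsp;GHz entry is a little over 5&nbsp;W against a reported 11&nbsp;W, so the measured power climbs faster than V{{sup|2}}F alone would suggest. The coarse 1&nbsp;W resolution of the published figures and effects the rule does not model, such as leakage at higher temperatures, may both contribute.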
 
Though the power measurements provided by the IBM authors lack precision, they convey a good sense of the overall trend. These figures show the part is capable of running above 5&nbsp;GHz under test lab conditions, though at a die temperature too hot for standard commercial configurations. The first Cell processors made commercially available were rated by IBM to run at 3.2&nbsp;GHz, an operating speed where this chart suggests an SPU die temperature in a comfortable vicinity of 30&nbsp;°C.
 
Note that a single SPU represents 6% of the Cell processor's die area. The power figures given in the table above represent just a small portion of the overall power budget.
 
IBM publicly announced its intention to implement Cell on a future technology below the 90&nbsp;nm node to improve power consumption. Reduced power consumption could ''potentially'' allow the existing design to be boosted to 5&nbsp;GHz or above without exceeding the thermal constraints of existing products.
{{clear}}
 
====Cell at 65 nm====
The first shrink of Cell was at the 65&nbsp;nm node. The reduction to 65&nbsp;nm shrank the existing 230&nbsp;mm<sup>2</sup> die based on the 90&nbsp;nm process to about half its size, roughly 120&nbsp;mm<sup>2</sup>, greatly reducing IBM's manufacturing cost as well.
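
The "about half" figure follows from ideal geometric scaling, in which die area shrinks with the square of the feature size. The sketch below assumes such an ideal optical shrink, which real designs only approximate; the function name <code>ideal_shrink_area</code> is illustrative.

```python
def ideal_shrink_area(area_mm2, old_node_nm, new_node_nm):
    """Area after an ideal shrink: scales with the square of the node ratio.

    Assumption: every structure scales linearly with the process node,
    which real designs only approximate (pads and analog blocks, for
    example, typically shrink less).
    """
    return area_mm2 * (new_node_nm / old_node_nm) ** 2


# 230 mm^2 at 90 nm, shrunk to 65 nm.
area_65 = ideal_shrink_area(230.0, 90.0, 65.0)

if __name__ == "__main__":
    print(f"ideal 65 nm die area: {area_65:.1f} mm^2")
    print(f"ratio to 90 nm die:   {area_65 / 230.0:.2f}")
```

The ideal ratio is about 0.52, which gives a die of roughly 120&nbsp;mm<sup>2</sup>.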
 
On 12 March 2007, IBM announced that it had started producing 65&nbsp;nm Cells in its East Fishkill fab. The chips produced there were apparently only for IBM's own Cell [[Computing blade|blade]] servers, which were the first to get the 65&nbsp;nm Cells. Sony introduced the third generation of the PS3 in November 2007, the 40&nbsp;GB model without PS2 compatibility, which was [https://www.engadget.com/2007/10/30/40gb-ps3-features-65nm-chips-lower-power-consumption/ confirmed] to use the 65&nbsp;nm Cell. Thanks to the shrunk Cell, power consumption was reduced from 200{{nbsp}}W to 135{{nbsp}}W.
 
At first it was only known that the 65&nbsp;nm Cells clock up to 6&nbsp;GHz and run on a 1.3{{nbsp}}V core voltage, as [http://news.spong.com/article/11413?cb=936 demonstrated] at [[ISSCC]] 2007. This would have given the chip a theoretical peak performance of 384{{nbsp}}GFLOPS in single precision, a significant improvement over the 204.8{{nbsp}}GFLOPS peak that a 90&nbsp;nm 3.2&nbsp;GHz Cell could provide with 8 active SPUs. IBM further announced that it had implemented new power-saving features and a dual power supply for the SRAM array. This version was not yet the long-rumoured "Cell+" with enhanced double-precision floating-point performance, which first saw the light of day in mid-2008 in the [[IBM Roadrunner|Roadrunner supercomputer]] in the form of [[QS22#Cell based|QS22]] PowerXCell blades. Although IBM had talked about and even shown higher-clocked Cells, clock speed has remained constant at 3.2&nbsp;GHz, even for the double-precision-enabled "Cell+" of the Roadrunner. By keeping clock speed constant, IBM has instead opted to reduce power consumption. PowerXCell clusters even best IBM's [[Blue Gene]] clusters (371{{nbsp}}MFLOPS/watt), which are already far more power-efficient than clusters made up of conventional CPUs (265{{nbsp}}MFLOPS/watt and lower).
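
Both peak figures can be reproduced from the SPU count and the clock rate. The sketch below assumes each SPU completes one 4-wide single-precision fused multiply-add per cycle, counted as two floating-point operations per lane, and it leaves the PPE out of the total; the function name and parameter names are illustrative.

```python
def peak_sp_gflops(clock_ghz, spus=8, simd_lanes=4, flops_per_lane=2):
    """Theoretical single-precision peak of the SPUs in GFLOPS.

    Assumptions: each SPU issues one SIMD fused multiply-add per cycle
    across `simd_lanes` 32-bit lanes, and a multiply-add counts as two
    floating-point operations. The PPE is not included.
    """
    return spus * simd_lanes * flops_per_lane * clock_ghz


if __name__ == "__main__":
    print(f"3.2 GHz: {peak_sp_gflops(3.2):.1f} GFLOPS")
    print(f"6.0 GHz: {peak_sp_gflops(6.0):.1f} GFLOPS")
```

Under these assumptions eight SPUs deliver 64 floating-point operations per cycle, giving 204.8&nbsp;GFLOPS at 3.2&nbsp;GHz and 384&nbsp;GFLOPS at 6&nbsp;GHz.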
 
===Future editions in CMOS===
 
====Prospects at 45 nm====
At ISSCC 2008, IBM [https://arstechnica.com/news.ars/post/20080207-ibm-shrinks-cell-to-45nm-cheaper-ps3s-will-follow.html announced] Cell at the 45&nbsp;nm node. IBM said it would require 40 percent less power than its 65&nbsp;nm predecessor at the same clock speed and that the die area would shrink by 34 percent. The 45&nbsp;nm Cell requires less cooling and allows for cheaper production, partly through the use of a much smaller heatsink. Mass production was initially slated to begin in late 2008 but was moved to [https://www.engadget.com/2008/09/22/sony-and-toshiba-to-begin-mass-producing-45nm-cell-processor-in/ early 2009].
 
====Prospects beyond 45 nm====
In January 2006, Sony, IBM and Toshiba [https://www.theregister.co.uk/2006/01/12/ibm_sony_toshiba_32nm_cell/ announced] that they would begin work on a Cell as small as 32&nbsp;nm. Because process shrinks in fabs usually happen on a global scale rather than for an individual chip, this amounted to a public commitment to take Cell to 32&nbsp;nm.
 
==References==
{{Reflist}}
 
{{Cell microprocessor segments}}
 
[[Category:Cell BE architecture]]