{{context|date=January 2020}}
'''Cell''' microprocessors are multi-core processors that use cellular architecture for high-performance distributed computing. The first commercial [[Cell microprocessor]], the Cell BE, was designed for the Sony PlayStation 3. IBM designed the PowerXCell 8i for use in the [[Roadrunner supercomputer]].<ref>
Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, Jose C. Sancho.
[https://www.academia.edu/4460100/Entering_the_petaflop_era_the_architecture_and_performance_of_Roadrunner "Entering the Petaflop Era:The Architecture and Performance of Roadrunner"].
</ref>
==Implementation==
===First edition Cell on 90 nm CMOS===
IBM has published information concerning two different versions of Cell in this process, an early engineering sample designated ''DD1'', and an enhanced version designated ''DD2'' intended for production.
{| class="wikitable" align="left"
|+ '''Known Cell Variants in 90 nm Process'''
! Designation || Die Area (mm²) || First Disclosed || Enhancement
|-
| DD1 || 221
|-
| DD2 || 235
|}
The main enhancement in DD2 was a small lengthening of the die to accommodate a larger PPE core, which is reported to "contain more SIMD/vector execution resources"{{ref|dtwang3}}.
Some preliminary information released by IBM references the DD1 variant. As a result, some early journalistic accounts of the Cell's capabilities now differ from production hardware.
{{clear}}
====Cell floorplan====
PowerPoint material accompanying an STI presentation given by Dr Peter Hofstee includes a photograph of the DD2 Cell die overdrawn with functional unit boundaries, which are also captioned by name. This reveals the breakdown of silicon area by function unit as follows:
<!-- I wasn't able to square-up the image when measuring dimensions off screen, since I only had area, but not W*H so these are a little more approx. than they might have been; I had W*H for the included SPU but the top four SPU measure differently than the bottom four in the way they overdrew the boundaries, leaving me with the same problem -->
{| class="wikitable" align="right"
! Cell function unit || Area || Description
|-
| XDR interface || {{0}}5.7% ||
|-
| memory controller || {{0}}4.4% ||
|-
| 512 KiB L2 cache || 10.3% ||
|-
| PPE core || 11.1% || PowerPC processor
|-
| test || {{0}}2.0% ||
|-
| EIB || {{0}}3.1% ||
|-
| SPE (each)
|-
| I/O controller || {{0}}6.6% ||
|-
| Rambus FlexIO || {{0}}5.7% ||
|}
====SPE floorplan====
Additional details concerning the internal SPE implementation have been disclosed by IBM engineers, including [[Peter Hofstee]], IBM's chief architect of the synergistic processing element, in a scholarly IEEE publication.{{ref|90nmsoi}}
This document includes a photograph of the 2.54 mm × 5.81 mm SPE, as implemented in 90 nm [[SOI]]. In this technology, the SPE contains 21 million transistors, of which 14 million are contained in arrays (a term presumably designating register files and the local store) and 7 million are logic. The photograph is overdrawn with functional unit boundaries, which are also captioned by name. This reveals the breakdown of silicon area by function unit as follows:
<!-- if I had found a way I would have made the table smaller than the surrounding text, esp. if the surrounding text is largish, but that's asking a lot -->
{| class="wikitable" <!-- bad effects at certain Firefox sizings with style="text-align:left" -->
! SPU function || Area || Description || Pipe
|-
| single precision || 10.0% || single precision FP execution unit || even
|-
| forward macro || {{0}}3.75% || feeds execution units ||
|-
| channel || {{0}}6.75% || channel interface (three discrete blocks) || odd
|-
| LS0-LS3 || 30.0% || four 64 KiB blocks of local store || odd
|-
| MMU || {{0}}4.75% || memory management unit ||
|-
| DMA || {{0}}7.5% || direct memory access unit ||
|-
| BIU || {{0}}9.0% || bus interface unit ||
|-
| RTB || {{0}}2.5% || array built-in test block (ABIST) ||
|-
| ATO || {{0}}1.6% || atomic unit for atomic DMA updates ||
|-
| HB || {{0}}0.5% || obscure ||
|}
<!-- OK, I see the method for aligning columns by decimal points in the table help. Not for this chicken. Some diehard can suffer or this can wait until MediaWiki is fixed properly. -->
Understanding the dispatch pipes is important for writing efficient code. In the SPU architecture, two instructions can be dispatched (started) in each clock cycle using dispatch pipes designated ''even'' and ''odd''. The two pipes provide different execution units, as shown in the table above. As IBM partitioned this, most of the arithmetic instructions execute on the ''even'' pipe, while most of the memory instructions execute on the ''odd'' pipe. The permute unit is closely associated with memory instructions, as it serves to pack and unpack data structures located in memory into the SIMD multiple operand format that the SPU computes on most efficiently.
Unlike other processor designs providing distinct execution pipes, each SPU instruction can only dispatch on one designated pipe. In competing designs, more than one pipe might be designed to handle extremely common instructions such as ''add'', permitting two or more of these instructions to be executed concurrently, which can serve to increase efficiency on unbalanced workflows. In keeping with the extremely Spartan design philosophy, no execution units in the SPU are multiply provisioned.
Understanding the limitations of the restrictive two-pipeline design is one of the key concepts a programmer must grasp to write efficient SPU code at the lowest level of abstraction. For programmers working at higher levels of abstraction, a good compiler will automatically balance pipeline concurrency where possible.
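The effect of instruction ordering on the two dispatch pipes can be illustrated with a toy model. The sketch below is a deliberate simplification, not a description of the real hardware: the instruction class names and the <code>PIPE</code> mapping are hypothetical labels chosen to match the pipe assignments described above, and the model assumes that two adjacent instructions dispatch together whenever they target different pipes. Data dependencies, latencies, and any ordering or alignment constraints of the actual SPU are ignored.

```python
from typing import Dict, List

# Hypothetical instruction classes mapped to dispatch pipes, following the
# text: arithmetic on the even pipe, memory-related work on the odd pipe.
PIPE: Dict[str, str] = {
    "fp_single": "even",
    "int_add": "even",
    "load": "odd",
    "store": "odd",
    "permute": "odd",
    "channel": "odd",
}


def count_issue_cycles(instructions: List[str]) -> int:
    """Count dispatch cycles for an in-order stream under the toy model.

    Assumption: two adjacent instructions dispatch in the same cycle only
    if they use different pipes; otherwise one instruction dispatches alone.
    """
    pipes = [PIPE[name] for name in instructions]
    cycles = 0
    i = 0
    while i < len(pipes):
        if i + 1 < len(pipes) and pipes[i] != pipes[i + 1]:
            i += 2  # one even-pipe and one odd-pipe instruction pair up
        else:
            i += 1  # same pipe twice in a row: no dual dispatch
        cycles += 1
    return cycles


if __name__ == "__main__":
    grouped = ["fp_single", "fp_single", "load", "load"]
    interleaved = ["fp_single", "load", "fp_single", "load"]
    print("grouped:", count_issue_cycles(grouped), "cycles")
    print("interleaved:", count_issue_cycles(interleaved), "cycles")
```

Under these assumptions the same four instructions take three cycles when grouped by type and two when interleaved, and a stream consisting only of even-pipe instructions never dispatches more than one instruction per cycle.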
{{clear}}
====SPE power and performance====
As tested by IBM under a heavy transformation and lighting workload (average IPC of 1.4), the performance profile of this implementation for a single SPU processor is characterized as follows:
{| class="wikitable" align="right" style="font-size:90%"
|+ '''Relationship of speed to temperature'''
! Voltage (V) || Frequency (GHz) || Power (W) || Die Temp (°C)
|-
| 0.9 V || 2.0 GHz || {{0}}1 W || 25
|-
| 0.9 V || 3.0 GHz || {{0}}2 W || 27
|-
| 1.0 V || 3.8 GHz || {{0}}3 W || 31
|-
| 1.1 V || 4.0 GHz || {{0}}4 W || 38
|-
| 1.2 V || 4.4 GHz || {{0}}7 W || 47
|-
| 1.3 V || 5.0 GHz || 11 W || 63
|}
The entry for 2.0 GHz operation at 0.9 V represents a low power configuration. Other entries show the peak stable operating frequency achieved with each voltage increment. As a general rule in CMOS circuits, power dissipation rises in a rough relationship to ''V''<sup>2</sup>''F'', the square of the supply voltage times the operating frequency.
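The rough scaling rule for CMOS dynamic power (proportional to the square of the voltage times the frequency) can be compared against the figures in the table. The sketch below scales the lowest row (0.9 V, 2.0 GHz, 1 W) to each of the other operating points. The function names and the choice of the first row as the reference point are assumptions made for illustration, and the model covers dynamic switching power only; leakage and its dependence on temperature are not modeled.

```python
from typing import List, Tuple

# (voltage in V, frequency in GHz, measured power in W), copied from the table.
MEASUREMENTS: List[Tuple[float, float, float]] = [
    (0.9, 2.0, 1.0),
    (0.9, 3.0, 2.0),
    (1.0, 3.8, 3.0),
    (1.1, 4.0, 4.0),
    (1.2, 4.4, 7.0),
    (1.3, 5.0, 11.0),
]


def predicted_power(v: float, f: float, ref: Tuple[float, float, float]) -> float:
    """Scale the reference row's power by (V/V0)^2 * (F/F0)."""
    v0, f0, p0 = ref
    return p0 * (v / v0) ** 2 * (f / f0)


def compare(rows: List[Tuple[float, float, float]], ref_index: int = 0):
    """Return (voltage, frequency, measured, estimated) for every row."""
    ref = rows[ref_index]
    return [(v, f, p, predicted_power(v, f, ref)) for v, f, p in rows]


if __name__ == "__main__":
    for v, f, measured, estimate in compare(MEASUREMENTS):
        print(f"{v:.1f} V  {f:.1f} GHz  measured {measured:4.1f} W  "
              f"V^2*F estimate {estimate:4.1f} W")
```

With this reference point the estimate stays below the measured figure at every higher operating point, and the gap widens as voltage and die temperature rise. Because the published power values are rounded to whole watts, the comparison is only indicative.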
Though the power measurements provided by the IBM authors lack precision, they convey a good sense of the overall trend. These figures show the part is capable of running above 5 GHz under test lab conditions.
Note that a single SPU represents 6% of the Cell processor's die area. The power figures given in the table above represent just a small portion of the overall power budget.
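The area share quoted above can be cross-checked from figures given elsewhere in this article: the 2.54 mm × 5.81 mm SPE dimensions and the DD1 and DD2 die areas, taken here as 221 and 235 mm². The sketch below is only an arithmetic check. The constant and function names are illustrative, and since the article does not state which die revision the photographed SPE belongs to, both are computed.

```python
# SPE dimensions from the SPE floorplan section (millimetres).
SPE_WIDTH_MM = 2.54
SPE_HEIGHT_MM = 5.81

# Die areas from the table of 90 nm variants (assumed to be in mm^2).
DIE_AREA_MM2 = {"DD1": 221.0, "DD2": 235.0}

# Transistor counts from the SPE floorplan section (millions).
ARRAY_TRANSISTORS_M = 14
LOGIC_TRANSISTORS_M = 7


def spe_area_mm2() -> float:
    """Area of one SPE computed from its published width and height."""
    return SPE_WIDTH_MM * SPE_HEIGHT_MM


def spe_die_fraction(die_area_mm2: float) -> float:
    """Fraction of the given die area occupied by a single SPE."""
    return spe_area_mm2() / die_area_mm2


def array_transistor_fraction() -> float:
    """Share of SPE transistors that sit in arrays rather than logic."""
    return ARRAY_TRANSISTORS_M / (ARRAY_TRANSISTORS_M + LOGIC_TRANSISTORS_M)


if __name__ == "__main__":
    print(f"SPE area: {spe_area_mm2():.2f} mm^2")
    for name, area in DIE_AREA_MM2.items():
        print(f"{name}: one SPE is {spe_die_fraction(area):.1%} of the die")
    print(f"Transistors in arrays: {array_transistor_fraction():.0%}")
```

Both results fall between 6% and 7% of the die per SPE, in line with the figure quoted above.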
IBM has publicly announced its intention to implement Cell on a future technology below the 90 nm node.
{{clear}}
====Cell at 65 nm====
The first shrink of Cell was at the 65 nm node.
On
At first it was only known that the
===Future editions in CMOS===
====Prospects at 45 nm====
At ISSCC 2008, IBM [
====Prospects beyond 45 nm====
Sony, IBM and Toshiba [
==References==
{{Reflist}}
{{Cell microprocessor segments}}
[[Category:Cell BE architecture]]