Systolic array

{{Short description|Type of parallel computing architecture of tightly coupled nodes}}
{{Use dmy dates|date=October 2019}}
In [[parallel computing|parallel]] [[computer architectures]], a '''systolic array''' is a homogeneous [[Graph (discrete mathematics)|network]] of tightly coupled [[data processing unit]]s (DPUs) called cells or [[Node (computer science)|node]]s. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in [[Colossus computer|Colossus]], an early computer used to break German [[Lorenz cipher|Lorenz]] ciphers during [[World War II]].<ref>{{YouTube |id=g2tMcMQqSbA |title=Colossus - The Greatest Secret in the History of Computing |time=41m45s}}</ref> Because Colossus remained classified, systolic arrays were later independently reinvented by [[H. T. Kung]] and [[Charles Leiserson]], who described arrays for many dense linear algebra computations on banded matrices (matrix product, solving systems of [[linear equation]]s, [[LU decomposition]], etc.). Early applications include computing [[greatest common divisor]]s of integers and polynomials.<ref>{{Cite web | url=http://www.eecs.harvard.edu/~htk/publication/1984-ieeetoc-brent-kung.pdf | title=Systolic VLSI Arrays for Polynomial GCD Computation | first1=Richard P. | last=Brent | first2=H.T. | last2=Kung | website=www.eecs.harvard.edu | date=August 1984}}</ref> Nowadays, they can be found in [[Neural processing unit|NPUs]] and [[hardware accelerator]]s based on [[spatial architecture|spatial designs]].
They are sometimes classified as [[Multiple instruction, single data|multiple-instruction single-data]] (MISD) architectures under [[Flynn's taxonomy]], but this classification is questionable, since a strong argument can be made that systolic arrays fit none of Flynn's four categories ([[Single instruction, single data|SISD]], [[Single instruction, multiple data|SIMD]], [[Multiple instruction, single data|MISD]], [[Multiple instruction, multiple data|MIMD]]), as discussed later in this article.
 
The parallel input [[data]] flows through a network of hard-wired [[Microprocessor|processor]] nodes, which combine, process, [[merge algorithm|merge]] or [[sorting algorithm|sort]] the input data into a derived result. Because the [[wave]]-like propagation of data through a systolic array resembles the [[pulse]] of the human circulatory system, the name ''systolic'' was coined from medical terminology. The name is derived from [[systole]] as an analogy to the regular pumping of blood by the heart.
 
==Applications==
Systolic arrays are often hard-wired for specific operations, such as "[[Multiply–accumulate operation|multiply and accumulate]]", to perform massively [[parallel computing|parallel]] integration, [[convolution]], [[correlation]], [[matrix multiplication]] or data sorting tasks. They are also used for [[dynamic programming]] algorithms, as in DNA and protein [[sequence analysis]].
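The multiply-and-accumulate behaviour of such an array can be illustrated with a small cycle-by-cycle software model of an output-stationary matrix-multiplication array: operands from ''A'' stream in from the left, operands from ''B'' stream in from the top (each row/column skewed by one cycle), and every cell performs one multiply–accumulate per cycle while forwarding its operands to its right and lower neighbours. This is an illustrative sketch, not a description of any particular hardware design cited here; the function name is hypothetical.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle model of an n x n output-stationary systolic array.

    Each processing element (PE) at (i, j) holds one accumulator and two
    operand registers; per cycle it forwards its A-operand rightward and
    its B-operand downward, then multiplies and accumulates.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # per-PE accumulator (stationary)
    a_reg = [[0] * n for _ in range(n)]   # A-operand currently held in PE
    b_reg = [[0] * n for _ in range(n)]   # B-operand currently held in PE
    for t in range(3 * n - 2):            # enough cycles to drain the array
        # Shift: every PE passes its operands to its right/down neighbour.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the array edges (zero-padded outside range).
        for i in range(n):
            k = t - i                     # row i enters delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j                     # column j delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Compute: every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

With the skew, PE (i, j) sees A[i][k] and B[k][j] in the same cycle (t = i + j + k), so its accumulator ends up holding the (i, j) entry of the product.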
 
==Architecture==
See <ref name=":0" /> Figure 12 for an algorithm that performs on-the-fly [[Least squares|least-squares]] computation using one- and two-dimensional systolic arrays.
 
=== Sorting ===
[[Bubble sort]] is also an example of 1D systolic computation,<ref>C.L. Britton, Jr., M.N. Ericson, and D.W. Bouldin, "A Virtual Zero-Time, Monolithic Systolic Sorting Array", 1989 https://www.osti.gov/servlets/purl/6004774</ref> although it applies N-1 passes for an array of size N. Each pass systolically moves the maximum element of a subsequence towards its final ___location in the sorted result.
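The pass structure described above can be sketched in a few lines: each of the N−1 passes is a compare-exchange wave sweeping once down the line of cells, carrying the maximum of the remaining unsorted prefix to its final position. This is a plain software rendering of the idea, not code from the cited paper; the function name is hypothetical.

```python
def systolic_bubble_sort(cells):
    """Bubble sort viewed as a 1-D systolic computation: each pass is a
    compare-exchange wave moving left to right through the line of cells."""
    cells = list(cells)
    n = len(cells)
    for p in range(n - 1):            # N-1 systolic passes
        for i in range(n - 1 - p):    # the wave visits adjacent cell pairs
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells
```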
 
Using N/2 processing elements (PEs), each with a comparator and two registers and arranged in a stack-like fashion, an array (or stream) of size N can be sorted in 2N time. Elements are pushed in one per step; at every level of the systolic stack, the maximum of the pair stored in each PE is pushed one level further down. Once all elements have been pushed in, the process reverses: the minimum element in each PE is popped out (or "pushed up"), so the stream of elements comes out sorted in ascending order.<ref>M. H. Alsuwaiyel, *Parallel Algorithms*, World Scientific, 2022, Sec. 9.5 "An On-chip Bubble Sorter" (in Ch. 9 "Systolic Computation").</ref>
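The push and pop phases of this systolic stack can be modelled in software as follows. Each PE is represented here by a small list standing in for its two registers; real hardware would step all comparators in lockstep, one push or pop per cycle. This is an illustrative sketch of the scheme described above, with a hypothetical function name, not code from the cited book.

```python
def systolic_stack_sort(stream):
    """Model of the systolic stack sorter: ceil(N/2) PEs of two registers
    each.  The push phase sinks each PE's maximum one level down per step;
    the pop phase bubbles each PE's minimum back up, so the stream emerges
    in ascending order after about 2N steps."""
    n = len(stream)
    regs = [[] for _ in range((n + 1) // 2)]   # one register pair per PE

    for x in stream:                 # --- push phase: one element per step
        carry = x
        for pe in regs:
            pe.append(carry)
            if len(pe) <= 2:         # room at this level: the wave stops
                break
            carry = max(pe)          # evict the pair's maximum downward
            pe.remove(carry)

    out = []
    for _ in range(n):               # --- pop phase: one element per step
        out.append(min(regs[0]))     # top PE emits its minimum
        regs[0].remove(out[-1])
        for i in range(len(regs) - 1):   # each level pulls a minimum up
            if regs[i + 1]:
                up = min(regs[i + 1])
                regs[i + 1].remove(up)
                regs[i].append(up)
    return out
```

After the push phase each PE holds two consecutively ranked elements (the top PE only ever evicts its maximum, so it retains the two smallest seen so far, and so on down the stack), which is why popping minima yields the sorted order.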
 
Sorting input arrays of larger size (N > P) than the number of processing elements (P) is somewhat complex to do efficiently with such a system, but can be realized (by adding an external serial processor) in O(N log N/log P) time. The serial processor needs to manage a "bucket B-tree", where each node in the [[B-tree]] has P "buckets" that are eventually each sorted in O(P) time using the PEs.<ref>Mikhail J. Atallah, Greg N. Frederickson, S. Rao Kosaraju, "Sorting with efficient use of special-purpose sorters", Information Processing Letters 1988 https://doi.org/10.1016/0020-0190(88)90075-0; also as Purdue CSD-TR 87-695 https://docs.lib.purdue.edu/cstech/602/</ref>
 
==Implementations==
* Inmos [[Transputer]]<ref name="HSP">"Systolic Arrays" in *Handbook of Signal Processing Systems* (3rd ed.), 2018, https://link.springer.com/chapter/10.1007/978-3-319-91734-4_26</ref>
* [[Cisco]] PXF network processor is internally organized as a systolic array.<ref>{{cite web|title=Cisco 10000 Series Router Performance Routing Engine Installation|url=https://www.cisco.com/en/US/products/hw/routers/ps133/prod_installation_guide09186a0080525aba.html#wp48065|access-date=3 August 2020}}</ref>
* Google’s [[Tensor Processing Unit|TPU]] is also designed around a systolic array.
* Paracel FDF4G GeneMatcher Biological (DNA and Protein) search system
* Inferentia chip at [[Amazon Web Services]]<ref>{{cite web|title=Announcing availability of Inf1 instances in Amazon SageMaker for high performance and cost-effective machine learning inference|date=14 August 2020 |url=https://aws.amazon.com/blogs/aws/amazon-ecs-now-supports-ec2-inf1-instances/|access-date=15 August 2020}}</ref>
* [[Gemmini]] systolic array-based accelerator developed at [[UC Berkeley]]<ref>{{cite book|last1=Genc|first1=Hasan|last2=Kim|first2=Seah|last3=Amid|first3=Alon|last4=Haj-Ali|first4=Ameer|last5=Iyer|first5=Vighnesh|last6=Prakash|first6=Pranav|last7=Zhao|first7=Jerry|last8=Grubb|first8=Daniel|last9=Liew|first9=Harrison|last10=Mao|first10=Howard|last11=Ou|first11=Albert|last12=Schmidt|first12=Colin|last13=Steffl|first13=Samuel|last14=Wright|first14=John|last15=Stoica|first15=Ion|last16=Ragan-Kelley|first16=Jonathan|last17=Asanovic|first17=Krste|last18=Nikolic|first18=Borivoje|last19=Shao|first19=Yakun Sophia|chapter=Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration |title=2021 58th ACM/IEEE Design Automation Conference (DAC)|year=2021|volume=|number=|pages=769–774|doi=10.1109/DAC18074.2021.9586216|arxiv=1911.09925 |isbn=978-1-6654-3274-0 }}</ref>
* [[MIT Eyeriss]] is a systolic array accelerator for convolutional neural networks.<ref>{{Cite web|title=Eyeriss Project|url=http://eyeriss.mit.edu/|access-date=2021-02-21|website=eyeriss.mit.edu}}</ref><ref>{{Cite journal|last1=Chen|first1=Yu-Hsin|last2=Emer|first2=Joel|last3=Sze|first3=Vivienne|author3-link=Vivienne Sze|date=2016-10-12|title=Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks|url=https://dl.acm.org/doi/10.1145/3007787.3001177|journal=ACM SIGARCH Computer Architecture News|language=en|volume=44|issue=3|pages=367–379|doi=10.1145/3007787.3001177|s2cid=3291270 |issn=0163-5964|url-access=subscription}}</ref>
 
==See also==
* [[WARP (systolic array)]] – systolic array computer, GE/CMU
* [[Tensor Processing Unit]] – [[AI]] accelerator [[ASIC]]
* [[Spatial architecture]] – class of computer architectures encompassing systolic arrays
 
==Notes==