{{Short description|Type of parallel computing architecture of tightly coupled nodes}}
{{Use dmy dates|date=October 2019}}
In [[parallel computing|parallel]] [[computer architectures]], a '''systolic array''' is a homogeneous [[Graph (discrete mathematics)|network]] of tightly coupled [[data processing unit]]s (DPUs) called cells or [[Node (computer science)|node]]s. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in [[Colossus computer|Colossus]], an early computer used to break German [[Lorenz cipher|Lorenz]] ciphers during [[World War II]].<ref>{{YouTube |id=g2tMcMQqSbA |title=Colossus - The Greatest Secret in the History of Computing |time=41m45s}}</ref> Due to the classified nature of Colossus, they were independently invented or rediscovered by [[H. T. Kung]] and [[Charles Leiserson]], who described arrays for many dense linear algebra computations (matrix product, solving systems of [[linear equation]]s, [[LU decomposition]], etc.) for banded matrices. Early applications include computing [[greatest common divisor]]s of integers and polynomials.<ref>{{Cite web |url=http://www.eecs.harvard.edu/~htk/publication/1984-ieeetoc-brent-kung.pdf |title=Systolic VLSI Arrays for Polynomial GCD Computation |first1=Richard P. |last1=Brent |first2=H.T. |last2=Kung |website=www.eecs.harvard.edu |date=August 1984}}</ref> Nowadays, they can be found in [[Neural processing unit|NPUs]] and [[hardware accelerator]]s based on [[spatial architecture|spatial designs]]. They are sometimes classified as [[Multiple instruction, single data|multiple-instruction single-data]] (MISD) architectures under [[Flynn's taxonomy]], but this classification is questionable, because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories ([[Single instruction, single data|SISD]], [[Single instruction, multiple data|SIMD]], [[Multiple instruction, single data|MISD]], [[Multiple instruction, multiple data|MIMD]]), as discussed later in this article.
 
The parallel input [[data]] flows through a network of hard-wired [[Microprocessor|processor]] nodes, which combine, process, [[merge algorithm|merge]] or [[sorting algorithm|sort]] the input data into a derived result. Because the [[wave]]-like propagation of data through a systolic array resembles the [[pulse]] of the human circulatory system, the name ''systolic'' was coined from medical terminology: it derives from [[systole]], by analogy with the regular pumping of blood by the heart.
 
==Applications==
Systolic arrays are often hard-wired for specific operations, such as "[[Multiply–accumulate operation|multiply and accumulate]]", to perform massively [[parallel computing|parallel]] integration, [[convolution]], [[correlation]], [[matrix multiplication]] or data sorting tasks. They are also used for the [[dynamic programming]] algorithms used in DNA and protein [[sequence analysis]].
 
==Architecture==
A systolic array typically consists of a large [[monolithic system|monolithic]] [[Graph (discrete mathematics)|network]] of primitive computing [[Node (computer science)|node]]s, which can be hardwired or software-configured for a specific application. The nodes are usually fixed and identical, while the interconnect is programmable. The more general '''wavefront''' processors, by contrast, employ sophisticated and individually programmable nodes which may or may not be monolithic, depending on the array size and design parameters. The other distinction is that systolic arrays rely on [[synchronous]] data transfers, while wavefront arrays tend to work [[wikt:asynchronous|asynchronous]]ly.
 
Unlike the more common [[Von Neumann architecture]], where program execution follows a script of instructions stored in common memory, [[address space|addressed]] and sequenced under the control of the [[CPU]]'s [[program counter]] (PC), the individual nodes within a systolic array are triggered by the arrival of new data and always process the data in exactly the same way. The actual processing within each node may be hardwired or block [[microcode|microcoded]], in which case the common node personality can be block-programmable.
 
The systolic array paradigm, with data streams driven by data [[Counter (digital)|counter]]s, is the counterpart of the Von Neumann architecture, with an instruction stream driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports [[data parallelism]].
==Goals and benefits==
A major benefit of systolic arrays is that all operand data and partial results are stored within (passing through) the processor array. There is no need to access external buses, main memory or internal caches during each operation as is the case with Von Neumann or [[Harvard architecture|Harvard]] sequential machines. The sequential limits on [[parallel computing|parallel]] performance dictated by [[Amdahl's Law]] also do not apply in the same way, because data dependencies are implicitly handled by the programmable [[Node (computer science)|node]] interconnect and there are no sequential steps in managing the highly parallel data flow.
 
Systolic arrays are therefore extremely good at artificial intelligence, image processing, pattern recognition, computer vision and other tasks that animal brains do particularly well. Wavefront processors in general can also be very good at machine learning by implementing self-configuring neural nets in hardware.
 
==Classification controversy==
While systolic arrays are officially classified as [[Multiple instruction, single data|MISD]], their classification is somewhat problematic. Because the input is typically a vector of independent values, the systolic array is definitely not [[Single instruction, single data|SISD]]. Since these [[input (computer science)|input]] values are merged and combined into the result(s) and do not maintain their [[independence]] as they would in a [[Single instruction, multiple data|SIMD]] vector processing unit, the [[array data structure|array]] cannot be classified as such. Consequently, the array cannot be classified as [[Multiple instruction, multiple data|MIMD]] either, because MIMD can be viewed as a mere collection of smaller SISD and [[Single instruction, multiple data|SIMD]] machines.
 
Finally, because the data [[swarm]] is transformed as it passes through the array from [[Node (computer science)|node]] to node, the multiple nodes are not operating on the same data, which makes the MISD classification a [[misnomer]]. The other reason why a systolic array should not qualify as a '''MISD''' is the same as the one which disqualifies it from the SISD category: the input data is typically a vector, not a '''s'''ingle '''d'''ata value, although one could argue that any given input vector is a single item of data.
 
In spite of all of the above, systolic arrays are often offered as a classic example of MISD architecture in textbooks on [[parallel computing]] and in engineering classes. If the array is viewed from the outside as [[atomic operation|atomic]], it should perhaps be classified as '''SFMuDMeR''' = single function, multiple data, merged result(s).
 
Systolic arrays use a pre-defined computational flow graph that connects their nodes. [[Kahn process networks]] use a similar flow graph, but are distinguished by the nodes working in lock-step in the systolic array: in a Kahn network, there are FIFO queues between each node.
==Detailed description==
A systolic array is composed of matrix-like rows of [[data processing unit]]s called cells. Data processing units (DPUs) are similar to [[central processing unit]]s (CPUs), except for the usual lack of a [[program counter]],<ref>The Paracel GeneMatcher series of systolic array processors do have a [[program counter]]. More complicated algorithms are implemented as a series of simple steps, with shifts specified in the instructions.</ref> since operation is [[transport triggered architecture|transport-triggered]], i.e., by the arrival of a data object. Each cell shares information with its neighbours immediately after processing. The systolic array is often rectangular, with data flowing across the array between neighbouring [[data processing unit|DPU]]s, often with different data flowing in different directions. The data streams entering and leaving the ports of the array are generated by auto-sequencing memory (ASM) units. Each ASM includes a data [[Counter (digital)|counter]]. In [[embedded system]]s a data stream may also be input from and/or output to an external source.
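
The execution model can be sketched in a few lines of Python (an illustrative toy, not taken from any real implementation; the names <code>Cell</code> and <code>asm</code> are hypothetical): a node fires as soon as a data object is present on each of its input queues, with no program counter involved, while a generator plays the role of an auto-sequencing memory whose data counter steps through memory.

<syntaxhighlight lang="python">
from collections import deque

class Cell:
    """Toy transport-triggered node: it has no program counter and fires
    only when a data object is present on each of its input queues."""
    def __init__(self, fn):
        self.fn = fn
        self.a, self.b, self.out = deque(), deque(), deque()

    def step(self):
        if self.a and self.b:                    # triggered by data arrival
            self.out.append(self.fn(self.a.popleft(), self.b.popleft()))

def asm(memory):
    """Auto-sequencing memory: a data counter steps through memory,
    emitting one operand per cycle."""
    yield from memory

mul = Cell(lambda x, y: x * y)                   # a single multiplier node
for x, y in zip(asm([1, 2, 3]), asm([10, 20, 30])):
    mul.a.append(x)                              # operands stream in ...
    mul.b.append(y)
    mul.step()                                   # ... and the node fires
print(list(mul.out))                             # [10, 40, 90]
</syntaxhighlight>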
 
{{multiple image
| width = 380
| direction = vertical
| header = Examples of 2×2 matrix multiplication in a systolic array
| image1 = Output Stationary Systolic Array Example.png
| caption1 = Systolic array algorithm accumulating output values inside [[data processing unit|DPU]]s.
| image2 = Weights Stationary Systolic Array Example.png
| caption2 = Systolic array algorithm pre-loading and keeping one operand stationary inside [[data processing unit|DPU]]s while computing. In the example, the green matrix is pre-loaded in the array and can be reused for subsequent multiplications.
}}
 
An example of a systolic [[algorithm]] might be designed for [[matrix multiplication]]. One [[matrix (mathematics)|matrix]] is fed in a row at a time from the top of the array and is passed down the array; the other matrix is fed in a column at a time from the left-hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.<ref>{{Cite web |url=http://web.cecs.pdx.edu/~mperkows/temp/May22/0020.Matrix-multiplication-systolic.pdf |title=Systolic Array Matrix Multiplication}}</ref>
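
The timing of this scheme can be illustrated with a small simulation, given here as a Python sketch (the function name <code>systolic_matmul</code> and the output-stationary schedule are illustrative assumptions, not a description of specific hardware): the operand streams are skewed so that <math>A_{i,k}</math> and <math>B_{k,j}</math> meet in cell <math>(i,j)</math> at cycle <math>i+j+k</math>, where they are multiplied and accumulated.

<syntaxhighlight lang="python">
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary grid of cells: every
    cell (i, j) owns one accumulator, and the skewed input streams deliver
    A[i, k] and B[k, j] to it together at cycle t = i + j + k."""
    n, m = A.shape
    p = B.shape[1]
    acc = np.zeros((n, p))              # results stay inside the cells
    for t in range(n + m + p - 2):      # enough cycles to drain the skew
        for i in range(n):
            for j in range(p):
                k = t - i - j           # operand index reaching (i, j) now
                if 0 <= k < m:
                    acc[i, j] += A[i, k] * B[k, j]
    return acc                          # read out a row or column at a time

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert (systolic_matmul(A, B) == A @ B).all()
</syntaxhighlight>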
 
Systolic arrays are arrays of [[data processing unit|DPU]]s which are connected to a small number of nearest-neighbour DPUs in a mesh-like topology. DPUs perform a sequence of operations on data that flows between them. Because traditional systolic array synthesis methods have been based on algebraic transformations, only uniform arrays with linear pipes can be obtained, so that the architectures are the same in all DPUs. As a consequence, only applications with regular data dependencies can be implemented on classical systolic arrays. Like [[Single instruction, multiple data|SIMD]] machines, clocked systolic arrays compute in "lock-step", with each processor undertaking alternating compute and communicate phases. Systolic arrays with asynchronous handshaking between DPUs are instead called ''wavefront arrays''.
One well-known systolic array is Carnegie Mellon University's [[iWarp]] processor, which was manufactured by Intel. An iWarp system has a linear array processor connected by data buses going in both directions.
 
==History==
Systolic arrays (also known as ''wavefront processors'') were first described by [[H. T. Kung]] and [[Charles E. Leiserson]], who published the first paper describing systolic arrays in 1979. However, the first machine known to have used a similar technique was the [[Colossus computer|Colossus Mark II]] in 1944.
 
==Examples==

=== Polynomial evaluation ===
 
[[Horner's rule]] for evaluating a polynomial is:
 
<math>y = \sum_{i=0}^{n} a_i x^i = a_0 + x \bigl( a_1 + x \bigl( a_2 + \cdots + x ( a_{n-1} + x \, a_n ) \cdots \bigr) \bigr)</math>
This can be computed by a linear systolic array in which the processors are arranged in pairs: one multiplies its input by <math>x</math> and passes the result to the right, the next adds <math>a_j</math> and passes the result to the right.
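
A functional model of this chain can be written in a few lines of Python (a sketch of the pairing just described, not a cycle-accurate simulation; the name <code>horner_pipeline</code> is made up for illustration): the value streaming rightward is alternately multiplied by <math>x</math> and incremented by the next coefficient, which is exactly Horner's rule.

<syntaxhighlight lang="python">
def horner_pipeline(coeffs, x):
    """coeffs = [a_n, ..., a_1, a_0], highest-degree coefficient first."""
    y = 0
    for a in coeffs:      # each pair of processors in the chain:
        y = y * x         # multiplier cell: multiply by x, pass right
        y = y + a         # adder cell: add the coefficient, pass right
    return y

assert horner_pipeline([2, -3, 5], x=4) == 2 * 4**2 - 3 * 4 + 5  # = 25
</syntaxhighlight>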
 
=== Convolution ===
Consider a chain of processing elements (PEs), each performing a [[Multiply–accumulate operation|multiply-accumulate operation]]. It processes input data (<math>x_i</math>) and weights (<math>w_i</math>) systolically, meaning that data flows through the array in a regular, rhythmic manner. The weights remain stationary within each PE, while the input data and partial sums (<math>y_i</math>) move in opposite directions.

Each PE performs the following operation:<math display="block">
\begin{aligned}
y_{out} &= y_{in} + w \cdot x_{in} \\
x_{out} &= x_{in}
\end{aligned}
</math>where:
 
* <math>x_{in}</math> is the input data.
* <math>y_{in}</math> is the incoming partial sum.
* <math>w</math> is the weight stored in the PE.
* <math>x_{out}</math> is the output data (passed to the next PE).
* <math>y_{out}</math> is the updated partial sum.
 
From the left, the input stream is <math>
\dots, x_3, 0, x_2, 0, x_1
</math>, and from the right, the output stream is <math>
y_1, y_2, y_3, \dots
</math>. If <math>
y_1, x_1
</math> enter the rightmost PE simultaneously, then the leftmost PE outputs<math display="block">
\begin{aligned}
y_1 &= w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots \\
y_2 &= w_1 x_2 + w_2 x_3 + w_3 x_4 + \cdots \\
&\vdots
\end{aligned}
</math>This is the 1-dimensional convolution. Similarly, n-dimensional convolution can be computed by an n-dimensional array of PEs.
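
One concrete arrangement consistent with these equations can be simulated as follows (an illustrative Python sketch; the reversed weight placement and the name <code>systolic_conv1d</code> are assumptions, not a description of a particular machine). The weights stay put, <math>x</math> values shift one PE to the right per cycle, partial sums shift one PE to the left, and a new input is injected only every other cycle so that each <math>y</math> meets each <math>x</math> exactly once.

<syntaxhighlight lang="python">
def systolic_conv1d(w, xs):
    K = len(w)
    wr = list(reversed(w))              # PE i permanently holds w[K-1-i]
    x_reg = [0.0] * K                   # x value currently in each PE
    y_reg = [0.0] * K                   # partial sum currently in each PE
    stream = [v for x in xs for v in (x, 0.0)]   # x1, 0, x2, 0, x3, ...
    outputs = []
    for t in range(len(stream) + 2 * K):
        x_in = stream[t] if t < len(stream) else 0.0
        x_reg = [x_in] + x_reg[:-1]     # x stream shifts one PE rightward
        outputs.append(y_reg[0])        # leftmost PE emits a value
        y_reg = y_reg[1:] + [0.0]       # y stream shifts one PE leftward
        for i in range(K):              # every PE multiply-accumulates
            y_reg[i] += wr[i] * x_reg[i]
    # finished sums leave the left end on every other cycle
    return outputs[2 * K - 1::2][:len(xs) - K + 1]

assert systolic_conv1d([1, 2], [1, 2, 3]) == [5, 8]  # w1*x1 + w2*x2, ...
</syntaxhighlight>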
 
Many other implementations of 1D convolution are available, with different data flows.<ref name=":0">{{Cite journal |last=Kung |date=January 1982 |title=Why systolic architectures? |url=https://www.semanticscholar.org/paper/Why-systolic-architectures-Kung/256bbe8e9fa3f5b72c24f1037ab734f9e7dd01c4 |journal=Computer |volume=15 |issue=1 |pages=37–46 |doi=10.1109/MC.1982.1653825 |issn=0018-9162}}</ref>
 
See <ref name=":0" /> Figure 12 for an algorithm that performs on-the-fly [[Least squares|least-squares]] using one- and two-dimensional systolic arrays.
 
=== Sorting ===
[[Bubble sort]] is also an example of 1D systolic computation,<ref>C.L. Britton, Jr., M.N. Ericson, and D.W. Bouldin, "A Virtual Zero-Time, Monolithic Systolic Sorting Array", 1989 https://www.osti.gov/servlets/purl/6004774</ref> although it requires N−1 passes for an array of size N. Each pass systolically moves the maximum element of a subsequence towards its final ___location in the sorted result.
 
If one is willing to use N/2 processing elements (PEs), each with a comparator and two registers and arranged in a stack-like fashion, an array (or stream) of size N can be sorted in 2N time: as the elements are pushed in, at every level of the systolic stack the larger of the pair of elements stored in each PE is pushed further down. After all the elements have been pushed in, the process is reversed, with the minimum element in each PE being popped out (or "pushed up"), so the stream of elements comes out sorted in ascending order.<ref>M. H. Alsuwaiyel, *Parallel Algorithms*, World Scientific, 2022, Sec. 9.5 "An On-chip Bubble Sorter" (in Ch. 9 "Systolic Computation").</ref>
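
The push/pop behaviour of this sorter can be modelled functionally in Python (a sketch that reproduces the data movement but is not cycle-accurate; the name <code>systolic_stack_sort</code> is hypothetical):

<syntaxhighlight lang="python">
def systolic_stack_sort(values):
    """Each PE holds a pair of values: pushes ripple the larger value of
    a pair downward, pops return the smaller values upward in order."""
    pes = [[] for _ in range((len(values) + 1) // 2)]

    for v in values:                    # push phase
        for pe in pes:
            pe.append(v)
            if len(pe) <= 2:            # this PE still had room
                break
            v = max(pe)                 # overflow: the maximum moves down
            pe.remove(v)

    result = []
    for _ in values:                    # pop phase
        m = min(pes[0])                 # minimum of the top PE is next
        pes[0].remove(m)
        result.append(m)
        for i in range(len(pes) - 1):   # refill: pull each minimum upward
            if len(pes[i]) < 2 and pes[i + 1]:
                up = min(pes[i + 1])
                pes[i + 1].remove(up)
                pes[i].append(up)
    return result

assert systolic_stack_sort([4, 1, 3, 2]) == [1, 2, 3, 4]
</syntaxhighlight>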
 
Sorting input arrays whose size N exceeds the number of processing elements P is somewhat complex to do efficiently with such a system, but can be realized (by adding an external serial processor) in O(N log N/log P) time. The serial processor needs to manage a "bucket B-tree", where each node in the [[B-tree]] has P "buckets" that are eventually each sorted in O(P) time using the PEs.<ref>Mikhail J. Atallah, Greg N. Frederickson, S. Rao Kosaraju, "Sorting with efficient use of special-purpose sorters", Information Processing Letters 1988 https://doi.org/10.1016/0020-0190(88)90075-0; also as Purdue CSD-TR 87-695 https://docs.lib.purdue.edu/cstech/602/</ref>
 
==Implementations==
* Inmos [[Transputer]]<ref name="HSP">{{cite book |chapter=Systolic Arrays |title=Handbook of Signal Processing Systems |edition=3rd |year=2018 |url=https://link.springer.com/chapter/10.1007/978-3-319-91734-4_26}}</ref>
* [[Cisco]] PXF network processor is internally organized as a systolic array.<ref>{{cite web|title=Cisco 10000 Series Router Performance Routing Engine Installation|url=https://www.cisco.com/en/US/products/hw/routers/ps133/prod_installation_guide09186a0080525aba.html#wp48065|access-date=3 August 2020}}</ref>
* Google's [[Tensor Processing Unit|TPU]] is also designed around a systolic array.
* Paracel FDF4T TestFinder text search system<ref name="FDF4">{{cite web|title=About Paracel|url=http://brandprosgroup.com/pages/first/websites/paracel/data/html/about_paracel2.html|website=brandprosgroup.com|publisher=Paracel|access-date=4 May 2018}}</ref>
* Paracel FDF4G GeneMatcher biological (DNA and protein) search system
* Inferentia chip at [[Amazon Web Services]]<ref>{{cite web|title=Announcing availability of Inf1 instances in Amazon SageMaker for high performance and cost-effective machine learning inference|date=14 August 2020 |url=https://aws.amazon.com/blogs/aws/amazon-ecs-now-supports-ec2-inf1-instances/|access-date=15 August 2020}}</ref>
* [[Gemmini]] systolic array-based accelerator developed at [[UC Berkeley]]<ref>{{cite book|last1=Genc|first1=Hasan|last2=Kim|first2=Seah|last3=Amid|first3=Alon|last4=Haj-Ali|first4=Ameer|last5=Iyer|first5=Vighnesh|last6=Prakash|first6=Pranav|last7=Zhao|first7=Jerry|last8=Grubb|first8=Daniel|last9=Liew|first9=Harrison|last10=Mao|first10=Howard|last11=Ou|first11=Albert|last12=Schmidt|first12=Colin|last13=Steffl|first13=Samuel|last14=Wright|first14=John|last15=Stoica|first15=Ion|last16=Ragan-Kelley|first16=Jonathan|last17=Asanovic|first17=Krste|last18=Nikolic|first18=Borivoje|last19=Shao|first19=Yakun Sophia|chapter=Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration |title=2021 58th ACM/IEEE Design Automation Conference (DAC)|year=2021|volume=|number=|pages=769–774|doi=10.1109/DAC18074.2021.9586216|arxiv=1911.09925 |isbn=978-1-6654-3274-0 }}</ref>
* [[MIT Eyeriss]] is a systolic array accelerator for convolutional neural networks.<ref>{{Cite web|title=Eyeriss Project|url=http://eyeriss.mit.edu/|access-date=2021-02-21|website=eyeriss.mit.edu}}</ref><ref>{{Cite journal|last=Chen|first=Yu-Hsin|last2=Emer|first2=Joel|last3=Sze|first3=Vivienne|date=2016-10-12|title=Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks|url=https://dl.acm.org/doi/10.1145/3007787.3001177|journal=ACM SIGARCH Computer Architecture News|language=en|volume=44|issue=3|pages=367–379|doi=10.1145/3007787.3001177|issn=0163-5964}}</ref>
 
==See also==
* [[Multiple instruction, single data|MISD]] – multiple instruction, single data; example: systolic arrays
* [[iWarp]] – systolic array computer, VLSI, Intel/CMU
* [[WARP (systolic array)]] – systolic array computer, GE/CMU
* [[Tensor Processing Unit]] – [[AI]] accelerator [[ASIC]]
* [[Spatial architecture]] – class of computer architectures encompassing systolic arrays
 
==Notes==
{{Reflist}}
 
==External links==
* {{cite journal | last1=Dorband|first1=Ernst Nils|last2=Hemsendorf|first2=Marc|last3=Merritt|first3=David|author3-link = David Merritt| title = Systolic and hyper-systolic algorithms for the gravitational n-body problem, with an application to Brownian motion| journal = Journal of Computational Physics| volume = 185|issue=2| pages = 484–511| date = March 2003|url=https://www.sciencedirect.com/science/article/abs/pii/S0021999102000670}}
* [http://www.iti.fh-flensburg.de/lang/papers/isa/index.htm ''Instruction Systolic Array (ISA)'']
* [https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4292156 'A VLSI Architecture for Image Registration in Real Time' (Based on systolic array), Vol. 15, September 2007]
 
{{DEFAULTSORT:Systolic Array}}