Reduce (parallel pattern): Difference between revisions

Content deleted Content added
Flogr (talk | contribs)
Flogr (talk | contribs)
No edit summary
Line 57:
The binomial tree and the pipeline both have their advantages and disadvantages, depending on the values of <math>T_{start}</math> and <math>T_{byte}</math> for the parallel communication. While the binomial tree algorithm is better suited for small vectors, the pipelined algorithm profits from a distribution of the elements to fewer processing units with more elements contained in one vector. Both approaches can be combined into one algorithm<ref>{{Cite journal|last=Sanders|first=Peter|last2=Sibeyn|first2=Jop F|title=A bandwidth latency tradeoff for broadcast and reduction|url=https://doi.org/10.1016/S0020-0190(02)00473-8|journal=Information Processing Letters|volume=86|issue=1|pages=33–38|doi=10.1016/s0020-0190(02)00473-8}}</ref> which uses a tree as its underlying communication pattern and splits the computation of the operator into pieces at the same time. Instead of the binomial tree, a Fibonacci tree is used which has the property that the height of the trees rooted at its two children differ by one. It helps to balance the load on all processing units as each unit can only evaluate one operator in one iteration on one of its elements, but it has two child-processors it receives values from.
 
=== Algorithm description ===
The animation shows the execution of such an algorithm in a full-duplex communication model. The first frame shows the Fibonacci-tree which describes the communication links that can be used, afterwards blue arrows indicate the transmission of elements. An element that is received by a processor is added to the already existing element at the same index of the vector. A processing node is depicted by three neighboring boxes with elements inside and receives elements from its two children in turn (assuming there are children with valid values). The runtime is <math>T(N,p,m) \approx (\frac{N}{m}T_{byte} + T_{start})(d + 2m - 2)</math>, where <math>d = log_{\phi}(p)</math> is the height of the tree and <math>\phi = \frac{1 + \sqrt{5}}{2}</math> the golden ratio.
The animation shows the execution of such an algorithm in a full-duplex communication model. Communication links are represented by black lines between the vectors of elements and build a Fibonacci tree of size seven in this example. When an element is send to another processing unit the link is colored with the color of the corresponding element. An element that is received by a processor is added to the already existing element of same color (at the same index in the vector).
 
The algorithm itself
 
The runtime is <math>T(N,p,m) \approx (\frac{N}{m}T_{byte} + T_{start})(d + 2m - 2)</math>, where <math>d = log_{\phi}(p)</math> is the height of the tree and <math>\phi = \frac{1 + \sqrt{5}}{2}</math> the golden ratio.
 
== Applications ==