Reduce (parallel pattern)

:::: '''receive''' <math>x_{i-1}^{k+i-1}</math> '''from''' <math>p_{i-1}</math>
:::: <math>x_{i}^{k+i-1} \gets x_{i}^{k+i-1} \oplus x_{i-1}^{k+i-1}</math>
It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at <math>p_{p-1}</math> at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution is <math>p + m -2</math>: it takes <math>p-1</math> steps until the last processing unit receives its first element and an additional <math>m-1</math> steps until all elements are received. Therefore the runtime in the BSP-model is <math>T(N,p,m) = (T_{start} + \frac{N}{m}T_{byte})(p+m-2)</math>, where <math>N</math> is the total number of elements. <math>N</math> is incorporated into the formula because it is possible to adapt the granularity in which the operator is computed. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and the last two elements, which are always transmitted and computed together. In this case, double the volume is sent each step, but the number of steps is roughly halved. Effectively, the algorithm handles the example instance as if it were an instance with vectors of size two. The runtime <math>T(p)</math> for this approach depends on the value of <math>N</math>, which can be optimized if <math>T_{start}</math> and <math>T_{byte}</math> are known. It is optimal for <math>N = \sqrt{\frac{m(p-2)T_{byte}}{T_{start}}}</math>.
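The pipelined chain can be simulated to check the step count. The sketch below is not part of the article; it models <math>p</math> units in a chain, where in parallel step <math>t</math> unit <math>p_i</math> forwards element <math>t-i</math> to <math>p_{i+1}</math> (the function and variable names are invented for illustration):

```python
# Sketch: simulate a pipelined chain reduce. Unit p_i forwards element
# k = t - i to p_{i+1} in parallel step t, so p_{i+1} only ever receives
# an element that already contains the contributions of p_0 .. p_i.
from operator import add

def pipelined_reduce(vectors, op=add):
    """vectors: p vectors of length m. Returns (result at p_{p-1}, #steps)."""
    p, m = len(vectors), len(vectors[0])
    acc = [list(v) for v in vectors]      # acc[i]: partial sums held by p_i
    steps = 0
    for t in range(p + m - 2):            # p + m - 2 parallel steps in total
        # collect all concurrent sends of this step before applying them
        sends = [(i, t - i) for i in range(p - 1) if 0 <= t - i < m]
        for i, k in sends:
            acc[i + 1][k] = op(acc[i + 1][k], acc[i][k])
        steps += 1
    return acc[p - 1], steps
```

Running it on the animation's parameters (five units, vectors of size four) gives seven parallel steps, matching <math>p + m - 2 = 5 + 4 - 2</math>.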
 
== Pipelined tree ==
The animation shows the execution of such an algorithm in a full-duplex communication model. Communication links are represented by black lines between the vectors of elements and form a Fibonacci tree of size seven in this example. When an element is sent to another processing unit, the link is colored with the color of the corresponding element. An element that is received by a processor is added to the already existing element of the same color (at the same index in the vector).
 
The algorithm itself propagates the partial sums from bottom to top until all elements are contained in the sum at the root processor on top. In the first step of the execution, the processing units which are leaves in the underlying tree send their first elements to their parents. This is similar to the send operations of the binomial tree algorithm, with the key difference that the leaf units each have two more elements which have to be sent and therefore do not become inactive, but can continue to send elements, which is analogous to the pipelined approach and improves efficiency. Processing units that are not leaves start to send their elements in order of the indices in the vector once they have received an element from a child. In the example they send green, blue and red elements in this order. If two processors compete to send their elements to the same processor, then the element of the right child is received first. Because of the structure of the Fibonacci tree, all processors send or receive elements while the pipeline is filled. The pipeline is full from the point where each unit has received an element until the leaf units have no more elements to send.
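The size of such a tree can be sketched with a short recurrence. The base cases and recurrence below are an assumption consistent with the article's "size seven" example: a Fibonacci tree of height <math>h</math> has a root whose subtrees are Fibonacci trees of heights <math>h-1</math> and <math>h-2</math>.

```python
# Sketch (assumption, not from the article): node count of a Fibonacci
# tree of height h, where the root's subtrees have heights h-1 and h-2.
# Base cases: a single node for h = 0 and two nodes for h = 1.
import math

def fib_tree_size(h):
    if h == 0:
        return 1
    if h == 1:
        return 2
    return 1 + fib_tree_size(h - 1) + fib_tree_size(h - 2)

phi = (1 + math.sqrt(5)) / 2          # golden ratio

sizes = [fib_tree_size(h) for h in range(6)]   # 1, 2, 4, 7, 12, 20, ...
```

Under this recurrence the example tree of size seven has height three, and the height grows roughly like <math>\log_{\phi}</math> of the number of processors.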
The algorithm itself
 
=== Runtime ===
The runtime is <math>T(N,p,m) \approx (\frac{N}{m}T_{byte} + T_{start})(d + 2m - 2)</math>, where <math>d = \log_{\phi}(p)</math> is the height of the tree and <math>\phi = \frac{1 + \sqrt{5}}{2}</math> is the golden ratio.
Each iteration of the algorithm takes at most time <math>\frac{N}{m}T_{byte} + T_{start}</math>. The height of the tree determines the time needed to fill the pipeline, and for Fibonacci trees it is known to be about <math>d = \log_{\phi}(p)</math>, where <math>\phi = \frac{1 + \sqrt{5}}{2}</math> is the golden ratio. Once the pipeline is filled, all processors are active in each step. Because inner nodes have two children, they have to receive <math>2m</math> messages. Therefore the runtime of the algorithm is <math>T(N,p,m) \approx (\frac{N}{m}T_{byte} + T_{start})(d + 2m - 2)</math>. It is minimal if the granularity <math>m</math> is chosen such that <math>m = \sqrt{\frac{N(d-3)T_{byte}}{3 T_{start}}}</math>, where <math>N</math> is the total number of elements.
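The trade-off in <math>m</math> (more pieces shorten the pipeline fill cost per message but add start-up latencies) can also be checked numerically. The sketch below brute-forces the minimizing <math>m</math> of the runtime model rather than relying on a closed form; the parameter values are invented for illustration:

```python
# Sketch: evaluate the pipelined-tree runtime model
#   T(N, p, m) ~ (N/m * T_byte + T_start) * (d + 2m - 2)
# and find the integer granularity m that minimizes it by brute force.
# The parameter values below are made up for illustration.
import math

PHI = (1 + math.sqrt(5)) / 2

def runtime(N, p, m, T_start, T_byte):
    d = math.log(p, PHI)              # tree height, about log_phi(p)
    return (N / m * T_byte + T_start) * (d + 2 * m - 2)

N, p, T_start, T_byte = 1024, 64, 10.0, 1.0
best_m = min(range(1, 200), key=lambda m: runtime(N, p, m, T_start, T_byte))
```

For these example values the brute-force optimum lies near <math>m \approx 18</math>; small changes in <math>T_{start}</math> or <math>T_{byte}</math> shift it, which is why the optimization requires knowing both constants.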
 
== Applications ==