Reduce (parallel pattern): Difference between revisions

Content deleted Content added
Urbic (talk | contribs)
m Typo fixing, replaced: ,m → , m (9), ,p → , p (3), halfed → halved
Line 20:
:::: '''else if''' <math>i + 2^k < p</math>
::::: <math>x_i \gets x_i \oplus^{\star} x_{i+2^k}</math>
The binary operator for vectors is defined such that <math>\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1}\end{pmatrix} \oplus^\star \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1}\end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}</math>. The algorithm further assumes that in the beginning <math>x_i = v_i</math> for all <math>i</math> and <math>p</math> is a power of two and uses the processing units <math>p_0, p_1,\dots p_{n-1}</math>. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line take place. The eight input elements are located on the bottom and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor <math>p_i</math> evaluates the given operator on the element <math>x_i</math> it is currently holding and <math>x_j</math> where <math>j</math> is the minimal index fulfilling <math>j > i</math>, so that <math>p_j</math> is becoming an inactive processor in the current step. <math>x_i</math> and <math>x_j</math> are not necessarily elements of the input set <math>X</math> as the fields are overwritten and reused for previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the fact that the processing units are indexed with numbers from <math>0</math> to <math>p-1</math> is used. Each processor looks at its <math>k</math>-th least significant bit and decides whether to get inactive or compute the operator on its own element and the element with the index where the <math>k</math>-th bit is not set. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
 
Only <math>p_0</math> holds the result in the end, therefore it is the root processor. For an Allreduce-operation the result has to be distributed, which can be done by appending a broadcast from <math>p_0</math>. Furthermore, the number <math>p</math> of processors is restricted to be a power of two. This can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use-case.<ref>{{Cite book|last=Rabenseifner|first=Rolf|last2=Träff|first2=Jesper Larsson|date=2004-09-19|title=More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems|journal=Recent Advances in Parallel Virtual Machine and Message Passing Interface|volume=3241|series=Lecture Notes in Computer Science|language=en|publisher=Springer, Berlin, Heidelberg|pages=36–46|doi=10.1007/978-3-540-30218-6_13|isbn=9783540231639}}</ref>
 
==== Runtime analysis ====
The main loop is executed <math>\lceil\log_2 p\rceil</math> times, the time needed for the part done in parallel is in <math>\mathcal{O}(m)</math> as a processing unit either combines two vectors or becomes inactive. Thus the parallel time <math>T(p, m)</math> for the PRAM is <math>T(p, m) = \mathcal{O}(\log(p) \cdot m)</math>. The strategy for handling read and write conflicts can be chosen as restrictive as an exclusive read and exclusive write (EREW). The efficiency <math>S(p, m)</math> of the algorithm is <math>S(p, m) \in \mathcal{O}(\frac{T_{seq}}{T(p, m)}) = \mathcal{O}(\frac{p}{\log(p)})</math> and therefore the efficiency is <math>E(p, m) \in \mathcal{O}(\frac{S(p, m)}{p}) = \mathcal{O}(\frac{1}{\log(p)})</math>. The efficiency suffers because of the fact that half of the active processing units become inactive after each step, so <math>\frac{p}{2^i}</math> units are active in step <math>i</math>.
 
=== Distributed memory algorithm ===
Line 53:
:::: '''receive''' <math>x_{i-1}^{k+i-1}</math> '''from''' <math>p_{i-1}</math>
:::: <math>x_{i}^{k+i-1} \gets x_{i}^{k+i-1} \oplus x_{i-1}^{k+i-1}</math>
It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at <math>p_{p-1}</math> at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution are <math>p + m -2</math>, it takes <math>p-1</math> steps until the last processing unit receives its first element and additional <math>m-1</math> until all elements are received. Therefore, the runtime in the BSP-model is <math>T(n, p, m) = (T_{start} + \frac{n}{m}T_{byte})(p+m-2)</math>, assuming that <math>n</math> is the total byte-size of a vector.
 
Although <math>m</math> has a fixed value, it is possible to logically group elements of a vector together and reduce <math>m</math>. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is send each step, but the number of steps has roughly halved. It means that the parameter <math>m</math> is halfedhalved, while the total byte-size <math>n</math> stays the same. The runtime <math>T(p)</math> for this approach depends on the value of <math>m</math>, which can be optimized if <math>T_{start}</math> and <math>T_{byte}</math> are known. It is optimal for <math>m = \sqrt{\frac{n \cdot (p-2)T_{byte}}{T_{start}}}</math>, assuming that this results in a smaller <math>m</math> that divides the original one.
 
== Pipelined tree ==
Line 67:
 
=== Runtime ===
Each iteration of the algorithm takes at most time <math>\frac{n}{m} \cdot T_{byte} + T_{start}</math>. The height of the tree factors into the time it needs to fill the pipeline and for Fibonacci trees it is known to be about <math>h = log_{\phi}p</math> where <math>\phi = \frac{1 + \sqrt{5}}{2}</math> is the golden ratio. Once the pipeline is filled, all processors are active in each step. Because inner nodes have two children, they have to receive <math>2 \cdot m</math> elements. Therefore, the runtime of the algorithm is <math>T(n, p, m) \approx (\frac{n}{m} \cdot T_{byte} + T_{start})(h + 2 \cdot k - 2)</math>. It is minimal if the number of elements in a vector is chosen such that <math>m = \sqrt{\frac{n \cdot (h-3)T_{byte}}{3 \cdot T_{start}}}</math>.
 
== Applications ==