Reduce (parallel pattern): Difference between revisions

Content deleted Content added
Flogr (talk | contribs)
No edit summary
renamed redirect
Tag: Redirect target changed
 
(33 intermediate revisions by 19 users not shown)
Line 1:
#REDIRECT [[Reduction operator]] {{R from merge}}
The reduction operation is a communication primitive used in the context of a [[parallel programming model]] to combine multiple vectors into one, using an [[Operator associativity|associative]] [[Binary operation|binary operator]] <math>\oplus</math>. Every vector is present at a distinct processor in the beginning, the goal is to apply the operator in the order given by the processor-indices to the vectors until only one is left. The reduction is an integral part of programming models such as [[MapReduce|Map Reduce]], where a [[Function (mathematics)|function]] is applied (mapped) to all elements before they are reduced. Other [[Parallel algorithm|parallel algorithms]] use the reduce operation as a primary operation to solve more complex problems. The [[Message Passing Interface]] implements it in the operations <code>MPI_Reduce</code> and <code>MPI_Allreduce</code>, with the difference that the result is available at one (root) processing unit or all of them. Closely related to the reduction is the broadcast operation, which distributes data to all processors. Many reduction algorithms can be used for broadcasting by reverting them and omitting the operator.
 
== Definition ==
Formally, the reduction takes an associative (but not necessarily commutative) operator <math>\oplus</math>, which can be evaluated in constant time and an input set <math>V = \{v_0 = \begin{pmatrix} e_0^0 \\ \vdots \\ e_0^{m-1}\end{pmatrix}, v_1 = \begin{pmatrix} e_1^0 \\ \vdots \\ e_1^{m-1}\end{pmatrix}, \dots, v_{p-1} = \begin{pmatrix} e_{p-1}^0 \\ \vdots \\ e_{p-1}^{m-1}\end{pmatrix}\}</math>of <math>p</math> vectors of size <math>m</math>. The result <math>r</math> of the operation is the combination of the elements <math>r = \begin{pmatrix} e_0^0 \oplus e_1^0 \oplus \dots \oplus e_{p-1}^0 \\ \vdots \\ e_0^{m-1} \oplus e_1^{m-1} \oplus \dots \oplus e_{p-1}^{m-1}\end{pmatrix} = \begin{pmatrix} \bigoplus_{i=0}^{p-1} e_i^0 \\ \vdots \\ \bigoplus_{i=0}^{p-1} e_i^{m-1} \end{pmatrix}</math> and has to be stored at a specified root processor at the end of the execution. For example, the result of a reduction on the set <math>\{3,5,7,9\}</math>, where all vectors have size one is <math>3 + 5 + 7 + 9 = 24</math>. If the result <math>r</math> has to be available at every processor after the computation has finished, it is often called Allreduce. An optimal sequential linear-time algorithm for reduction can apply the operator successively from front to back, always replacing two vectors with the result of the operation applied to all its elements, thus creating an instance that has one vector less. It needs <math>(p-1)\cdot m </math> steps until only <math>r</math> is left. Sequential algorithms can not perform better than linear time, but parallel algorithms leave some space left to optimize.
 
== Binomial tree algorithms ==
Regarding parallel algorithms, there are two main models of parallel computation, the [[Parallel random-access machine|parallel random access machine]] as an extension of the RAM with shared memory between processing units and the [[Bulk synchronous parallel|bulk synchronous parallel computer]] which takes communication and [[Synchronization (computer science)|synchronization]] into account. Both models have different implications for the [[Time complexity|time-complexity]], therefore two algorithms will be shown.
 
=== PRAM-algorithm ===
This algorithm represents a widely spread method to handle inputs where <math>p</math> is a power of two. The reverse procedure is often used for broadcasting elements.<ref>{{Cite journal|last=Bar-Noy|first=Amotz|last2=Kipnis|first2=Shlomo|title=Broadcasting multiple messages in simultaneous send/receive systems|url=https://doi.org/10.1016/0166-218X(94)90001-9|journal=Discrete Applied Mathematics|volume=55|issue=2|pages=95–105|doi=10.1016/0166-218x(94)90001-9}}</ref><ref>{{Cite journal|last=Santos|first=Eunice E.|title=Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines|url=https://doi.org/10.1006/jpdc.2000.1698|journal=Journal of Parallel and Distributed Computing|volume=62|issue=4|pages=517–543|doi=10.1006/jpdc.2000.1698}}</ref><ref>{{Cite journal|last=Slater|first=P.|last2=Cockayne|first2=E.|last3=Hedetniemi|first3=S.|date=1981-11-01|title=Information Dissemination in Trees|url=http://epubs.siam.org/doi/10.1137/0210052|journal=SIAM Journal on Computing|volume=10|issue=4|pages=692–701|doi=10.1137/0210052|issn=0097-5397}}</ref>[[File:Binomial tree.gif|thumb|436x436px|
Visualization of the algorithm executed on eight elements using addition as the operator
]]
: '''for''' <math>k \gets 0</math> '''to''' <math>\lceil\log_2 p\rceil - 1</math> '''do'''
:: '''for''' <math>i \gets 0</math> '''to''' <math>p - 1</math> '''do in parallel'''
::: '''if''' <math>p_i</math> '''is active then'''
:::: '''if bit''' <math>k</math> '''of''' <math>i</math> '''is set then'''
::::: '''set''' <math>p_i</math> '''to inactive'''
:::: '''else if''' <math>i + 2^k < p</math>
::::: <math>x_i \gets x_i \oplus^{\star} x_{i+2^k}</math>
The binary operator for vectors is defined such that <math>\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1}\end{pmatrix} \oplus^\star \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1}\end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}</math>. The algorithm further assumes that in the beginning <math>x_i = v_i</math> for all <math>i</math> and <math>p</math> is a power of two and uses the processing units <math>p_0,p_1,\dots p_{n-1}</math>. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line take place. The eight input elements are located on the bottom and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor <math>p_i</math> evaluates the given operator on the element <math>x_i</math> it is currently holding and <math>x_j</math> where <math>j</math> is the minimal index fulfilling <math>j > i</math>, so that <math>p_j</math> is becoming an inactive processor in the current step. <math>x_i</math> and <math>x_j</math> are not necessarily elements of the input set <math>X</math> as the fields are overwritten and reused for previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the fact that the processing units are indexed with numbers from <math>0</math> to <math>p-1</math> is used. Each processor looks at its <math>k</math>-th least significant bit and decides whether to get inactive or compute the operator on its own element and the element with the index where the <math>k</math>-th bit is not set. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
 
Only <math>p_0</math> holds the result in the end, therefore it is the root processor. For an Allreduce-operation the result has to be distributed, which can be done by appending a broadcast from <math>p_0</math>. Furthermore, the number <math>p</math> of processors is restricted to be a power of two. This can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use-case.<ref>{{Cite journal|last=Rabenseifner|first=Rolf|last2=Träff|first2=Jesper Larsson|date=2004-09-19|title=More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems|url=https://link.springer.com/chapter/10.1007/978-3-540-30218-6_13|journal=Recent Advances in Parallel Virtual Machine and Message Passing Interface|series=Lecture Notes in Computer Science|language=en|publisher=Springer, Berlin, Heidelberg|pages=36–46|doi=10.1007/978-3-540-30218-6_13|isbn=9783540231639}}</ref>
 
==== Runtime analysis ====
The main loop is executed <math>\lceil\log_2 p\rceil</math> times, the time needed for the part done in parallel is in <math>\mathcal{O}(m)</math> as a processing unit either combines two vectors or becomes inactive. Thus the parallel time <math>T(p,m)</math> for the PRAM is <math>T(p,m) = \mathcal{O}(log(p) \cdot m)</math>. The strategy for handling read and write conflicts can be chosen as restrictive as an exclusive read and exclusive write (EREW). The efficiency <math>S(p,m)</math> of the algorithm is <math>S(p,m) \in \mathcal{O}(\frac{T_{seq}}{T(p,m)}) = \mathcal{O}(\frac{p}{log(p)})</math> and therefore the efficiency is <math>E(p,m) \in \mathcal{O}(\frac{S(p,m)}{p}) = \mathcal{O}(\frac{1}{log(p)})</math>. The efficiency suffers because of the fact that half of the active processing units become inactive after each step, so <math>\frac{p}{2^i}</math> units are active in step <math>i</math>.
 
=== Distributed memory algorithm ===
In contrast to the PRAM-algorithm, in the [[distributed memory]] model memory is not shared between processing units and data has to be exchanged explicitly between units, resulting in communication overhead that is accounted for. The following algorithm takes this into consideration.
: '''for''' <math>k \gets 0</math> '''to''' <math>\lceil\log_2 p\rceil - 1</math> '''do'''
:: '''for''' <math>i \gets 0</math> '''to''' <math>p - 1</math> '''do in parallel'''
::: '''if''' <math>p_i</math> '''is active then'''
:::: '''if bit''' <math>k</math> '''of''' <math>i</math> '''is set then'''
::::: '''send''' <math>x_i</math> '''to''' <math>p_{i-2^k}</math>
::::: '''set''' <math>p_k</math> '''to inactive'''
:::: '''else if''' <math>i + 2^k < p</math>
::::: '''receive''' <math>x_{i+2^k}</math>
::::: <math>x_i \gets x_i \oplus^\star x_{i+2^k}</math>
The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives, the operating principle stays the same.
 
==== Runtime analysis ====
A simple analysis for the algorithm uses the BSP-model and incorporates the time <math>l</math> needed to initiate communication and <math>g</math> the time needed to send a packet. Then the resulting runtime is <math>\Theta((l + m \cdot g)\cdot log(p))</math>, as <math>m</math> elements of a vector are send in each iteration.
 
== Pipeline-algorithm ==
[[File:Pipeline reduce.gif|thumb|476x476px|Depiction of the pipeline-algorithm using addition as the operator on vectors of size four.]]For distributed memory models, it can make sense to use pipelined communication. This is especially the case when <math>T_{start}</math> is small in comparison to <math>T_{byte}</math>. Usually, [[Pipeline (computing)|linear pipelines]] split data or a task into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are not inseparable, but the operator can be evaluated for single elements<ref>{{Cite journal|last=Bar-Noy|first=A.|last2=Kipnis|first2=S.|date=1994-09-01|title=Designing broadcasting algorithms in the postal model for message-passing systems|url=https://link.springer.com/article/10.1007/BF01184933|journal=Mathematical systems theory|language=en|volume=27|issue=5|pages=431–452|doi=10.1007/BF01184933|issn=0025-5661}}</ref>:
 
:'''for''' <math>k \gets 0</math> '''to''' <math>p+m-3</math> '''do'''
:: '''for''' <math>i \gets 0</math> '''to''' <math>p - 1</math> '''do in parallel'''
::: '''if''' <math>i \leq k < i+m \land i \neq p-1</math>
:::: '''send''' <math>x_i^{k-i} </math> '''to''' <math>p_{i+1} </math>
::: '''if''' <math>i-1 \leq k < i-1+m \land i \neq 0</math>
:::: '''receive''' <math>x_{i-1}^{k+i-1}</math> '''from''' <math>p_{i-1}</math>
:::: <math>x_{i}^{k+i-1} \gets x_{i}^{k+i-1} \oplus x_{i-1}^{k+i-1}</math>
It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at <math>p_{p-1}</math> at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution are <math>p + m -2</math>, it takes <math>p-1</math> steps until the last processing unit receives its first element and additional <math>m-1</math> until all elements are received. Therefore the runtime in the BSP-model is <math>T(n,p,m) = (T_{start} + \frac{n}{m}T_{byte})(p+m-2)</math>, assuming that <math>n</math> is the total byte-size of all elements together.
 
<math>n</math> is incorporated into the formula because it is possible to adapt the granularity in which the operator is computed. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is send each step, but the number of steps has roughly halved. It means that the parameter <math>m</math> is halfed, while the total byte-size <math>n</math> stays the same. The runtime <math>T(p)</math> for this approach depends on the value of <math>m</math>, which can be optimized if <math>T_{start}</math> and <math>T_{byte}</math> are known. It is optimal for <math>m = \sqrt{\frac{n \cdot (p-2)T_{byte}}{T_{start}}}</math>, assuming that this results in a smaller <math>m</math> that divides the original one.
 
== Pipelined tree ==
[[File:Pipelined binomial.gif|thumb|483x483px|Pipelined Fibonacci-tree algorithm using addition.]]
The binomial tree and the pipeline both have their advantages and disadvantages, depending on the values of <math>T_{start}</math> and <math>T_{byte}</math> for the parallel communication. While the binomial tree algorithm is better suited for small vectors, the pipelined algorithm profits from a distribution of the elements to fewer processing units with more elements contained in one vector. Both approaches can be combined into one algorithm<ref>{{Cite journal|last=Sanders|first=Peter|last2=Sibeyn|first2=Jop F|title=A bandwidth latency tradeoff for broadcast and reduction|url=https://doi.org/10.1016/S0020-0190(02)00473-8|journal=Information Processing Letters|volume=86|issue=1|pages=33–38|doi=10.1016/s0020-0190(02)00473-8}}</ref> which uses a tree as its underlying communication pattern and splits the computation of the operator into pieces at the same time. Instead of the binomial tree, a Fibonacci tree is used which has the property that the height of the trees rooted at its two children differ by one. It helps to balance the load on all processing units as each unit can only evaluate one operator in one iteration on one of its elements, but it has two child-processors it receives values from.
 
=== Algorithm description ===
The animation shows the execution of such an algorithm in a full-duplex communication model. Communication links are represented by black lines between the vectors of elements and build a Fibonacci tree of size seven in this example. When an element is send to another processing unit the link is colored with the color of the corresponding element. An element that is received by a processor is added to the already existing element of same color (at the same index in the vector).
 
The algorithm itself propagates the partial sums from bottom to top until all elements are contained in the sum at the root processor on top. In the first step of the execution, the processing units which are leafs in the underlying tree send their first elements to their parent. This is similar to the send operations of the binomial tree algorithm with the key difference that the leaf units each have two more elements which have to be send and therefore do not become inactive, but can continue to send elements, which is analogous to the pipelined approach and improves efficiency. Processing units that are not leafs start to send their elements in order of the indices in the vector once they have received an element from a child. In the example they send green, blue and red elements in this order. If two processors compete to send their elements to the same processor, then the element of the right child is received first. Because of the structure of the Fibonacci tree all processors send or receive elements while the "pipeline" is filled. The pipeline is filled from the point where each unit has received an element until the leaf units have no more elements to send.
 
=== Runtime ===
Each iteration of the algorithm takes at most time <math>p \cdot T_{byte} + T_{start}</math>. The height of the tree factors into the time it needs to fill the pipeline and for Fibonacci trees it is known to be about <math>h = log_{\phi}p</math> where <math>\phi = \frac{1 + \sqrt{5}}{2}</math> is the golden ratio. Once the pipeline is filled, all processors are active in each step. Because inner nodes have two children, they have to receive <math>2 \cdot m</math> elements. Therefore the runtime of the algorithm is <math>T(p,m) \approx (p \cdot T_{byte} + T_{start})(h + 2 \cdot k - 2)</math>. It is minimal if the size of the vectors is chosen such that <math>m = \sqrt{\frac{n \cdot (h-3)T_{byte}}{3 \cdot T_{start}}}</math> where <math>n</math> is the total number of elements.
 
== Applications ==
Reduction is one of the main collective operations implemented in the [[Message Passing Interface]], where performance of the used algorithm is important and evaluated constantly for different use cases.<ref>{{Cite journal|last=Pješivac-Grbović|first=Jelena|last2=Angskun|first2=Thara|last3=Bosilca|first3=George|last4=Fagg|first4=Graham E.|last5=Gabriel|first5=Edgar|last6=Dongarra|first6=Jack J.|date=2007-06-01|title=Performance analysis of MPI collective operations|url=https://link.springer.com/article/10.1007/s10586-007-0012-0|journal=Cluster Computing|language=en|volume=10|issue=2|pages=127–143|doi=10.1007/s10586-007-0012-0|issn=1386-7857}}</ref>
 
[[MapReduce]] relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters.<ref>{{Cite journal|last=Lämmel|first=Ralf|title=Google’s MapReduce programming model — Revisited|url=https://doi.org/10.1016/j.scico.2007.07.001|journal=Science of Computer Programming|volume=70|issue=1|pages=1–30|doi=10.1016/j.scico.2007.07.001}}</ref><ref>{{Cite journal|last=Senger|first=Hermes|last2=Gil-Costa|first2=Veronica|last3=Arantes|first3=Luciana|last4=Marcondes|first4=Cesar A. C.|last5=Marín|first5=Mauricio|last6=Sato|first6=Liria M.|last7=da Silva|first7=Fabrício A.B.|date=2016-06-10|title=BSP cost and scalability analysis for MapReduce operations|url=http://onlinelibrary.wiley.com/doi/10.1002/cpe.3628/abstract|journal=Concurrency and Computation: Practice and Experience|language=en|volume=28|issue=8|pages=2503–2527|doi=10.1002/cpe.3628|issn=1532-0634}}</ref>
 
Some parallel [[Sorting algorithm|sorting]] algorithms use reductions to be able to handle very big data sets.<ref>{{Cite journal|last=Axtmann|first=Michael|last2=Bingmann|first2=Timo|last3=Sanders|first3=Peter|last4=Schulz|first4=Christian|date=2014-10-24|title=Practical Massively Parallel Sorting|url=http://arxiv.org/abs/1410.6754|journal=arXiv:1410.6754 [cs]}}</ref>{{User sandbox}}
<!-- EDIT BELOW THIS LINE -->
== References ==