Reduce (parallel pattern)

=== PRAM-algorithm ===
This algorithm is a widespread method for handling inputs where <math>p</math> is a power of two. The reverse procedure is often used for broadcasting elements.<ref>{{Cite journal|last=Bar-Noy|first=Amotz|last2=Kipnis|first2=Shlomo|title=Broadcasting multiple messages in simultaneous send/receive systems|url=https://doi.org/10.1016/0166-218X(94)90001-9|journal=Discrete Applied Mathematics|volume=55|issue=2|pages=95–105|doi=10.1016/0166-218x(94)90001-9|year=1994}}</ref><ref>{{Cite journal|last=Santos|first=Eunice E.|title=Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines|url=https://doi.org/10.1006/jpdc.2000.1698|journal=Journal of Parallel and Distributed Computing|volume=62|issue=4|pages=517–543|doi=10.1006/jpdc.2000.1698|year=2002}}</ref><ref>{{Cite journal|last=Slater|first=P.|last2=Cockayne|first2=E.|last3=Hedetniemi|first3=S.|date=1981-11-01|title=Information Dissemination in Trees|url=http://epubs.siam.org/doi/10.1137/0210052|journal=SIAM Journal on Computing|volume=10|issue=4|pages=692–701|doi=10.1137/0210052|issn=0097-5397}}</ref>[[File:Binomial tree.gif|thumb|436x436px|
Visualization of the algorithm executed on eight elements using addition as the operator
]]
The binary operator for vectors is defined element-wise such that <math>\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1}\end{pmatrix} \oplus^\star \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1}\end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}</math>. The algorithm further assumes that <math>x_i = v_i</math> for all <math>i</math> at the beginning, that <math>p</math> is a power of two, and uses the processing units <math>p_0,p_1,\dots p_{p-1}</math>. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor <math>p_i</math> evaluates the given operator on the element <math>x_i</math> it is currently holding and on <math>x_j</math>, where <math>j</math> is the minimal index with <math>j > i</math> such that <math>p_j</math> becomes inactive in the current step. <math>x_i</math> and <math>x_j</math> are not necessarily elements of the input set <math>X</math>, as these fields are overwritten and reused to hold previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the algorithm exploits the fact that the processing units are indexed with numbers from <math>0</math> to <math>p-1</math>. In iteration <math>k</math>, each processor inspects the <math>k</math>-th least significant bit of its index and decides whether to become inactive or to apply the operator to its own element and the element held by the processor whose index differs only in the <math>k</math>-th bit being set. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
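The bit rule above can be simulated sequentially. The following sketch is an illustration only, not message-passing code: the function name, the `op` parameter, and the sequential simulation of the parallel rounds are assumptions; list index <math>i</math> plays the role of processor <math>p_i</math>.

```python
def binomial_tree_reduce(x, op):
    """Reduce list x (len(x) a power of two) with the binary operator op.

    In round k, a processor whose k lowest index bits are all zero stays
    active and combines its value with the partner whose index differs
    only in the k-th bit; that partner then becomes inactive.
    """
    p = len(x)
    assert p & (p - 1) == 0, "p must be a power of two"
    x = list(x)  # copy: fields are overwritten and reused
    k = 0
    while (1 << k) < p:
        for i in range(p):
            if i & ((1 << (k + 1)) - 1) == 0:  # low k+1 bits unset: i stays active
                j = i | (1 << k)               # partner with the k-th bit set
                x[i] = op(x[i], x[j])
        k += 1
    return x[0]  # only p_0 holds the result

print(binomial_tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # 36
```

With eight elements and addition, the three rounds mirror the animation: four pairwise sums, then two, then one, leaving the total at index 0.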
 
Only <math>p_0</math> holds the result in the end and is therefore the root processor. For an Allreduce operation the result has to be distributed to all processing units, which can be done by appending a broadcast from <math>p_0</math>. Furthermore, the number <math>p</math> of processors is restricted to be a power of two. This restriction can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored to this use case.<ref>{{Cite journal|last=Rabenseifner|first=Rolf|last2=Träff|first2=Jesper Larsson|date=2004-09-19|title=More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems|url=https://link.springer.com/chapter/10.1007/978-3-540-30218-6_13|journal=Recent Advances in Parallel Virtual Machine and Message Passing Interface|volume=3241|series=Lecture Notes in Computer Science|language=en|publisher=Springer, Berlin, Heidelberg|pages=36–46|doi=10.1007/978-3-540-30218-6_13|isbn=9783540231639}}</ref>
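A minimal sketch of the padding idea, assuming the operator has a neutral element (here 0 for addition); the helper name is hypothetical. Padded entries are neutral, so they leave the reduction result unchanged:

```python
from functools import reduce

def pad_to_power_of_two(values, identity):
    """Extend values with the operator's neutral element so the
    length becomes the next power of two (hypothetical helper)."""
    size = 1
    while size < len(values):
        size *= 2
    return list(values) + [identity] * (size - len(values))

padded = pad_to_power_of_two([3, 1, 4, 1, 5], 0)  # 5 elements -> 8
assert len(padded) == 8
# The padded zeros are neutral for addition, so the result is unchanged:
assert reduce(lambda a, b: a + b, padded) == 3 + 1 + 4 + 1 + 5
```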
 
==== Runtime analysis ====
== Pipelined tree ==
[[File:Pipelined binomial.gif|thumb|483x483px|Pipelined Fibonacci-tree algorithm using addition.]]
The binomial tree and the pipeline both have their advantages and disadvantages, depending on the values of <math>T_{start}</math> and <math>T_{byte}</math> for the parallel communication. While the binomial tree algorithm is better suited for small vectors, the pipelined algorithm benefits when the elements are distributed to fewer processing units, each holding a longer vector. Both approaches can be combined into one algorithm<ref>{{Cite journal|last=Sanders|first=Peter|last2=Sibeyn|first2=Jop F|title=A bandwidth latency tradeoff for broadcast and reduction|url=https://doi.org/10.1016/S0020-0190(02)00473-8|journal=Information Processing Letters|volume=86|issue=1|pages=33–38|doi=10.1016/s0020-0190(02)00473-8|year=2003}}</ref> which uses a tree as its underlying communication pattern and at the same time splits the computation of the operator into pieces. Instead of the binomial tree, a [[Fibonacci number|Fibonacci]] tree is used, which has the property that the heights of the trees rooted at its two children differ by one. This helps to balance the load on all processing units, as each unit can evaluate only one operator per iteration on one of its elements, but it has two child processors from which it receives values.
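The pipelining half of this combination can be sketched on its own; the tree structure is left out and a simple linear combining order is used instead. Function name, chunk size, and the sequential simulation are assumptions; on a real machine, chunk <math>t+1</math> would be communicated while chunk <math>t</math> is being combined.

```python
def pipelined_reduce(vectors, op, chunk=2):
    """Component-wise reduce a list of equal-length vectors, processing
    `chunk` components per pipeline step (sequential simulation)."""
    m = len(vectors[0])
    result = list(vectors[0])
    for start in range(0, m, chunk):       # one chunk per pipeline step
        for v in vectors[1:]:              # chunk travels along the pipeline
            for i in range(start, min(start + chunk, m)):
                result[i] = op(result[i], v[i])
    return result

print(pipelined_reduce([[1, 2, 3, 4], [5, 6, 7, 8]], lambda a, b: a + b))
# [6, 8, 10, 12]
```

This is exactly the element-wise operator <math>\oplus^\star</math> from above, evaluated chunk by chunk rather than on whole vectors at once.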
 
=== Algorithm description ===
Reduction is one of the main collective operations implemented in the [[Message Passing Interface]], where performance of the used algorithm is important and evaluated constantly for different use cases.<ref>{{Cite journal|last=Pješivac-Grbović|first=Jelena|last2=Angskun|first2=Thara|last3=Bosilca|first3=George|last4=Fagg|first4=Graham E.|last5=Gabriel|first5=Edgar|last6=Dongarra|first6=Jack J.|date=2007-06-01|title=Performance analysis of MPI collective operations|url=https://link.springer.com/article/10.1007/s10586-007-0012-0|journal=Cluster Computing|language=en|volume=10|issue=2|pages=127–143|doi=10.1007/s10586-007-0012-0|issn=1386-7857}}</ref>
 
[[MapReduce]] relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters.<ref>{{Cite journal|last=Lämmel|first=Ralf|title=Google's MapReduce programming model — Revisited|url=https://doi.org/10.1016/j.scico.2007.07.001|journal=Science of Computer Programming|volume=70|issue=1|pages=1–30|doi=10.1016/j.scico.2007.07.001|year=2008}}</ref><ref>{{Cite journal|last=Senger|first=Hermes|last2=Gil-Costa|first2=Veronica|last3=Arantes|first3=Luciana|last4=Marcondes|first4=Cesar A. C.|last5=Marín|first5=Mauricio|last6=Sato|first6=Liria M.|last7=da Silva|first7=Fabrício A.B.|date=2016-06-10|title=BSP cost and scalability analysis for MapReduce operations|url=http://onlinelibrary.wiley.com/doi/10.1002/cpe.3628/abstract|journal=Concurrency and Computation: Practice and Experience|language=en|volume=28|issue=8|pages=2503–2527|doi=10.1002/cpe.3628|issn=1532-0634}}</ref>
 
Some parallel [[Sorting algorithm|sorting]] algorithms use reductions to be able to handle very big data sets.<ref>{{Cite journal|last=Axtmann|first=Michael|last2=Bingmann|first2=Timo|last3=Sanders|first3=Peter|last4=Schulz|first4=Christian|date=2014-10-24|title=Practical Massively Parallel Sorting|url=https://arxiv.org/abs/1410.6754|journal=arXiv:1410.6754 [cs]|arxiv=1410.6754}}</ref>
== References ==
<references />