Reduce (parallel pattern)

 
Reduce is a collective communication primitive used in the context of a [[parallel programming model]] to combine multiple vectors into one, using an [[Operator associativity|associative]] [[Binary operation|binary operator]] <math>\oplus</math>. Initially, every vector resides on a distinct processor. The goal of the primitive is to apply the operator to the vectors in the order given by the processor indices until only one vector is left. The reduction of sets of elements is an integral part of programming models such as [[MapReduce]], where a [[Function (mathematics)|function]] is applied ([[Map (higher-order function)|mapped]]) to all elements before they are reduced. Other [[Parallel algorithm|parallel algorithms]] use reduce as a primary operation to solve more complex problems. The [[Message Passing Interface]] implements it in the operations <code>MPI_Reduce</code> and <code>MPI_Allreduce</code>, which differ in whether the result is available at a single (root) processing unit or at all of them. Closely related to reduce is the broadcast operation, which distributes data to all processors. Many reduce algorithms can be used for broadcasting by reversing them and omitting the operator.
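The element-wise combination that the primitive performs can be sketched sequentially. The following Python helper is an illustrative assumption (the name <code>vector_reduce</code> is not part of any MPI API); it produces the value that <code>MPI_Reduce</code> would leave at the root:

```python
import operator
from functools import reduce

def vector_reduce(vectors, op=operator.add):
    """Combine p vectors element-wise, applying the associative
    binary operator in processor-index order -- the result that
    a reduce operation leaves at the root processor."""
    # zip(*vectors) yields one tuple per element index,
    # containing that element from every processor's vector
    return [reduce(op, column) for column in zip(*vectors)]
```

For example, <code>vector_reduce([[1, 2], [3, 4], [5, 6]])</code> returns <code>[9, 12]</code>.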
 
The binary operator for vectors is defined element-wise such that <math>\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1}\end{pmatrix} \oplus^\star \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1}\end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}</math>. The algorithm further assumes that in the beginning <math>x_i = v_i</math> for all <math>i</math>, that <math>p</math> is a power of two, and uses the processing units <math>p_0,p_1,\dots p_{p-1}</math>. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom, and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor <math>p_i</math> evaluates the given operator on the element <math>x_i</math> it is currently holding and on <math>x_j</math>, where <math>j</math> is the minimal index fulfilling <math>j > i</math> such that <math>p_j</math> becomes inactive in the current step. <math>x_i</math> and <math>x_j</math> are not necessarily elements of the input set <math>X</math>, as the fields are overwritten and reused to hold previously evaluated expressions. To coordinate the roles of the processing units in each step without additional communication between them, the algorithm uses the fact that the processing units are indexed with numbers from <math>0</math> to <math>p-1</math>. In iteration <math>k</math>, each active processor inspects the <math>k</math>-th least significant bit of its index: if the bit is set, the processor sends its element to the partner whose index differs only in that bit and becomes inactive; if it is not set, the processor applies the operator to its own element and the received one. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
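The per-iteration partnering can be simulated sequentially. The sketch below assumes one element per processor and a power-of-two processor count; the function name is illustrative:

```python
def binomial_tree_reduce(x, op):
    """Sequentially simulate the binomial-tree reduction.

    x holds one element per processor; len(x) must be a power of two.
    In iteration k, processors whose k-th index bit is set go inactive
    and their element is combined into the partner with that bit clear.
    """
    x = list(x)
    p = len(x)
    dist = 1  # 2^k: distance to the partner in iteration k
    while dist < p:
        for i in range(0, p, 2 * dist):
            # processor i (k-th bit 0) stays active;
            # processor i + dist (k-th bit 1) goes inactive
            x[i] = op(x[i], x[i + dist])
        dist *= 2
    return x[0]  # only p_0 holds the result
```

Because the operator is only assumed associative, not commutative, each combination keeps the lower-indexed operand on the left; for instance, reducing <code>["a", "b", "c", "d"]</code> with string concatenation yields <code>"abcd"</code>.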
 
Only <math>p_0</math> holds the result in the end, so it is the root processor. For an Allreduce operation the result has to be distributed, which can be done by appending a broadcast from <math>p_0</math>. Furthermore, the number <math>p</math> of processors is restricted to be a power of two. This restriction can be lifted by padding the number of processors to the next power of two. There are also algorithms tailored to this case.<ref>{{Cite journal|last=Rabenseifner|first=Rolf|last2=Träff|first2=Jesper Larsson|date=2004-09-19|title=More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems|url=https://link.springer.com/chapter/10.1007/978-3-540-30218-6_13|journal=Recent Advances in Parallel Virtual Machine and Message Passing Interface|volume=3241|series=Lecture Notes in Computer Science|language=en|publisher=Springer, Berlin, Heidelberg|pages=36–46|doi=10.1007/978-3-540-30218-6_13|isbn=9783540231639}}</ref>
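The padding idea can be sketched as follows, assuming the operator has an identity element (e.g. 0 for addition). This is only an illustration; the cited tailored algorithms fold the surplus processors into the others instead of wasting them:

```python
def pad_to_power_of_two(vectors, identity):
    """Pad the list of per-processor vectors with identity vectors
    until the count is a power of two, so that a power-of-two reduce
    algorithm can be applied unchanged (illustrative sketch only)."""
    p, m = len(vectors), len(vectors[0])
    q = 1
    while q < p:  # smallest power of two >= p
        q *= 2
    # identity vectors do not change the element-wise result
    return vectors + [[identity] * m for _ in range(q - p)]
```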
 
==== Runtime analysis ====
 
== Pipeline algorithm ==
[[File:Pipeline reduce.gif|thumb|476x476px|Depiction of the pipeline algorithm using addition as the operator on vectors of size four.]]For distributed memory models, it can make sense to use pipelined communication. This is especially the case when <math>T_{start}</math> is small in comparison to <math>T_{byte}</math>. Usually, [[Pipeline (computing)|linear pipelines]] split data or a task into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors can be split into their elements and the operator evaluated element-wise:<ref>{{Cite journal|last=Bar-Noy|first=A.|last2=Kipnis|first2=S.|date=1994-09-01|title=Designing broadcasting algorithms in the postal model for message-passing systems|url=https://link.springer.com/article/10.1007/BF01184933|journal=Mathematical systems theory|language=en|volume=27|issue=5|pages=431–452|doi=10.1007/BF01184933|issn=0025-5661}}</ref>
 
:'''for''' <math>k \gets 0</math> '''to''' <math>p+m-3</math> '''do'''
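The pipelined schedule can be simulated sequentially over a processor chain <math>p_0 \to p_1 \to \dots \to p_{p-1}</math> with the root at the end of the chain; the function name is an illustrative assumption:

```python
def pipeline_reduce(vectors, op):
    """Simulate the pipelined reduction over a processor chain.

    In step t, processor i forwards element t - i (if it exists) of
    its accumulated vector to processor i + 1, which combines it with
    its own element. After p + m - 2 steps (t = 0 .. p + m - 3) the
    last processor holds the full element-wise reduction.
    """
    p, m = len(vectors), len(vectors[0])
    acc = [list(v) for v in vectors]  # running partial result per processor
    for t in range(p + m - 2):
        for i in range(p - 1):       # link i -> i + 1
            j = t - i                # element travelling on this link
            if 0 <= j < m:
                acc[i + 1][j] = op(acc[i][j], acc[i + 1][j])
    return acc[p - 1]
```

Element <math>j</math> leaves processor <math>i</math> in step <math>t = i + j</math>, so the last element arrives at the root in step <math>p + m - 3</math>, matching the loop bound in the pseudocode above.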
Some parallel [[Sorting algorithm|sorting]] algorithms use reductions to handle very large data sets.<ref>{{Cite journal|last=Axtmann|first=Michael|last2=Bingmann|first2=Timo|last3=Sanders|first3=Peter|last4=Schulz|first4=Christian|date=2014-10-24|title=Practical Massively Parallel Sorting|url=https://arxiv.org/abs/1410.6754|journal=ArXiv:1410.6754 [cs]|arxiv=1410.6754}}</ref>
== References ==
<references />