Reduce (parallel pattern)

This is an old revision of this page, as edited by Flogr (talk | contribs) at 15:57, 6 March 2018 (PRAM-algorithm).

The reduction operation is a communication primitive used in the context of a parallel programming model to combine multiple vectors into one, using an associative binary operator ⊕. Every vector is present at a distinct processor in the beginning; the goal is to apply the operator, in the order given by the processor indices, to the vectors until only one is left. Reduction is an integral part of programming models such as MapReduce, where a function is applied (mapped) to all elements before they are reduced. Other parallel algorithms use the reduce operation as a primitive to solve more complex problems. The Message Passing Interface implements it in the operations MPI_Reduce and MPI_Allreduce, the difference being that the result is available at one (root) processing unit or at all of them. Closely related to reduction is the broadcast operation, which distributes data to all processors. Many reduction algorithms can be turned into broadcast algorithms by reversing them and omitting the operator.

Definition

Formally, the reduction takes an associative (but not necessarily commutative) operator ⊕ that can be evaluated in constant time, and an input set V = {v_0, …, v_{p−1}} of p vectors with m elements each. The result r = v_0 ⊕ v_1 ⊕ … ⊕ v_{p−1} of the operation is the combination of all elements and has to be stored at a specified root processor at the end of the execution. For example, the result of a reduction on a set {x_0, x_1, x_2, x_3}, where all vectors have size one, is x_0 ⊕ x_1 ⊕ x_2 ⊕ x_3. If the result r has to be available at every processor, the operation is often called Allreduce. An optimal sequential linear-time algorithm applies the operator successively from front to back, always replacing two vectors with the result of the operation applied to all their elements, thus creating an instance with one vector fewer. It needs (p − 1)·m element operations until only r is left. Sequential algorithms cannot perform better than linear time, but parallel algorithms leave room for optimization.
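The sequential procedure can be sketched in a few lines. This is a minimal illustration, assuming Python lists as vectors and an arbitrary associative operator passed as a function (all names are illustrative):

```python
# Minimal sketch of the sequential linear-time reduction: repeatedly replace
# the first two vectors with their element-wise combination.
def reduce_sequential(vectors, op):
    """Combine p vectors of size m using (p - 1) * m applications of op."""
    result = list(vectors[0])
    for vec in vectors[1:]:
        # Apply the operator element-wise, replacing two vectors with one.
        result = [op(a, b) for a, b in zip(result, vec)]
    return result

# Example: three vectors of size two under addition.
print(reduce_sequential([[1, 2], [3, 4], [5, 6]], lambda a, b: a + b))  # [9, 12]
```

Because the operator is applied front to back, the order of the processor indices is preserved even for non-commutative operators.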

Binomial tree algorithms

Regarding parallel algorithms, there are two main models of parallel computation: the parallel random access machine (PRAM) as an extension of the RAM with shared memory between processing units, and the bulk synchronous parallel (BSP) computer, which takes communication and synchronization into account. Both models have different implications for the time complexity; therefore, two algorithms will be shown.

PRAM-algorithm

This algorithm represents a widely spread method to handle inputs where p is a power of two. The reverse procedure is often used for broadcasting elements.[1][2][3]

 
Visualization of the algorithm executed on eight elements using addition as the operator

for k ← 0 to ⌈log₂ p⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                set p_i to inactive
            else if i + 2^k < p
                x_i ← x_i ⊕ x_{i+2^k}

The binary operator for vectors is defined element-wise, such that (x ⊕ y)_j = x_j ⊕ y_j. The algorithm further assumes that in the beginning x_i = v_i for all i, that p is a power of two, and uses the processing units p_0, p_1, …, p_{p−1}. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor p_i evaluates the given operator on the element x_i it is currently holding and on x_j, where j = i + 2^k is the minimal index greater than i of a still-active processor, so that p_j becomes inactive in the current step. x_i and x_j are not necessarily elements of the input set V, as the fields are overwritten and reused for previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the algorithm exploits the fact that the processing units are indexed with numbers from 0 to p − 1. Each processor looks at its k-th least significant bit and decides whether to become inactive or to combine its own element with the element of the unit whose index is 2^k larger (and whose k-th bit is therefore set). The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
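The schedule described above can be simulated sequentially. The following sketch (illustrative, not actual PRAM code) mimics the binomial-tree pattern for p a power of two:

```python
# Simulate the binomial-tree reduction: in round k, every still-active unit i
# with bit k of i unset absorbs the vector of unit i + 2^k, which goes inactive.
def binomial_tree_reduce(x, op):
    p = len(x)                      # p is assumed to be a power of two
    x = [list(v) for v in x]        # working copies of the input vectors
    k = 0
    while (1 << k) < p:
        for i in range(p):
            # Unit i is still active iff its k lowest-order bits are all zero;
            # it combines with its partner iff its own bit k is also zero.
            if i % (1 << k) == 0 and i & (1 << k) == 0 and i + (1 << k) < p:
                partner = x[i + (1 << k)]
                # Lower index on the left, preserving the order of the input
                # for non-commutative operators.
                x[i] = [op(a, b) for a, b in zip(x[i], partner)]
        k += 1
    return x[0]                     # the root p_0 holds the result

print(binomial_tree_reduce([[i] for i in range(8)], lambda a, b: a + b))  # [28]
```

Each pass of the while loop corresponds to one parallel step; the inner for loop only enumerates what the p units would do concurrently.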

Only p_0 holds the result in the end, therefore it is the root processor. For an Allreduce operation the result has to be distributed, which can be done by appending a broadcast from p_0. Furthermore, the number p of processors is restricted to be a power of two. This restriction can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use case.[4]

Runtime analysis

The main loop is executed ⌈log₂ p⌉ times; the time needed for the part done in parallel is in O(m), as a processing unit either combines two vectors or becomes inactive. Thus the parallel time for the PRAM is T(p, m) = O(m · log p). The strategy for handling read and write conflicts can be chosen as restrictively as exclusive read and exclusive write (EREW). The speedup of the algorithm is S(p, m) = O(p / log p) and therefore the efficiency is E(p, m) = S(p, m) / p = O(1 / log p). The efficiency suffers from the fact that half of the active processing units become inactive after each step, so only p / 2^k units are active in step k.

Distributed memory algorithm

In contrast to the PRAM algorithm, in the distributed memory model, memory is not shared between processing units and data has to be exchanged explicitly between units, resulting in communication overhead that must be accounted for. The following algorithm takes this into consideration.

for k ← 0 to ⌈log₂ p⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                send x_i to p_{i−2^k}
                set p_i to inactive
            else if i + 2^k < p
                receive x_{i+2^k}
                x_i ← x_i ⊕ x_{i+2^k}

The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives; the operating principle stays the same.
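The explicit message passing can be sketched with a mailbox dictionary standing in for the network (an illustrative simulation, not MPI code):

```python
# Distributed-memory variant: in round k, units with bit k set send their
# vector to the partner 2^k below them and go inactive; the partner receives
# the vector and combines it with its own.
def distributed_reduce(x, op):
    p = len(x)                       # assumed to be a power of two
    x = [list(v) for v in x]
    rounds = (p - 1).bit_length()    # ceil(log2 p) for p a power of two
    for k in range(rounds):
        mailbox = {}                 # recipient index -> vector "in flight"
        for i in range(p):           # all sends of round k happen first ...
            if i % (1 << k) == 0 and i & (1 << k):
                mailbox[i - (1 << k)] = x[i]      # send x_i to p_{i - 2^k}
        for i, received in mailbox.items():       # ... then all receives
            x[i] = [op(a, b) for a, b in zip(x[i], received)]
    return x[0]

print(distributed_reduce([[i, 1] for i in range(8)], lambda a, b: a + b))  # [28, 8]
```

Separating the send phase from the receive phase within each round mirrors the requirement that communication in one parallel step completes before the next step begins.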

Runtime analysis

A simple analysis of the algorithm uses the BSP model and incorporates the time T_start needed to initiate communication and the time T_byte needed to send one byte. Then the resulting runtime is Θ((T_start + n·T_byte)·⌈log₂ p⌉), as all m elements of a vector, with total size n, are sent in each iteration.

Pipeline-algorithm

 
Depiction of the pipeline-algorithm using addition as the operator on vectors of size four.

For distributed memory models, it can make sense to use pipelined communication. This is especially the case when T_start is small in comparison to T_byte. Usually, linear pipelines split data or a task into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are not inseparable: the operator can be evaluated for single elements:[5]

for k ← 0 to p + m − 3 do
    for i ← 0 to p − 1 do in parallel
        if i ≤ k < i + m and i ≠ p − 1
            send x_i[k − i] to p_{i+1}
        if i − 1 ≤ k < i − 1 + m and i ≠ 0
            receive x_{i−1}[k − i + 1] from p_{i−1}
            x_i[k − i + 1] ← x_{i−1}[k − i + 1] ⊕ x_i[k − i + 1]

It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at p_{p−1} at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution is p + m − 2: it takes p − 1 steps until the last processing unit receives its first element and m − 1 additional steps until all elements are received. Therefore, the runtime in the BSP model is T(n, p, m) = (T_start + (n/m)·T_byte)·(p + m − 2), where n is the total size of a vector. m is incorporated into the formula because it is possible to adapt the granularity at which the operator is computed. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is sent each step, but the number of steps has roughly halved. Effectively, the example instance is then handled as an instance with vectors of size two. The runtime for this approach depends on the value of m, which can be optimized when T_start and T_byte are known; it is optimal for m = √(n·(p − 2)·T_byte / T_start).
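The pipelined schedule can likewise be simulated step by step. In the sketch below (same illustrative assumptions as before), in step k unit i forwards element k − i to its successor, so partial results ripple down the chain:

```python
# Pipelined reduction over a chain: element j leaves unit i in step i + j, so
# the last unit accumulates the complete result, one element per step.
def pipeline_reduce(x, op):
    p, m = len(x), len(x[0])         # p vectors of m elements each
    x = [list(v) for v in x]
    for k in range(p + m - 2):       # p + m - 2 parallel steps in total
        for i in range(p - 1):       # unit i sends element k - i to unit i + 1
            j = k - i
            if 0 <= j < m:
                # The receiver folds the incoming element into its own (lower
                # index on the left to respect a non-commutative operator).
                x[i + 1][j] = op(x[i][j], x[i + 1][j])
    return x[-1]                     # result vector at p_{p-1}

print(pipeline_reduce([[1, 2, 3, 4], [1, 1, 1, 1], [2, 2, 2, 2]],
                      lambda a, b: a + b))  # [4, 5, 6, 7]
```

Within one step k, no unit reads an element that another unit writes in the same step (the written index k − i and the forwarded index k − (i + 1) always differ), which is why the concurrent sends and receives compose correctly.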

Pipelined tree

 
Pipelined Fibonacci-tree algorithm using addition.

The binomial tree and the pipeline both have their advantages and disadvantages, depending on the values of T_start and T_byte for the parallel communication. They can be combined into one algorithm[6] which uses a tree as its underlying communication pattern and splits the computation of the operator into pieces at the same time. Instead of a binomial tree, a Fibonacci tree is used. The animation shows the execution of such an algorithm in a full-duplex communication model. The first frame shows the Fibonacci tree which describes the communication links; afterwards, blue arrows indicate the transmission of elements. A processing node is depicted by three neighboring boxes with elements inside and receives elements from its two children in turn (assuming there are children with valid values). The runtime is determined by the height h of the tree, which is about log_φ p, where φ is the golden ratio.
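To see the tradeoff numerically, one can plug sample parameters into the two BSP runtime estimates, (T_start + n·T_byte)·⌈log₂ p⌉ for the binomial tree and (T_start + (n/m)·T_byte)·(p + m − 2) for the pipeline with its near-optimal piece count m. The numbers below are arbitrary illustrations, not measurements:

```python
from math import ceil, log2, sqrt

T_start, T_byte = 100.0, 1.0       # assumed startup cost and per-unit send cost
p, n = 64, 4096                    # processors and total vector size

tree = (T_start + n * T_byte) * ceil(log2(p))
m = round(sqrt(n * (p - 2) * T_byte / T_start))   # near-optimal piece count
pipe = (T_start + n / m * T_byte) * (p + m - 2)

print(f"binomial tree: {tree:.0f}")        # 25176
print(f"pipeline (m={m}): {pipe:.0f}")     # 20375
```

With a larger T_start the pipeline's many small messages become expensive and the binomial tree wins; with a small T_start the pipeline's better bandwidth utilization dominates, which is exactly the gap the combined Fibonacci-tree algorithm targets.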

Applications

Reduction is one of the main collective operations implemented in the Message Passing Interface, where the performance of the algorithm used is important and is constantly evaluated for different use cases.[7]

MapReduce relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters.[8][9]

Some parallel sorting algorithms use reductions to be able to handle very big data sets.[10]

  1. ^ Bar-Noy, Amotz; Kipnis, Shlomo. "Broadcasting multiple messages in simultaneous send/receive systems". Discrete Applied Mathematics. 55 (2): 95–105. doi:10.1016/0166-218x(94)90001-9.
  2. ^ Santos, Eunice E. "Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines". Journal of Parallel and Distributed Computing. 62 (4): 517–543. doi:10.1006/jpdc.2000.1698.
  3. ^ Slater, P.; Cockayne, E.; Hedetniemi, S. (1981-11-01). "Information Dissemination in Trees". SIAM Journal on Computing. 10 (4): 692–701. doi:10.1137/0210052. ISSN 0097-5397.
  4. ^ Rabenseifner, Rolf; Träff, Jesper Larsson (2004-09-19). "More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems". Recent Advances in Parallel Virtual Machine and Message Passing Interface. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg: 36–46. doi:10.1007/978-3-540-30218-6_13. ISBN 9783540231639.
  5. ^ Bar-Noy, A.; Kipnis, S. (1994-09-01). "Designing broadcasting algorithms in the postal model for message-passing systems". Mathematical systems theory. 27 (5): 431–452. doi:10.1007/BF01184933. ISSN 0025-5661.
  6. ^ Sanders, Peter; Sibeyn, Jop F. "A bandwidth latency tradeoff for broadcast and reduction". Information Processing Letters. 86 (1): 33–38. doi:10.1016/s0020-0190(02)00473-8.
  7. ^ Pješivac-Grbović, Jelena; Angskun, Thara; Bosilca, George; Fagg, Graham E.; Gabriel, Edgar; Dongarra, Jack J. (2007-06-01). "Performance analysis of MPI collective operations". Cluster Computing. 10 (2): 127–143. doi:10.1007/s10586-007-0012-0. ISSN 1386-7857.
  8. ^ Lämmel, Ralf. "Google's MapReduce programming model — Revisited". Science of Computer Programming. 70 (1): 1–30. doi:10.1016/j.scico.2007.07.001.
  9. ^ Senger, Hermes; Gil-Costa, Veronica; Arantes, Luciana; Marcondes, Cesar A. C.; Marín, Mauricio; Sato, Liria M.; da Silva, Fabrício A.B. (2016-06-10). "BSP cost and scalability analysis for MapReduce operations". Concurrency and Computation: Practice and Experience. 28 (8): 2503–2527. doi:10.1002/cpe.3628. ISSN 1532-0634.
  10. ^ Axtmann, Michael; Bingmann, Timo; Sanders, Peter; Schulz, Christian (2014-10-24). "Practical Massively Parallel Sorting". arXiv:1410.6754 [cs].