The reduction operationReduce is a collective communication primitive used in the context of a [[parallel programming model]] to combine multiple vectors into one, using an [[Operator associativity|associative]] [[Binary operation|binary operator]] <math>\oplus</math>. Every vector is present at a distinct processor in the beginning,. theThe goal of the primitive is to apply the operator in the order given by the processor-indices to the vectors until only one is left. The reduction of sets of elements is an integral part of programming models such as [[MapReduce|Map Reduce]], where a [[Function (mathematics)|function]] is applied ([[Map (higher-order function)|mapped]]) to all elements before they are reduced. Other [[Parallel algorithm|parallel algorithms]] use the reduction operationreduce as a primary operation to solve more complex problems. The [[Message Passing Interface]] implements it in the operations <code>MPI_Reduce</code> and <code>MPI_Allreduce</code>, with the difference that the result is available at one (root) processing unit or all of them. Closely related to the reductionreduce is the broadcast operation, which distributes data to all processors. Many reductionreduce algorithms can be used for broadcasting by reverting them and omitting the operator.
== Definition ==
Formally, the reductionreduce takes an [[Associative property|associative]] (but not necessarily [[Commutative property|commutative]]) operator <math>\oplus</math>, which can be evaluated in constant time and an input set <math>V = \{v_0 = \begin{pmatrix} e_0^0 \\ \vdots \\ e_0^{m-1}\end{pmatrix}, v_1 = \begin{pmatrix} e_1^0 \\ \vdots \\ e_1^{m-1}\end{pmatrix}, \dots, v_{p-1} = \begin{pmatrix} e_{p-1}^0 \\ \vdots \\ e_{p-1}^{m-1}\end{pmatrix}\}</math>of <math>p</math> vectors with <math>m</math> elements each. The total size of a vector is defined as <math>n</math>. The result <math>r</math> of the operation is the combination of the elements <math>r = \begin{pmatrix} e_0^0 \oplus e_1^0 \oplus \dots \oplus e_{p-1}^0 \\ \vdots \\ e_0^{m-1} \oplus e_1^{m-1} \oplus \dots \oplus e_{p-1}^{m-1}\end{pmatrix} = \begin{pmatrix} \bigoplus_{i=0}^{p-1} e_i^0 \\ \vdots \\ \bigoplus_{i=0}^{p-1} e_i^{m-1} \end{pmatrix}</math> and has to be stored at a specified root processor at the end of the execution. For example, the result of a reduction on the set <math>\{3,5,7,9\}</math>, where all vectors have size one is <math>3 + 5 + 7 + 9 = 24</math>. If the result <math>r</math> has to be available at every processor after the computation has finished, it is often called Allreduce. An optimal sequential linear-time algorithm for reduction can apply the operator successively from front to back, always replacing two vectors with the result of the operation applied to all its elements, thus creating an instance that has one vector less. It needs <math>(p-1)\cdot m </math> steps until only <math>r</math> is left. Sequential algorithms can not perform better than linear time, but parallel algorithms leave some space left to optimize.