== Definition ==
Formally, the reduction takes an associative (but not necessarily commutative) operator <math>\oplus</math>, which can be evaluated in constant time, and an input set <math>V = \{v_0 = \begin{pmatrix} e_0^0 \\ \vdots \\ e_0^{m-1}\end{pmatrix}, v_1 = \begin{pmatrix} e_1^0 \\ \vdots \\ e_1^{m-1}\end{pmatrix}, \dots, v_{p-1} = \begin{pmatrix} e_{p-1}^0 \\ \vdots \\ e_{p-1}^{m-1}\end{pmatrix}\}</math> of <math>p</math> vectors with <math>m</math> elements each. The total byte-size of a vector is denoted by <math>n</math>. The result <math>r</math> of the operation is the elementwise combination <math>r = \begin{pmatrix} e_0^0 \oplus e_1^0 \oplus \dots \oplus e_{p-1}^0 \\ \vdots \\ e_0^{m-1} \oplus e_1^{m-1} \oplus \dots \oplus e_{p-1}^{m-1}\end{pmatrix} = \begin{pmatrix} \bigoplus_{i=0}^{p-1} e_i^0 \\ \vdots \\ \bigoplus_{i=0}^{p-1} e_i^{m-1} \end{pmatrix}</math> and has to be stored at a specified root processor at the end of the execution. For example, the result of a reduction on the set <math>\{3,5,7,9\}</math>, where all vectors have size one, is <math>3 + 5 + 7 + 9 = 24</math>. If the result <math>r</math> has to be available at every processor after the computation has finished, the operation is often called Allreduce. An optimal sequential linear-time algorithm for reduction can apply the operator successively from front to back, always replacing two vectors with the result of the operation applied to all their elements, thus creating an instance that has one vector less. It needs <math>(p-1)\cdot m</math> steps until only <math>r</math> is left. Sequential algorithms cannot perform better than linear time, but parallel algorithms leave room for optimization.
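The sequential front-to-back reduction described above can be sketched in a few lines of Python; the function name and the choice of operator are illustrative only:

```python
from functools import reduce

def elementwise_reduce(vectors, op):
    """Sequentially combine p vectors of m elements each, elementwise.

    op is applied from front to back, so only associativity (not
    commutativity) is required; (p - 1) * m applications of op are
    performed in total.
    """
    return [reduce(op, column) for column in zip(*vectors)]

# Vectors of size one reduce to a plain sum: 3 + 5 + 7 + 9 = 24.
result = elementwise_reduce([[3], [5], [7], [9]], lambda a, b: a + b)
```

Because the folds run strictly left to right, the sketch also works for non-commutative operators such as string concatenation.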
== Binomial tree algorithms ==
==== Runtime analysis ====
A simple analysis for the algorithm uses the BSP-model and incorporates the time <math>T_{start}</math> needed to initiate communication and <math>T_{byte}</math>, the time needed to send a byte. Then the resulting runtime is <math>\Theta((T_{start} + n \cdot T_{byte})\cdot \log(p))</math>, as the <math>m</math> elements of a vector are sent in each iteration and have size <math>n</math> in total.
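The logarithmic number of communication rounds can be checked with a small serial simulation of the binomial tree; this is an illustrative sketch of the communication pattern, not MPI code:

```python
def binomial_tree_reduce(values, op):
    """Simulate a binomial-tree reduction of p values toward rank 0.

    In round k (k = 0, 1, ...), every rank that is a multiple of
    2^(k+1) receives the partial result of rank + 2^k and folds it in.
    All receives within one round can happen concurrently, so
    ceil(log2(p)) parallel rounds suffice.
    """
    vals = list(values)
    p = len(vals)
    rounds = 0
    step = 1
    while step < p:
        for dst in range(0, p, 2 * step):
            src = dst + step
            if src < p:
                # Lower rank on the left preserves front-to-back order.
                vals[dst] = op(vals[dst], vals[src])
        step *= 2
        rounds += 1
    return vals[0], rounds  # result at the root, number of parallel rounds

res, rounds = binomial_tree_reduce([3, 5, 7, 9], lambda a, b: a + b)
```

Each round sends the whole vector, which is why the per-round cost in the formula above is <math>T_{start} + n \cdot T_{byte}</math>.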
== Pipeline-algorithm ==
:::: '''receive''' <math>x_{i-1}^{k+i-1}</math> '''from''' <math>p_{i-1}</math>
:::: <math>x_{i}^{k+i-1} \gets x_{i}^{k+i-1} \oplus x_{i-1}^{k+i-1}</math>
It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at <math>p_{p-1}</math> at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution is <math>p + m -2</math>: it takes <math>p-1</math> steps until the last processing unit receives its first element and an additional <math>m-1</math> steps until all elements are received. Therefore the runtime in the BSP-model is <math>T(n,p,m) = (T_{start} + \frac{n}{m}T_{byte})(p+m-2)</math>, assuming that <math>n</math> is the total byte-size of a vector.
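The step count <math>p + m - 2</math> can be verified with a small serial simulation of the chain <math>p_0, \dots, p_{p-1}</math>; the helper below is an illustrative sketch, with concurrent sends of one parallel step replayed sequentially:

```python
def pipeline_reduce(vectors, op):
    """Simulate the pipelined reduction along a chain of p units.

    In parallel step t, unit i forwards its partial result for element
    t - i to unit i + 1, which folds it into its own copy.  The result
    vector ends up at the last unit.
    """
    p, m = len(vectors), len(vectors[0])
    partial = [list(v) for v in vectors]
    steps = 0
    for t in range(p + m - 2):          # steps 0 .. p + m - 3
        for i in range(p - 1):          # concurrent sends of this step
            j = t - i                   # element index unit i forwards now
            if 0 <= j < m:
                # Left (lower-ranked) operand first keeps front-to-back order.
                partial[i + 1][j] = op(partial[i][j], partial[i + 1][j])
        steps += 1
    return partial[p - 1], steps
```

With five units and vectors of size four, as in the animation, the simulation takes exactly <math>5 + 4 - 2 = 7</math> parallel steps.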
Although <math>m</math> has a fixed value, it is possible to logically group elements of a vector together and reduce <math>m</math>. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is sent each step, but the number of steps has roughly halved. It means that the parameter <math>m</math> is halved, while the total byte-size <math>n</math> stays the same. The runtime <math>T(p)</math> for this approach depends on the value of <math>m</math>, which can be optimized if <math>T_{start}</math> and <math>T_{byte}</math> are known. It is optimal for <math>m = \sqrt{\frac{n \cdot (p-2)T_{byte}}{T_{start}}}</math>, assuming that this results in a smaller <math>m</math> that divides the original one.
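Plugging assumed machine parameters into the formula shows the trade-off directly; the concrete numbers below are illustrative assumptions, not measurements:

```python
import math

# Illustrative parameters (assumptions, not measured values):
T_start = 1e-6          # seconds to initiate a communication
T_byte = 1e-9           # seconds per byte
n = 1 << 20             # total byte-size of a vector
p = 64                  # number of processing units

def runtime(m):
    """BSP runtime of the pipeline: (T_start + (n/m) T_byte) (p + m - 2)."""
    return (T_start + (n / m) * T_byte) * (p + m - 2)

# Optimal granularity: balances start-up overhead against pipeline length.
m_opt = math.sqrt(n * (p - 2) * T_byte / T_start)
```

Treating <math>m</math> as continuous, <math>m_{opt}</math> is exactly where the derivative of the runtime vanishes, so any coarser or finer granularity is no faster.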
== Pipelined tree ==
=== Runtime ===
Each iteration of the algorithm takes at most time <math>p\frac{n}{m} \cdot T_{byte} + T_{start}</math>. The height of the tree factors into the time it needs to fill the pipeline, and for Fibonacci trees it is known to be about <math>h = \log_{\phi}p</math>, where <math>\phi = \frac{1 + \sqrt{5}}{2}</math> is the golden ratio. Once the pipeline is filled, all processors are active in each step. Because inner nodes have two children, they have to receive <math>2 \cdot m</math> elements. Therefore the runtime of the algorithm is <math>T(n,p,m) \approx (p\frac{n}{m} \cdot T_{byte} + T_{start})(h + 2 \cdot m - 2)</math>. It is minimal if the number of elements in a vector is chosen such that <math>m = \sqrt{\frac{n \cdot (h-3)T_{byte}}{3 \cdot T_{start}}}</math>, where <math>n</math> is the total number of elements.
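Both quantities in this analysis are easy to evaluate numerically; the machine parameters below are illustrative assumptions, not measurements:

```python
import math

phi = (1 + math.sqrt(5)) / 2            # golden ratio

# Illustrative parameters (assumptions, not measured values):
T_start = 1e-6                          # seconds to initiate a communication
T_byte = 1e-9                           # seconds per byte
n = 1 << 20                             # total size of a vector
p = 64                                  # number of processing units

h = math.log(p, phi)                    # approximate Fibonacci-tree height
m_opt = math.sqrt(n * (h - 3) * T_byte / (3 * T_start))
```

Since <math>\phi < 2</math>, a Fibonacci tree is somewhat taller than a perfectly balanced binary tree over the same <math>p</math>, which the computed <math>h > \log_2 p</math> reflects.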
== Applications ==