It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at <math>p_{p-1}</math> at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step. The number of steps in the parallel execution is <math>p + m - 2</math>: it takes <math>p-1</math> steps until the last processing unit receives its first element and an additional <math>m-1</math> steps until all elements are received. Therefore the runtime in the BSP-model is <math>T(N,p,m) = \left(T_{start} + \frac{N}{m}T_{byte}\right)(p+m-2)</math>, assuming that <math>N</math> is the total byte-size of a vector. <math>m</math> is incorporated into the formula because it is possible to adapt the granularity in which the operator is computed. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and the last two elements, which are always transmitted and computed together. In this case, twice the volume is sent in each step, but the number of steps is roughly halved. Effectively, the algorithm then handles the example instance as an instance with vectors of size two. The runtime <math>T(m)</math> of this approach depends on the value of <math>m</math>, which can be optimized if <math>T_{start}</math> and <math>T_{byte}</math> are known. It is optimal for <math>m = \sqrt{\frac{N(p-2)T_{byte}}{T_{start}}}</math>.
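The runtime model and the optimal granularity can be sanity-checked with a short sketch. The machine parameters <math>T_{start}</math> and <math>T_{byte}</math> below are illustrative assumptions, not measured values; the analytic minimizer follows from setting the derivative of the runtime with respect to <math>m</math> to zero.

```python
import math

def runtime(p, m, N, t_start, t_byte):
    """BSP runtime of the pipelined reduction:
    p + m - 2 communication steps, each costing the start-up
    latency plus the transfer time of one piece of N/m bytes."""
    return (t_start + (N / m) * t_byte) * (p + m - 2)

def optimal_m(p, N, t_start, t_byte):
    """Analytic minimizer of runtime() over m:
    d/dm = t_start - N*(p-2)*t_byte / m**2 = 0
    =>  m = sqrt(N * (p-2) * t_byte / t_start)."""
    return math.sqrt(N * (p - 2) * t_byte / t_start)

# Illustrative (assumed) parameters: 5 processing units, 4 KiB vectors.
p, N = 5, 4096
t_start, t_byte = 1e-5, 1e-8

m_star = optimal_m(p, N, t_start, t_byte)
# Brute-force check over integer granularities:
best = min(range(1, N + 1), key=lambda m: runtime(p, m, N, t_start, t_byte))
print(round(m_star, 2), best)
```

The brute-force minimum lands at the integer nearest the analytic optimum, illustrating the trade-off: small <math>m</math> pays the per-piece transfer time <math>\frac{N}{m}T_{byte}</math> more often per step, while large <math>m</math> pays the start-up latency <math>T_{start}</math> over more steps.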