The <math>d</math>-dimensional [[hypercube]] is a network topology for parallel computers with <math>2^d</math> processing elements. The topology allows for an efficient implementation of some basic communication primitives such as [[Broadcasting_Broadcasting (networking)|Broadcast]], All-[[Reduce_Reduce (parallel_patternparallel pattern)|Reduce]], and [[Prefix sum]].<ref>Grama, A.(2003). Introduction to Parallel Computing. Addison Wesley; Auflage: 2 ed. {{ISBN: |978-0201648652}}.</ref> The processing elements are numbered from <math>0</math> tothrough <math>2^d - 1</math>. Each processing elementselement is then adjacent to processing elements whose numbers differ in exactlyone and only one bit. The algorithms described onin this page utilize this structure efficiently.
== Algorithm Outlineoutline ==
Most of the communication primitives presented in this article share a common template.<ref>Foster, I.(1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley; {{ISBN: |0201575949}}.</ref> Initially, each processing element possesses one message that must reach every other processing element during the course of the algorithm. The following pseudo code sketches the communication steps necessary. Hereby, '''Initialization''', '''Operation''', and '''Output''' are placeholders that depend on the given communication primitive (see next section).
'''Input:''' message <math>m</math>.
<math>y := i \text{ XOR } 2^k</math>
'''Send''' <math>s</math> '''to''' <math>y</math>
'''RecieveReceive''' <math>m</math> '''from''' <math>y</math>
'''Operation'''<math>(s, m)</math>
'''endfor'''
'''Output'''
Each processing element iterates over its neighbors (the expression <math>i \text{ XOR } 2^k</math> negates the <math>k</math>-th bit in <math>i</math>'s binary representation, therefore obtaining the numbers of its neighbors). DuringIn aneach iteration, each processing element exchanges a message with the neighbor and processes the received message afterwards. The processing operation depends on the communication primitive.
[[File:Hypergraph Communication Pattern.png|thumb|Algorithm outline applied to the <math>3</math>-dimensional hypercube. In the first step (before any communication), each processing element possesses one message (blue). Communication is marked red. After each step, the processing elements store the received message, but other operations are also possible.]]
== Communication Primitivesprimitives ==
=== PrefixsumPrefix sum ===
AtIn the beginngbeginning of a [[Prefix_sum|prefix sum]] operation, each processing unitelement <math>i</math> owns a message <math>m_i</math>. AtThe thegoal end each processing unit <math>i</math>is shouldto recievecompute <math>\bigoplus_{0 \le j \le i} m_j</math>, where <math>\oplus</math> is an associative operation. The following pseudo code describes the algorithmnalgorithm.
'''inputInput''': message <math>m_i</math> of processor <math>i</math>.
'''outputOutput''': prefixsumprefix sum <math>\bigoplus_{0 \le j \le i} m_j</math> of processor <math>i</math>.
<math>x := m_i</math>
<math>\sigma := m_i</math>
<math>y := i \text{ XOR } 2^k</math>
'''Send''' <math>\sigma</math> '''to''' <math>y</math>
'''RecieveReceive''' <math>m</math> '''from''' <math>y</math>
<math>\sigma := \sigma \oplus m</math>
'''if''' bit <math>k</math> in <math>i</math> is set '''then''' <math>x := x \oplus m</math>
'''endfor'''
HypercubesThe algorithm works as follows. Observe that hypercubes of dimension <math>d</math> can be split into two hypercubes of dimension <math>d - 1</math>. InRefer the followingto the sub cube ofcontaining allnodes verticeswith startinga withleading 0 willas be reffered to asthe 0-subcubesub cube and the sub cube consisting of all vertices startingnodes with a leading 1 as 1-subcubesub cube. Once both sub cubes have calculated the prefix sum, the sum over all elements in the 0-subcubesub cube has to be added to the every element in the 1-subcubesub cube, becausesince every processing element in the 0-subcubesub cube has a lower rank than the processing elements in the 1-subcubesub cube. InThe thepseudo implementationcode eachstores vertex not saves histhe prefix sum (in variable <math>x</math>) and the sum over all elementsnodes in hisa sub cube (in variable <math>\sigma</math>).
This makes it possible for all nodes in 1-sub cube to receive the sum over the 0-sub cube in every step.
BeiThis derresults Laufzeitin ergibta sichfactor ein Faktor vonof <math>\log p</math> fürfor <math>T_\text{start}</math> undand eina Faktorfactor vonof <math>n\log p</math> fürfor <math>T_\text{byte}</math>: <math>T(n,p) = (T_\text{start} + nT_\text{byte})\log p</math>.
[[File:Hypergraph Communication Steps for Prefix Sum.png|thumb|Example for a prefix sum calculation. Upper number: numbertentatetive prefix sum (variable <math>m_ix</math> that each processing element contributes to the prefix sum). Lower number: prefix sum (atover theall endelements ofin the computationsub cube (variable <math>\sigma</math>).]]
=== GossipAll-gather / Allall-Reducereduce ===
'''GossipAll-gather''' operations start with each processing element having a message <math>m_i</math>. AfterThe goal of the operation is finishedfor each processing unitelement to knowsknow the messages of all other processing elements, with messagei.e. <math>x := m_0 \cdot m_1 \dots m_p</math> where <math>\cdot</math> is concatenation. The operation can be implemented following the algorithm template.
'''inputInput''': message <math>x := m_i</math> at processing unit <math>i</math>.
'''outputOutput''': all messages <math>m_1 \cdot m_2 \dots m_p</math>.
<math>x := m_i</math>
'''for''' <math>0 \le k < d</math> '''do'''
'''endfor'''
With each iteration, the transferred message doubles in length. This leads to a run-timeruntime of <math>T(n,p) \approx \sum_{j=0}^{d-1}(T_\text{start} + n \cdot 2^jT_\text{byte})= \log(p) T_\text{start} + (p-1)nT_\text{byte}</math>.
The same principle can be applied to the '''All-Reduce''' operations, but instead of concatenating the messages, it performs ana reduction operation on the two messages. So it is a '''Reduce''' operation, where all processing units know the result. InCompared Hypercubesto a modifiednormal '''Gossip'''reduce reducesoperation thefollowed numberby ofa communicationsbroadcast, comparedAll-Reduce toin Reducehypercubes reduces the number of andcommunication Broadcaststeps.
=== All-to-Allall ===
Here every processing element has a unique message for all other processing elements.
'''inputInput:''' message <math>m_{ij}</math> at processing element <math>i</math> to processing element <math>j</math>.
'''for''' <math>d > k \geq 0</math> '''do'''
'''Receive''' from processing element <math>i \text{ XOR } 2^k</math>:
all messages for my <math>k</math>-dimensional sub cube
'''Send''' to processing element <math>i \text{ XOR } 2^k</math>:
all messages for hisits <math>k</math>-dimensional sub cube
'''endfor'''
With each iteration a messages comes closer to its destination by one dimension, if it hasn't arrived yet. SoHence, thereall onlymessages have reached their target after at most <math>d = \log{p}</math> steps needed. In every step, <math>p / 2</math> messages are sent.: Inin the first iteration, half of the messages aren't meant for the own sub cube. In every following step, the sub cube is only half the size as before, but in the previous step exactly the same number of messages arrived from another processing element.
This results in a run-time of <math>T(n,p) \approx \log{p} (T_\text{start} + \frac{p}{2}nT_\text{byte})</math>.
== ESBT-Broadcastbroadcast ==
The ESBT-broadcast (Edge-disjoint Spanning Binomial Tree) algorithm<ref>{{cite journal|last1=Johnsson|first1=S.L.|last2=Ho|first2=C.-T.|title=Optimum broadcasting and personalized communication in hypercubes|journal=IEEE Transactions on Computers|volume=38|issue=9|year=1989|pages=1249–1268|issn=001893400018-9340|doi=10.1109/12.29465}}</ref> is a pipelined broadcast algorithm with optimal runtime for clusters with hypercube network topology. The algorithm embeds <math>d</math> edge-disjoint binomial trees in the hypercube, such that each neighbor of processing element <math>0</math> is the root of a spanning binomial tree on <math>2^d - 1</math> nodes. To broadcast a message, the source node splits its message into <math>k</math> chunks of equal size and cyclically sends them to the roots of the binomial trees. Upon receiving a chunk, the binomial trees broadcast it.
=== Runtime ===
The runtime of this algorithm is as follows. In each step, the source node sends one of its <math>k</math> chunks to a binomial tree. Broadcasting the chunk within the binomial tree takes <math>d</math> steps. Thus, it takes <math>k</math> steps to distribute all chunks and additionally <math>d</math> steps until the last binomial tree broadcast has finished, resulting in <math>k + d</math> steps overall. Therefore, the runtime for a message of length <math>n</math> is <math>T(n, p, k) = \left(\frac{n}{k} T_\text{byte} + T_\text{start} \right) (k + d)</math>. With the optimal chunk size <math>k^* = \sqrt{\frac{nd \cdot T_\text{byte}}{T_\text{start}}}</math>, the optimal runtime of the algorithm is <math>T^*(n, p) = n \cdot T_\text{byte} + \log(p) \cdot T_\text{start} + \sqrt{n \log(p) \cdot T_\text{start} \cdot T_\text{byte}}</math>.
=== Construction of the Binomialbinomial Treestrees ===
[[File:HypergraphESBT.png|thumb|A <math>3</math>-dimensional hypercubes with three ESBT embedded.]]
This section describes how to construct the binomial trees systematically. First, construct a single binomial spanning tree von <math>2^d</math> nodes as follows. Number the nodes from <math>0</math> to <math>2^d - 1</math> and consider their binary representation. Then the children of each nodes are obtained by negating single leading zeroes. This results in a single binomial spanning tree. To obtain <math>d</math> edge-disjoint copies of the tree, translate and rotate the nodes: for the <math>k</math>-th copy of the tree, apply a XOR operation with <math>2^k</math> to each node. AfterwardsSubsequently, right -rotate all nodes by <math>k</math> digits. The resulting binomial trees are edge-disjoint and therefore, fulfill the requirements for the ESBT-broadcasting algorithm.
== ReferenzenSee also ==
* [[Hypercube internetwork topology]]
==References==
{{Reflist}}
[[Category:Parallel computing]]
|