Hypercube (communication pattern)

The $d$-dimensional hypercube is a network topology for parallel computers with $2^{d}$ processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce and Prefix sum.^[1] The processing elements are numbered from $0$ to $2^{d}-1$. Each processing element is adjacent to exactly those processing elements whose numbers differ from its own in exactly one bit. The algorithms described on this page exploit this structure.
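The adjacency rule can be illustrated with a minimal Python sketch. The helper name `neighbors` is a hypothetical choice for this example and does not come from the cited literature; it simply flips each of the $d$ bits of a node number in turn.

```python
def neighbors(i: int, d: int) -> list:
    """Return the d neighbors of node i in a d-dimensional hypercube."""
    # Flipping bit k of i yields the neighbor along dimension k.
    return [i ^ (1 << k) for k in range(d)]

print(neighbors(5, 3))  # [4, 7, 1]
```

Node 5 has the binary label 101, so its neighbors are 100, 111 and 001.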

Algorithm Outline

Most of the communication primitives presented in this article share a common template.^[2] Initially, each processing element possesses one message that must reach every other processing element during the course of the algorithm. The following pseudocode sketches the necessary communication steps. Initialization, Operation and Output are placeholders that depend on the given communication primitive (see next section).

Input: message  $m$ .
Output: depends on Initialization, Operation and Output.
Initialization
 $s:=m$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $s$  to  $y$ 
    Receive  $m$  from  $y$ 
    Operation $(s,m)$ 
endfor
Output

Each processing element iterates over its neighbors: the expression $i{\text{ XOR }}2^{k}$ flips the $k$-th bit in the binary representation of $i$ and thereby yields the number of the neighbor along dimension $k$. In each iteration, a processing element exchanges a message with that neighbor and then processes the received message. The processing operation depends on the communication primitive.
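The template can be tried out with a sequential simulation. The following is a minimal sketch, not a parallel implementation: the function name `simulate_template` and the choice of addition as Operation are assumptions made for this example. All processing elements are stepped inside one loop, and a snapshot of the old states stands in for the simultaneous send and receive.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")

def simulate_template(messages: List[S], d: int,
                      operation: Callable[[S, S], S]) -> List[S]:
    """Sequentially simulate the d exchange rounds on 2**d processing elements."""
    p = 1 << d
    assert len(messages) == p, "need exactly 2**d messages"
    s = list(messages)                    # Initialization: s := m
    for k in range(d):
        old = list(s)                     # snapshot: all exchanges of a round happen at once
        for i in range(p):
            y = i ^ (1 << k)              # neighbor along dimension k
            s[i] = operation(old[i], old[y])  # Operation(s, m), with m received from y
    return s

# Hypothetical choice of Operation: addition, so every element ends with the total.
print(simulate_template([1, 2, 3, 4], 2, lambda s, m: s + m))  # [10, 10, 10, 10]
```

After the first round the states are [3, 3, 7, 7]; the second round combines across the other dimension.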

Algorithm outline applied to the $3$-dimensional hypercube. In the first step (before any communication), each processing element possesses one message (blue). Communication is marked red. After each step, the processing elements store the received message, but other operations are also possible.

Communication Primitives

Prefix sum

At the beginning of a prefix sum operation, each processing unit $i$ owns a message $m_{i}$. At the end, each processing unit $i$ should hold $\bigoplus _{0\leq j\leq i}m_{j}$, where $\oplus$ is an associative operation. The following pseudocode describes the algorithm.

input: message  $m_{i}$  of processor  $i$ .
output: prefix sum  $\bigoplus _{0\leq j\leq i}m_{j}$  of processor  $i$ .
 $x:=m_{i}$  
 $\sigma :=m_{i}$ 
for  $0\leq k\leq d-1$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $\sigma$  to  $y$ 
    Receive  $m$  from  $y$ 
     $\sigma :=\sigma \oplus m$ 
    if bit  $k$  in  $i$  is set then  $x:=x\oplus m$ 
endfor

A hypercube of dimension $d$ can be split into two hypercubes of dimension $d-1$. In the following, the subcube of all nodes whose number in binary representation starts with 0 is called the 0-subcube; the remaining nodes form the 1-subcube. Once the prefix sum has been computed in both subcubes, the total sum over the 0-subcube still has to be added to every element of the 1-subcube, because by definition the processing units in the 0-subcube have a lower rank than those in the 1-subcube. For this reason each node stores, in addition to its prefix sum (variable $x$), the sum over all elements of its subcube (variable $\sigma$). In each step, every node in the 1-subcube can thus obtain the total sum over the 0-subcube.

The running time has a factor of $\log p$ for $T_{\text{start}}$ and a factor of $n\log p$ for $T_{\text{byte}}$: $T(n,p)=(T_{\text{start}}+nT_{\text{byte}})\log p$.
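The interplay of the two variables can be checked with a small sequential simulation. This is a sketch under stated assumptions: the function name `hypercube_prefix_sum` is hypothetical, the operation defaults to addition, and the operand order follows the pseudocode, which is only safe for commutative operations.

```python
from typing import Callable, List

def hypercube_prefix_sum(messages: List[int], d: int,
                         op: Callable[[int, int], int] = lambda a, b: a + b) -> List[int]:
    """Sequentially simulate the hypercube prefix sum on 2**d processing units."""
    p = 1 << d
    assert len(messages) == p, "need exactly 2**d messages"
    x = list(messages)       # prefix result of each unit within its current subcube
    sigma = list(messages)   # total over the current subcube of each unit
    for k in range(d):
        old_sigma = list(sigma)          # snapshot: exchanges of a round are simultaneous
        for i in range(p):
            m = old_sigma[i ^ (1 << k)]  # subcube total received from the partner
            sigma[i] = op(sigma[i], m)
            if i & (1 << k):             # partner's subcube has the lower ranks
                x[i] = op(x[i], m)
    return x

# Same setting as a typical illustration: each unit starts with its own number.
print(hypercube_prefix_sum(list(range(8)), 3))  # [0, 1, 3, 6, 10, 15, 21, 28]
```

For a non-commutative operation the received value would have to be placed on the correct side of the operands; this sketch does not attempt that.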

Example of a prefix sum computation. Each node starts with its own node number as its message, i.e. $m_{i}=i$. The upper row of a node shows $x$, the lower row $\sigma$. The operation is addition.

Gossip / All-Reduce

Gossip operations start with each processing unit having a message $m_{i}$. After the operation has finished, each processing unit knows the messages of all other processing units, i.e. it holds the message $x:=m_{0}\cdot m_{1}\cdots m_{p-1}$, where $\cdot$ denotes concatenation. The operation can be implemented following the algorithm template.

input: message  $m_{i}$  at processing unit $i$ .
output: all messages  $m_{0}\cdot m_{1}\cdots m_{p-1}$ .
 $x:=m_{i}$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $x$  to  $y$ 
    Receive  $x'$  from  $y$ 
     $x:=x\cdot x'$ 
endfor

With each iteration the transferred message doubles in length. This leads to a running time of $T(n,p)\approx \sum _{j=0}^{d-1}(T_{\text{start}}+n\cdot 2^{j}T_{\text{byte}})=\log(p)T_{\text{start}}+(p-1)nT_{\text{byte}}$.
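The doubling of the message length can be observed in a short sequential simulation. This is a sketch only: the name `hypercube_gossip` is hypothetical, and messages are modelled as Python lists so that concatenation and length are easy to inspect.

```python
from typing import List, Tuple

def hypercube_gossip(messages: List[str], d: int) -> Tuple[List[List[str]], List[int]]:
    """Simulate gossip; return final buffers and the message length sent per round."""
    p = 1 << d
    assert len(messages) == p, "need exactly 2**d messages"
    x = [[m] for m in messages]          # each unit starts with its own message
    sent_lengths = []
    for k in range(d):
        old = list(x)                    # snapshot: exchanges of a round are simultaneous
        sent_lengths.append(len(old[0])) # every unit sends a buffer of the same length
        # x := x . x' where x' is the buffer received from the partner
        x = [old[i] + old[i ^ (1 << k)] for i in range(p)]
    return x, sent_lengths

buffers, lengths = hypercube_gossip(["a", "b", "c", "d"], 2)
print(lengths)             # [1, 2]
print(sorted(buffers[3]))  # ['a', 'b', 'c', 'd']
```

The lengths 1, 2, ..., $2^{d-1}$ add up to $p-1$, which matches the $(p-1)nT_{\text{byte}}$ term of the running time.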

The same principle applies to the All-Reduce operation, but instead of concatenating the two messages, an operation is applied to them. It is thus a Reduce operation whose result is known to all processing units. On a hypercube, this modified Gossip needs fewer communication steps than a Reduce followed by a Broadcast.

All-to-All

In All-to-All communication, each processing unit has its own message for every other processing unit.

input: messages  $m_{ij}$  at processing unit  $i$  for processing unit  $j$ .
for  $d>k\geq 0$  do
   Receive from processing unit  $i{\text{ XOR }}2^{k}$ :
       all messages for my  $k$ -dimensional subcube
   Send to processing unit  $i{\text{ XOR }}2^{k}$ :
       all messages for its  $k$ -dimensional subcube
endfor

With each iteration, a message that has not yet reached its destination gets one dimension closer to it. Hence at most $d=\log {p}$ steps are needed. In each step, $p/2$ messages are sent: in the first step, exactly half of the messages are not destined for the unit's own subcube. In every following step the subcube is only half as large as before, but in the preceding step the same number of messages destined for this subcube was received from another processing unit.

Overall this results in a running time of $T(n,p)\approx \log {p}(T_{\text{start}}+{\frac {p}{2}}nT_{\text{byte}})$.
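The routing scheme can be sketched as a sequential simulation that tracks where each message currently resides. The function name `hypercube_all_to_all` and the representation of a message as a `(source, destination)` pair are assumptions made for this example; for simplicity each unit also keeps a message addressed to itself.

```python
from typing import List, Tuple

Message = Tuple[int, int]  # (source, destination)

def hypercube_all_to_all(d: int) -> Tuple[List[List[Message]], List[int]]:
    """Simulate All-to-All; return final buffers and messages sent per unit per round."""
    p = 1 << d
    buffers = [[(i, j) for j in range(p)] for i in range(p)]
    sent_per_round = []
    for k in reversed(range(d)):         # dimensions d-1 down to 0
        bit = 1 << k
        outgoing = []
        for i in range(p):
            # Messages whose destination differs from i in bit k belong to the partner's subcube.
            out = [msg for msg in buffers[i] if (msg[1] ^ i) & bit]
            buffers[i] = [msg for msg in buffers[i] if not ((msg[1] ^ i) & bit)]
            outgoing.append(out)
        for i in range(p):
            buffers[i].extend(outgoing[i ^ bit])   # receive from the partner
        sent_per_round.append(len(outgoing[0]))
    return buffers, sent_per_round

buffers, sent = hypercube_all_to_all(3)
print(sent)                # [4, 4, 4]
print(sorted(buffers[5]))  # every source's message for unit 5
```

In this simulation every unit forwards exactly $p/2$ messages in each of the $d$ rounds, in line with the running time stated above.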

ESBT-Broadcast

The ESBT broadcast (Edge-disjoint Spanning Binomial Tree) algorithm^[3] is a time-optimal broadcast for clusters with hypercube network topology. Starting from the source (in the following, processing unit $0$), the network is divided into $d$ edge-disjoint binomial trees such that each neighbor of the source is the root of a binomial tree with $2^{d}-1$ nodes. The source splits its message into $k$ chunks, which are distributed cyclically to the roots of the binomial trees. Each binomial tree then performs a broadcast.

If the source sends out one chunk per step, it has distributed all chunks after $k$ steps. A broadcast within a binomial tree takes $d$ steps. In total, $k+d$ steps are therefore needed until the broadcast of the last chunk has completed, and the running time is $T(n,p,k)=\left({\frac {n}{k}}T_{\text{byte}}+T_{\text{start}}\right)(k+d)$. The optimal $k^{*}={\sqrt {\frac {nd\cdot T_{\text{byte}}}{T_{\text{start}}}}}$ minimizes the running time to $T^{*}(n,p)=n\cdot T_{\text{byte}}+\log(p)\cdot T_{\text{start}}+2{\sqrt {n\log(p)\cdot T_{\text{start}}\cdot T_{\text{byte}}}}$.
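The choice of $k$ can be checked numerically. The sketch below assumes the cost model of a per-step cost $\frac{n}{k}T_{\text{byte}}+T_{\text{start}}$ multiplied by the $k+d$ steps; expanding this product and setting its derivative with respect to $k$ to zero gives $k^{*}=\sqrt{nd\,T_{\text{byte}}/T_{\text{start}}}$ and the minimum $nT_{\text{byte}}+dT_{\text{start}}+2\sqrt{nd\,T_{\text{start}}T_{\text{byte}}}$. The function names and the machine parameters are hypothetical and chosen only for illustration.

```python
import math

def esbt_time(n: float, d: int, k: float, t_start: float, t_byte: float) -> float:
    """Running time with k chunks: per-step cost times the k + d steps."""
    return (n / k * t_byte + t_start) * (k + d)

def optimal_chunks(n: float, d: int, t_start: float, t_byte: float) -> float:
    """Real-valued minimizer of esbt_time; in practice it would be rounded to an integer."""
    return math.sqrt(n * d * t_byte / t_start)

# Hypothetical parameters: message size n, dimension d, startup and per-byte costs.
n, d, t_start, t_byte = 1_000_000, 10, 100.0, 1.0
k_star = optimal_chunks(n, d, t_start, t_byte)
closed_form = n * t_byte + d * t_start + 2 * math.sqrt(n * d * t_start * t_byte)
print(k_star)
print(esbt_time(n, d, k_star, t_start, t_byte), closed_form)
```

The two printed running times agree up to floating-point rounding, and moving $k$ away from $k^{*}$ in either direction increases the result.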

Construction of the Binomial Trees

A $3$-dimensional hypercube with three ESBTs embedded.

The $d$ binomial trees can be constructed systematically according to the following rule. First, a binomial tree with $2^{d}$ nodes is defined. Then $d$ edge-disjoint copies of this binomial tree are embedded into the hypercube by translation and rotation.

A single binomial tree has node $0$ as its root. The children of a node are obtained by negating the leading zeros in the binary representation of its node number, one at a time. The resulting graph is a binomial tree. The edge set of the $k$-th binomial tree in the hypercube is then obtained as follows: an XOR operation with $2^{k}$ is applied to every node, and the binary representation of the node number is subsequently shifted cyclically to the right by $k$ positions. The $d$ copies of the original binomial tree created in this way are edge-disjoint and therefore satisfy the requirements of the ESBT broadcast algorithm.
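The rule for the single binomial tree rooted at node $0$ can be sketched in a few lines. The sketch covers only this base tree, not the translated and rotated copies, and the function names are hypothetical. It reads "leading zeros" as the bit positions above the highest set bit of the $d$-bit node number.

```python
from math import comb
from typing import List

def binomial_tree_children(i: int, d: int) -> List[int]:
    """Children of node i: set one of the leading zeros of its d-bit label."""
    # i.bit_length() is the position just above the highest set bit (0 for i == 0).
    return [i | (1 << j) for j in range(i.bit_length(), d)]

def depth_histogram(d: int) -> List[int]:
    """Number of nodes on each level of the tree rooted at node 0."""
    counts = []
    frontier = [0]
    while frontier:
        counts.append(len(frontier))
        frontier = [c for v in frontier for c in binomial_tree_children(v, d)]
    return counts

print(binomial_tree_children(0, 3))  # [1, 2, 4]
print(depth_histogram(3))            # [1, 3, 3, 1]
```

The level sizes are the binomial coefficients $\binom{d}{j}$, which is the characteristic shape of a binomial tree with $2^{d}$ nodes.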

References

[1] Grama, A. (2003). Introduction to Parallel Computing. Addison Wesley; 2nd ed. ISBN 978-0201648652.
[2] Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley. ISBN 0201575949.
[3] Johnsson, S.L.; Ho, C.-T. (1989). "Optimum broadcasting and personalized communication in hypercubes". IEEE Transactions on Computers. 38 (9): 1249–1268. doi:10.1109/12.29465. ISSN 0018-9340.
