=== Matrix multiplication ===
{{Math theorem
| name = Theorem | math_statement = Given matrices <math>A, B, C</math> of sizes <math>n\times m, m \times k, n\times k</math> respectively, computing <math>AB + C</math> has communication complexity <math>\Omega(\max(nmk/M^{1/2},\, nm+mk+nk))</math>.
This lower bound is achievable by [[Communication-avoiding algorithm#Matrix multiplication example|tiling matrix multiplication]].
More general results for other numerical linear algebra operations can be found in Ballard et al.<ref>{{Cite journal |last=Ballard |first=G. |last2=Carson |first2=E. |last3=Demmel |first3=J. |last4=Hoemmen |first4=M. |last5=Knight |first5=N. |last6=Schwartz |first6=O. |date=May 2014 |title=Communication lower bounds and optimal algorithms for numerical linear algebra |url=http://dx.doi.org/10.1017/s0962492914000038 |journal=Acta Numerica |volume=23 |pages=1–155 |doi=10.1017/s0962492914000038 |issn=0962-4929}}</ref>
}}
{{Math proof|title=Proof|proof=
We can draw the computation graph of <math>D = AB + C</math> as an <math>n \times m \times k</math> cube of lattice points, where each point is of the form <math>(i,j,k)</math>. Since <math>D[i,k] = \sum_j A[i,j]B[j,k] + C[i,k]</math>, computing <math>AB+C</math> requires the processor to have access to each point within the cube at least once. So the problem becomes covering the <math>nmk</math> lattice points with a minimal amount of communication.
Partition the computation into segments so that during each segment exactly <math>M</math> words are transferred between fast and slow memory. During each segment, the processor has access to at most <math>2M</math> different points from <math>A, B, C</math>.
Let <math>E</math> be the set of lattice points covered during this segment. Then by the [[Loomis–Whitney inequality]],
<math display="block">|E| \leq \sqrt{|\pi_1(E)||\pi_2(E)||\pi_3(E)|}</math>
Since <math>|\pi_1(E)| + |\pi_2(E)| + |\pi_3(E)| \leq 2M</math>, the [[inequality of arithmetic and geometric means]] gives <math>|E| \leq \left(\frac 23 M\right)^{3/2}</math>, with the extremum reached when <math>|\pi_1(E)| = |\pi_2(E)| = |\pi_3(E)| = \frac 23 M</math>.
Thus the arithmetic intensity is bounded above by <math>CM^{1/2}</math> where <math>C = (2/3)^{3/2}</math>, and so the communication is bounded below by <math>\frac{nmk}{CM^{1/2}}</math>.
Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound.
}}
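As an illustrative special case (not part of the proof above), take square matrices with <math>m = n = k</math>. When <math>n</math> is at least on the order of <math>M^{1/2}</math>, the first term of the bound dominates, so
<math display="block">\Omega\left(\max\left(\frac{n^3}{M^{1/2}},\, 3n^2\right)\right) = \Omega\left(\frac{n^3}{M^{1/2}}\right),</math>
and the blocked algorithm described below attains this, up to a constant factor, with block size <math>b = \Theta(M^{1/2})</math>, since its communication cost is <math>2n^3/b + 2n^2</math>.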
* Measure of computation = Time per [[FLOP]] = γ
* Measure of communication = No. of words of data moved = β
⇒ Total running time = γ·(no. of FLOPs) + β·(no. of words moved)
Since ''β'' >> ''γ'' as measured in both time and energy, communication cost dominates computation cost. Technological trends<ref name="DARPA_2008"/> indicate that the relative cost of communication is increasing on a variety of platforms, from [[cloud computing]] to [[supercomputers]] to mobile devices. The report also predicts that the gap between [[DRAM]] access time and FLOPs will increase 100× over the coming decade in order to balance power usage between processors and DRAM.<ref name="Demmel_2012"/>
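The following C sketch evaluates the cost model above for the naive and blocked matrix-multiplication variants analyzed later in this article; the values of ''γ'', ''β'', ''n'' and ''b'' are illustrative assumptions, not measurements.

 #include <stdio.h>
 /* Two-parameter cost model: running time = gamma * flops + beta * words. */
 static double model_time(double flops, double words, double gamma, double beta) {
     return gamma * flops + beta * words;
 }
 int main(void) {
     double n = 4096.0, b = 64.0;                  /* assumed matrix size and block size */
     double gamma = 1e-9, beta = 1e-7;             /* assumed values with beta >> gamma  */
     double flops = 2.0 * n * n * n;               /* arithmetic is the same in both cases */
     double words_naive   = n*n*n + 3.0*n*n;       /* words moved by the naive loop      */
     double words_blocked = 2.0*n*n*n/b + 2.0*n*n; /* words moved by the tiled loop      */
     printf("naive:   %.2f s\n", model_time(flops, words_naive, gamma, beta));
     printf("blocked: %.2f s\n", model_time(flops, words_blocked, gamma, beta));
     return 0;
 }

With these placeholder parameters the β-term dwarfs the γ-term for the naive loop, which is the point of the model: reducing the word count, not the FLOP count, is what shortens the running time.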
Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.<ref name="Shalf_2010"/>
United States President Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress.<ref name="Demmel_2012" />
== Objectives ==
 for i = 1 to n
     {read row i of A into fast memory}               - n² reads
     for j = 1 to n
         {read C(i,j) into fast memory}               - n² reads
         {read column j of B into fast memory}        - n³ reads
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n² writes
Fast memory may be defined as the local processor memory ([[CPU cache]]) of size M and slow memory may be defined as the DRAM.
Communication cost (reads/writes): ''n''<sup>3</sup> + 3''n''<sup>2</sup> or O(''n''<sup>3</sup>)
Since total running time = ''γ''·2''n''<sup>3</sup> + ''β''·(''n''<sup>3</sup> + 3''n''<sup>2</sup>) and ''β'' >> ''γ'', the communication cost dominates the arithmetic cost.
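For concreteness, a minimal C rendering of the loop above might look as follows; the row-major layout and the name <code>matmul_naive</code> are assumptions for illustration, not part of the pseudocode.

 /* Naive matrix multiplication C = C + A*B for n-by-n row-major matrices.
    When n*n words exceed the fast-memory size M, the column of B must be
    re-fetched from slow memory for every (i,j) pair, giving O(n^3) traffic. */
 void matmul_naive(int n, const double *A, const double *B, double *C) {
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++) {
             double cij = C[i*n + j];
             for (int k = 0; k < n; k++)
                 cij += A[i*n + k] * B[k*n + j];
             C[i*n + j] = cij;
         }
 }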
==== Blocked (tiled) matrix multiplication ====
 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}           - b² × (n/b)² = n² reads
         for k = 1 to n/b
             {read block A(i,k) into fast memory}       - b² × (n/b)³ = n³/b reads
             {read block B(k,j) into fast memory}       - b² × (n/b)³ = n³/b reads
             C(i,j) = C(i,j) + A(i,k) * B(k,j)          - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}       - b² × (n/b)² = n² writes
Communication cost: 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup> reads/writes << 2''n''<sup>3</sup> arithmetic cost
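A corresponding C sketch of the tiled loop is given below, assuming row-major storage, ''n'' divisible by ''b'', and a block size chosen so that three ''b''×''b'' blocks fit in fast memory (''b'' ≈ √(''M''/3)); these conventions are illustrative, not part of the pseudocode.

 /* Blocked (tiled) matrix multiplication C = C + A*B.
    The three innermost loops multiply one b-by-b block of A by one b-by-b
    block of B into a b-by-b block of C, so about 3*b*b words are live in
    fast memory at a time; choosing b near sqrt(M/3) keeps them resident. */
 void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
     for (int ii = 0; ii < n; ii += b)
         for (int jj = 0; jj < n; jj += b)
             for (int kk = 0; kk < n; kk += b)
                 for (int i = ii; i < ii + b; i++)
                     for (int j = jj; j < jj + b; j++) {
                         double cij = C[i*n + j];
                         for (int k = kk; k < kk + b; k++)
                             cij += A[i*n + k] * B[k*n + j];
                         C[i*n + j] = cij;
                     }
 }

Keeping the ''b''×''b'' block of C in fast memory across the inner k-loop is what reduces the traffic from order ''n''<sup>3</sup> to 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup>, matching the count above.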
<ref name="DARPA_2008">Bergman, Keren, et al. "[http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Articles/ExaScale%20Computing/TR-2008-13.pdf Exascale computing study: Technology challenges in exascale computing systems]." [[Defense Advanced Research Projects Agency]] [[Information Processing Techniques Office]] (DARPA IPTO), Tech. Rep 15 (2008).</ref>
<ref name="Shalf_2010">Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.</ref>
<ref name="Frigo_1999">M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran,
<ref name="Toledo_1997">S. Toledo,
<ref name="Gustavson_1997">F. Gustavson,
<ref name="Elmroth_2004">E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom,
<ref name="Grigori_2014">[[Laura Grigori|Grigori, Laura]]. "[http://www.lifl.fr/jncf2014/files/lecture-notes/grigori.pdf Introduction to communication avoiding linear algebra algorithms in high performance computing].</ref>
}}