Communication-avoiding algorithm

 
=== Matrix multiplication ===
<ref>{{Cite book |last1=Jia-Wei |first1=Hong |last2=Kung |first2=H. T. |title=Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81 |chapter=I/O complexity: The red-blue pebble game |date=1981 |pages=326–333 |chapter-url=http://dx.doi.org/10.1145/800076.802486 |___location=New York, New York, USA |publisher=ACM Press |doi=10.1145/800076.802486 |s2cid=8410593}}</ref> Corollary 6.2:
 
{{Math theorem
| name = Theorem
| note =
| math_statement = Given matrices <math>A, B, C</math> of sizes <math>n\times m, m \times k, n\times k</math> respectively, <math>AB + C</math> has communication complexity <math>\Omega(\max(mkn/M^{1/2}, mk+kn+mn))</math>.}}
 
This lower bound is achievable by [[Communication-avoiding algorithm#Matrix multiplication example|tiling matrix multiplication]].
 
More general results for other numerical linear algebra operations can be found in Ballard et al.<ref>{{Cite journal |last1=Ballard |first1=G. |last2=Carson |first2=E. |last3=Demmel |first3=J. |last4=Hoemmen |first4=M. |last5=Knight |first5=N. |last6=Schwartz |first6=O. |date=May 2014 |title=Communication lower bounds and optimal algorithms for numerical linear algebra |url=http://dx.doi.org/10.1017/s0962492914000038 |journal=Acta Numerica |volume=23 |pages=1–155 |doi=10.1017/s0962492914000038 |s2cid=122513943 |issn=0962-4929 |url-access=subscription}}</ref> The following proof is from Demmel and Dinh.<ref>{{Cite arXiv |last1=Demmel |first1=James |last2=Dinh |first2=Grace |date=2018-04-24 |title=Communication-Optimal Convolutional Neural Nets |class=cs.DS |eprint=1802.06905}}</ref>
 
{{Math proof|title=Proof|proof=
 
We can draw the computation graph of <math>D = AB + C</math> as a cube of lattice points, each of the form <math>(i,j,k)</math>. Since <math>D[i,k] = \sum_j A[i,j]B[j,k] + C[i,k]</math>, computing <math>AB+C</math> requires the processor to have access to each point within the cube at least once. So the problem becomes one of covering the <math>mnk</math> lattice points with a minimal amount of communication.
During each segment, the processor has access to at most <math>2M</math> different points from <math>A, B, C</math>.
 
Let <math>E</math> be the set of lattice points covered during this segment. Then by the [[Loomis–Whitney inequality]],
 
<math display="block">|E| \leq \sqrt{|\pi_1(E)||\pi_2(E)||\pi_3(E)|}</math>
with the constraint
<math display="block">\sum_i |\pi_i(E)| \leq 2M.</math>

By the [[inequality of arithmetic and geometric means]], we have
<math display="block">|E| \leq \left(\frac 23 M\right)^{3/2},</math>
with the extremum reached when <math>|\pi_i(E)| = \frac 23 M</math> for each <math>i</math>.
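Written out, the two displayed inequalities combine as follows (with <math>\pi_1(E), \pi_2(E), \pi_3(E)</math> the projections of <math>E</math> onto the three coordinate planes, as in the Loomis–Whitney inequality above):
<math display="block">|E| \leq \sqrt{|\pi_1(E)|\,|\pi_2(E)|\,|\pi_3(E)|} \leq \left(\frac{|\pi_1(E)|+|\pi_2(E)|+|\pi_3(E)|}{3}\right)^{3/2} \leq \left(\frac{2M}{3}\right)^{3/2}.</math>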
 
Thus the arithmetic intensity is bounded above by <math>CM^{1/2}</math> where <math>C = (2/3)^{3/2}</math>, and so the communication is bounded below by <math>\frac{nmk}{CM^{1/2}}</math>.
 
Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound.
}}
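For square matrices (<math>m = n = k</math>), the lower bound specializes to <math>\Omega(n^3/M^{1/2})</math>. Taking block size <math>b \approx \sqrt{M/3}</math>, so that three <math>b\times b</math> blocks fit in fast memory (a standard choice, used here only as an illustration), the tiled algorithm described below moves about
<math display="block">\frac{2n^3}{b} + 2n^2 \approx 2\sqrt{3}\,\frac{n^3}{M^{1/2}} + 2n^2</math>
words between slow and fast memory, matching the lower bound up to a constant factor.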
 
* Measure of computation = Time per [[FLOP]] = γ
* Measure of communication = No. of words of data moved = β
⇒ Total running time = γ·(no. of [[FLOP]]s) + β·(no. of words)
 
Since ''β'' >> ''γ'' as measured in time and energy, communication cost dominates computation cost. Technological trends<ref name="DARPA_2008"/> indicate that the relative cost of communication is increasing on a variety of platforms, from [[cloud computing]] to [[supercomputers]] to mobile devices. The report also predicts that the gap between [[DRAM]] access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM.<ref name="Demmel_2012"/>
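A minimal sketch of the two-term cost model in Python; the function name and the values of ''γ'' and ''β'' are illustrative placeholders, chosen only so that ''β'' >> ''γ'':

<syntaxhighlight lang="python">
# Two-term running-time model: time = gamma * flops + beta * words.
# gamma (time per FLOP) and beta (time per word moved) are placeholder
# values chosen only to illustrate the regime beta >> gamma.

def model_time(flops, words, gamma=1e-11, beta=1e-9):
    """Return the modeled running time and the fraction spent on communication."""
    compute = gamma * flops
    communicate = beta * words
    total = compute + communicate
    return total, communicate / total

if __name__ == "__main__":
    n = 4096
    flops = 2 * n ** 3           # arithmetic of dense n-by-n matrix multiplication
    words = n ** 3 + 3 * n ** 2  # slow-memory traffic of the naive algorithm (derived below)
    total, share = model_time(flops, words)
    print(f"modeled time: {total:.2f} s, communication share: {share:.1%}")
</syntaxhighlight>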
Energy consumption increases by orders of magnitude as one moves up the memory hierarchy.<ref name="Shalf_2010"/>
 
United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress:<ref name="Demmel_2012" /> {{blockquote|New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a [[floating-point arithmetic]] operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.}}
 
== Objectives ==
 
 for i = 1 to n
     {read row i of A into fast memory}               - n<sup>2</sup> reads
     for j = 1 to n
         {read C(i,j) into fast memory}               - n<sup>2</sup> reads
         {read column j of B into fast memory}        - n<sup>3</sup> reads
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n<sup>2</sup> writes
 
Fast memory may be defined as the local processor memory ([[CPU cache]]) of size M and slow memory may be defined as the DRAM.
Communication cost (reads/writes): ''n''<sup>3</sup> + 3''n''<sup>2</sup> or O(''n''<sup>3</sup>)
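This count can be verified with a short, purely illustrative sketch that tallies the reads and writes of the naive loop nest (the helper function is hypothetical, not part of any library):

<syntaxhighlight lang="python">
# Tally the slow-memory traffic (in words) of the naive algorithm above.
# Each row of A, each column of B, and each C(i,j) is transferred exactly
# as in the pseudocode, giving n^3 + 3n^2 reads and writes in total.

def naive_matmul_traffic(n):
    reads = writes = 0
    for i in range(n):
        reads += n              # read row i of A into fast memory
        for j in range(n):
            reads += 1          # read C(i,j) into fast memory
            reads += n          # read column j of B into fast memory
            # the inner k-loop runs entirely in fast memory
            writes += 1         # write C(i,j) back to slow memory
    return reads + writes

assert naive_matmul_traffic(64) == 64 ** 3 + 3 * 64 ** 2
</syntaxhighlight>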
 
Since total running time = ''γ''·O(''n''<sup>3</sup>) + ''β''·O(''n''<sup>3</sup>) and ''β'' >> ''γ'', the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm<ref name="Demmel_2012"/> reduces this dominant term:
 
==== Blocked (tiled) matrix multiplication ====
 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}         - b<sup>2</sup> × (n/b)<sup>2</sup> = n<sup>2</sup> reads
         for k = 1 to n/b
             {read block A(i,k) into fast memory}     - b<sup>2</sup> × (n/b)<sup>3</sup> = n<sup>3</sup>/b reads
             {read block B(k,j) into fast memory}     - b<sup>2</sup> × (n/b)<sup>3</sup> = n<sup>3</sup>/b reads
             C(i,j) = C(i,j) + A(i,k) * B(k,j)        - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}     - b<sup>2</sup> × (n/b)<sup>2</sup> = n<sup>2</sup> writes
 
Communication cost: 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup> reads/writes << 2''n''<sup>3</sup> arithmetic cost
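As with the naive version, this count can be checked with an illustrative sketch (assuming, for simplicity, that the block size ''b'' divides ''n'' evenly):

<syntaxhighlight lang="python">
# Tally the slow-memory traffic (in words) of the blocked algorithm above.
# Every block transfer moves b^2 words, so the totals come out to
# 2*n^3/b + 2*n^2 reads and writes, as stated.

def blocked_matmul_traffic(n, b):
    assert n % b == 0, "b is assumed to divide n evenly"
    reads = writes = 0
    for i in range(n // b):
        for j in range(n // b):
            reads += b * b          # read block C(i,j) into fast memory
            for k in range(n // b):
                reads += b * b      # read block A(i,k) into fast memory
                reads += b * b      # read block B(k,j) into fast memory
                # the block multiply-accumulate runs in fast memory
            writes += b * b         # write block C(i,j) back to slow memory
    return reads + writes

assert blocked_matmul_traffic(64, 8) == 2 * 64 ** 3 // 8 + 2 * 64 ** 2
</syntaxhighlight>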
<ref name="DARPA_2008">Bergman, Keren, et al. "[http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Articles/ExaScale%20Computing/TR-2008-13.pdf Exascale computing study: Technology challenges in exascale computing systems]." [[Defense Advanced Research Projects Agency]] [[Information Processing Techniques Office]] (DARPA IPTO), Tech. Rep 15 (2008).</ref>
<ref name="Shalf_2010">Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.</ref>
<ref name="Frigo_1999">M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, “Cacheoblivious"Cacheoblivious algorithms”algorithms", In FOCS ’99'99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999. IEEE Computer Society.</ref>
<ref name="Toledo_1997">S. Toledo, "[https://web.archive.org/web/20180102191420/https://pdfs.semanticscholar.org/d198/43912d46f6a25de815eadb1fb43d5ca6f61c.pdf Locality of reference in LU Decomposition with partial pivoting]," SIAM J. Matrix Anal. Appl., vol. 18, no. 4, 1997.</ref>
<ref name="Gustavson_1997">F. Gustavson, “Recursion"Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms," IBM Journal of Research and Development, vol. 41, no. 6, pp. 737–755, 1997.</ref>
<ref name="Elmroth_2004">E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom, "[http://www.csc.kth.se/utbildning/kth/kurser/2D1253/matalg06/SIR000003.pdf Recursive blocked algorithms and hybrid data structures for dense matrix library software]," SIAM Review, vol. 46, no. 1, pp. 3–45, 2004.</ref>
<ref name="Grigori_2014">[[Laura Grigori|Grigori, Laura]]. "[http://www.lifl.fr/jncf2014/files/lecture-notes/grigori.pdf Introduction to communication avoiding linear algebra algorithms in high performance computing].</ref>
}}