{{Use dmy dates|date=June 2020|cs1-dates=y}}
'''Communication-avoiding algorithms''' minimize the movement of data within a [[memory hierarchy]] in order to improve running time and energy consumption. They minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context, refers to moving data, either between levels of memory or between multiple processors over a network, and it is much more expensive than arithmetic.<ref name="Demmel_2012"/>
== Formal theory ==
=== Two-level memory model ===
A common computational model in analyzing communication-avoiding algorithms is the two-level memory model (a toy sketch of the model in code follows this list):
* There is one processor and two levels of memory.
* Level 1 memory is infinitely large. Level 0 memory ("cache") has size <math>M</math>.
* In the beginning, the input resides in level 1 memory; in the end, the output must reside in level 1 memory.
* The processor can only operate on data in the cache.
* The goal is to minimize data transfers between the two levels of memory.
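A toy sketch of this model (in Python; the class and method names are illustrative, not from any standard library) makes the accounting explicit: the processor may only touch words that currently sit in the size-<math>M</math> cache, and every word that crosses between the two levels is counted.

<syntaxhighlight lang="python">
class TwoLevelMemory:
    """Toy two-level memory model: a small cache (level 0) of capacity M
    words backed by an unbounded slow memory (level 1). Only the number
    of words moved between the two levels is tracked."""

    def __init__(self, M):
        self.M = M                # cache capacity in words
        self.cache = {}           # address -> value, at most M entries
        self.slow = {}            # unbounded level-1 memory
        self.words_moved = 0      # total traffic between the two levels

    def read(self, addr):
        """Bring one word into the cache (evicting another if full) and
        return its value; the processor only ever sees cached data."""
        if addr not in self.cache:
            if len(self.cache) >= self.M:
                victim, value = self.cache.popitem()
                self.slow[victim] = value     # write back the evicted word
                self.words_moved += 1
            self.cache[addr] = self.slow.get(addr, 0.0)
            self.words_moved += 1
        return self.cache[addr]

    def write(self, addr, value):
        """Update a word; it must first be brought into the cache."""
        self.read(addr)
        self.cache[addr] = value
</syntaxhighlight>

An algorithm run against such a model is charged <code>words_moved</code> as its communication cost; a complete sketch would also flush dirty cache entries back to slow memory at the end, so that the output resides in level 1 as the model requires.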
=== Matrix multiplication ===
The following lower bound on the communication cost of matrix multiplication is Corollary 6.2 in Hong and Kung:<ref>{{Cite book |last1=Jia-Wei |first1=Hong |last2=Kung |first2=H. T. |title=Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81 |chapter=I/O complexity |date=1981 |pages=326–333 |chapter-url=http://dx.doi.org/10.1145/800076.802486 |___location=New York, New York, USA |publisher=ACM Press |doi=10.1145/800076.802486 |s2cid=8410593}}</ref>
{{Math theorem
| name = Theorem
| note =
| math_statement = Given matrices <math>A, B, C</math> of sizes <math>n\times m, m \times k, n\times k</math> respectively, computing <math>AB + C</math> has communication complexity <math>\Omega(\max(mnk/M^{1/2},\ mn + mk + nk))</math>.
This lower bound is achievable by [[Communication-avoiding algorithm#Matrix multiplication example|tiling matrix multiplication]].
}}
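For example, for square matrices (<math>m = n = k</math>) the bound specializes to <math>\Omega(\max(n^3/M^{1/2},\ 3n^2))</math>; when the cache is much smaller than the matrices (<math>M \ll n^2</math>), the <math>n^3/M^{1/2}</math> term dominates.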
More general results for other numerical linear algebra operations can be found in Ballard et al.<ref>{{Cite journal |last1=Ballard |first1=G. |last2=Carson |first2=E. |last3=Demmel |first3=J. |last4=Hoemmen |first4=M. |last5=Knight |first5=N. |last6=Schwartz |first6=O. |date=May 2014 |title=Communication lower bounds and optimal algorithms for numerical linear algebra |url=http://dx.doi.org/10.1017/s0962492914000038 |journal=Acta Numerica |volume=23 |pages=1–155 |doi=10.1017/s0962492914000038 |s2cid=122513943 |issn=0962-4929 |url-access=subscription}}</ref> The following proof is from Demmel and Dinh.<ref>{{Cite arXiv |last1=Demmel |first1=James |last2=Dinh |first2=Grace |date=2018-04-24 |title=Communication-Optimal Convolutional Neural Nets |class=cs.DS |eprint=1802.06905}}</ref>
{{Math proof|title=Proof|proof=
We can draw the computation graph of <math>D = AB + C</math> as a cube of lattice points, where each point is of the form <math>(i,j,k)</math>. Since <math>D[i,k] = \sum_j A[i,j]B[j,k] + C[i,k]</math>, computing <math>AB+C</math> requires the processor to have access to each point within the cube at least once. So the problem becomes covering the <math>mnk</math> lattice points with a minimal amount of communication.
If <math>M</math> is large, then we can simply load all <math>mn + mk + nk</math> input entries into cache, compute, and write back the <math>nk</math> output entries. This case is uninteresting.
If <math>M</math> is small, then we can divide the execution of any algorithm into separate segments, such that during each segment it performs exactly <math>M</math> reads into cache and any number of writes from cache.
During each segment, the processor has access to at most <math>2M</math> different points from <math>A, B, C</math>: the at most <math>M</math> words already in cache at the start of the segment, plus the <math>M</math> words read during it.
Let <math>E</math> be the set of lattice points covered during this segment. Then by the [[Loomis–Whitney inequality]],
<math display="block">|E| \leq \sqrt{|\pi_1(E)||\pi_2(E)||\pi_3(E)|}</math>
with constraint <math>\sum_i |\pi_i(E)| \leq 2M</math>.
By the [[inequality of arithmetic and geometric means]], we have <math>|E| \leq \left(\frac{2M}{3}\right)^{3/2}</math>, with the extremum reached when <math>|\pi_i(E)| = \frac{2M}{3}</math> for each <math>i</math>.
Thus the arithmetic intensity (lattice points covered per word read) is bounded above by <math>CM^{1/2}</math>, where <math>C = (2/3)^{3/2}</math>, and so the communication is bounded below by <math>\frac{mnk}{CM^{1/2}}</math>.
Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound.
}}
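As an illustration that this bound is attained (assuming square <math>n \times n</math> matrices and a block size <math>b</math> small enough that three <math>b \times b</math> blocks fit in cache, for example <math>b = \sqrt{M/3}</math>), the tiled algorithm described below moves about <math>2n^3/b + 2n^2</math> words:
<math display="block">\frac{2n^3}{\sqrt{M/3}} + 2n^2 = 2\sqrt{3}\,\frac{n^3}{M^{1/2}} + 2n^2 = O\!\left(\frac{n^3}{M^{1/2}} + n^2\right),</math>
which matches the <math>\Omega(n^3/M^{1/2})</math> lower bound up to a constant factor.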
== Motivation ==
Consider the following running-time model:
* Measure of computation = Time per [[FLOP]] = γ
* Measure of communication = No. of words of data moved = β
⇒ Total running time = ''γ''·(no. of FLOPs) + ''β''·(no. of words moved)
Since ''β'' >> ''γ'' as measured in time and energy, communication cost dominates computation cost. Technological trends<ref name="DARPA_2008"/> indicate that the relative cost of communication is increasing on a variety of platforms, from [[cloud computing]] to [[supercomputers]] to mobile devices. The report also predicts that the gap between [[DRAM]] access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM.<ref name="Demmel_2012"/>
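To illustrate the running-time model above with purely hypothetical values (not taken from the cited reports): with ''γ'' = 10<sup>−11</sup> seconds per FLOP and ''β'' = 10<sup>−9</sup> seconds per word, an algorithm that performs 10<sup>9</sup> FLOPs while moving 10<sup>8</sup> words spends 0.01 s on arithmetic and 0.1 s on communication, so communication accounts for roughly 90% of its running time.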
[[File:Energy_cost_of_data_movement_in_2010_-_on_chip_vs_off_chip.png|thumb|right|Energy cost of data movement in 2010: On-chip vs Off-chip]]
Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.<ref name="Shalf_2010"/>
United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress.<ref name="Demmel_2012" />

== Objectives ==
=== Matrix multiplication example ===
Consider the simplest example of computing ''C'' = ''C'' + ''A'' * ''B'', where ''A'', ''B'', and ''C'' are ''n''-by-''n'' matrices stored in slow memory.

==== Naive matrix multiplication ====
 for i = 1 to n
     {read row i of A into fast memory}               - n² reads in total
     for j = 1 to n
         {read C(i,j) into fast memory}               - n² reads in total
         {read column j of B into fast memory}        - n³ reads in total
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n² writes in total
Here, fast memory may be defined as the local processor memory ([[CPU cache]]) of size ''M'', and slow memory as the DRAM.
Arithmetic cost (FLOPs): 2''n''<sup>3</sup> or O(''n''<sup>3</sup>)
Communication cost (reads/writes): ''n''<sup>3</sup> + 3''n''<sup>2</sup> or O(''n''<sup>3</sup>)
Since total running time = ''γ''·O(''n''<sup>3</sup>) + ''β''·O(''n''<sup>3</sup>) and ''β'' >> ''γ'', communication cost dominates the total running time.
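The count above can be reproduced with a short sketch (illustrative only; it tallies the reads and writes of the pseudocode rather than simulating an actual cache):

<syntaxhighlight lang="python">
def naive_matmul_traffic(n):
    """Words moved between slow and fast memory by the naive algorithm
    above (no arithmetic is performed, only the traffic is counted)."""
    reads = writes = 0
    for i in range(n):
        reads += n            # read row i of A        (n^2 reads in total)
        for j in range(n):
            reads += 1        # read C(i,j)            (n^2 reads in total)
            reads += n        # read column j of B     (n^3 reads in total)
            # the inner k-loop runs entirely in fast memory
            writes += 1       # write C(i,j) back      (n^2 writes in total)
    return reads + writes     # = n^3 + 3*n^2

assert naive_matmul_traffic(10) == 10**3 + 3 * 10**2
</syntaxhighlight>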
==== Blocked (tiled) matrix multiplication ====
Consider the same computation with each matrix divided into square ''b''-by-''b'' blocks, where ''b'' is chosen small enough that three blocks fit in fast memory simultaneously:
 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}           - n² reads in total
         for k = 1 to n/b
             {read block A(i,k) into fast memory}       - n³/b reads in total
             {read block B(k,j) into fast memory}       - n³/b reads in total
             C(i,j) = C(i,j) + A(i,k) * B(k,j)          - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}       - n² writes in total
Communication cost: 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup> reads/writes << 2''n''<sup>3</sup> arithmetic cost
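A corresponding tally for the blocked version (again illustrative; it assumes ''b'' divides ''n'' and that three ''b''-by-''b'' blocks fit in fast memory):

<syntaxhighlight lang="python">
def blocked_matmul_traffic(n, b):
    """Words moved by the blocked algorithm above, assuming b divides n
    and 3*b*b <= M so that three blocks fit in fast memory at once."""
    assert n % b == 0
    reads = writes = 0
    for i in range(n // b):
        for j in range(n // b):
            reads += b * b        # read block C(i,j)     (n^2 reads in total)
            for k in range(n // b):
                reads += b * b    # read block A(i,k)     (n^3/b reads in total)
                reads += b * b    # read block B(k,j)     (n^3/b reads in total)
                # the block multiply itself runs in fast memory
            writes += b * b       # write block C(i,j)    (n^2 writes in total)
    return reads + writes         # = 2*n**3/b + 2*n**2

assert blocked_matmul_traffic(12, 4) == 2 * 12**3 // 4 + 2 * 12**2
</syntaxhighlight>

Increasing ''b'', up to the largest block size for which three blocks still fit in fast memory, directly shrinks the dominant 2''n''<sup>3</sup>/''b'' term.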
== References ==
{{Reflist|refs=
<ref name="DARPA_2008">Bergman, Keren, et al. "[http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Articles/ExaScale%20Computing/TR-2008-13.pdf Exascale computing study: Technology challenges in exascale computing systems]." [[Defense Advanced Research Projects Agency]] [[Information Processing Techniques Office]] (DARPA IPTO), Tech. Rep 15 (2008).</ref>
<ref name="Shalf_2010">Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.</ref>
<ref name="Frigo_1999">M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran,
<ref name="Toledo_1997">S. Toledo,
<ref name="Gustavson_1997">F. Gustavson,
<ref name="Elmroth_2004">E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom,
<ref name="Grigori_2014">[[Laura Grigori|Grigori, Laura]]. "[http://www.lifl.fr/jncf2014/files/lecture-notes/grigori.pdf Introduction to communication avoiding linear algebra algorithms in high performance computing].</ref>
}}