{{Use dmy dates|date=June 2020|cs1-dates=y}}
'''Communication-avoiding algorithms''' minimize the movement of data within a [[memory hierarchy]] in order to improve running time and energy consumption. They minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context, refers to moving data, either between levels of memory or between multiple processors over a network, and it is much more expensive than arithmetic.<ref name="Demmel_2012"/>
== Formal theory ==
=== Two-level memory model ===
A common computational model in analyzing communication-avoiding algorithms is the two-level memory model (a toy sketch of the model in code follows this list):
* There is one processor and two levels of memory.
* Level 1 memory is infinitely large. Level 0 memory ("cache") has size <math>M</math>.
* In the beginning, the input resides in level 1 memory; in the end, the output must reside in level 1 memory.
* The processor can only operate on data in the cache.
* The goal is to minimize data transfers between the two levels of memory.
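A toy sketch of this model (in Python; the class and method names are illustrative, not from any standard library) makes the accounting explicit: the processor may only touch words that currently sit in the size-<math>M</math> cache, and every word that crosses between the two levels is counted.

<syntaxhighlight lang="python">
class TwoLevelMemory:
    """Toy two-level memory model: a small cache (level 0) of capacity M
    words backed by an unbounded slow memory (level 1). Only the number
    of words moved between the two levels is tracked."""

    def __init__(self, M):
        self.M = M                # cache capacity in words
        self.cache = {}           # address -> value, at most M entries
        self.slow = {}            # unbounded level-1 memory
        self.words_moved = 0      # total traffic between the two levels

    def read(self, addr):
        """Bring one word into the cache (evicting another if full) and
        return its value; the processor only ever sees cached data."""
        if addr not in self.cache:
            if len(self.cache) >= self.M:
                victim, value = self.cache.popitem()
                self.slow[victim] = value     # write back the evicted word
                self.words_moved += 1
            self.cache[addr] = self.slow.get(addr, 0.0)
            self.words_moved += 1
        return self.cache[addr]

    def write(self, addr, value):
        """Update a word; it must first be brought into the cache."""
        self.read(addr)
        self.cache[addr] = value
</syntaxhighlight>

An algorithm run against such a model is charged <code>words_moved</code> as its communication cost; a complete sketch would also flush dirty cache entries back to slow memory at the end, so that the output resides in level 1 as the model requires.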
=== Matrix multiplication ===
The following lower bound on the communication cost of matrix multiplication is Corollary 6.2 in Hong and Kung:<ref>{{Cite book |last1=Jia-Wei |first1=Hong |last2=Kung |first2=H. T. |title=Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81 |chapter=I/O complexity |date=1981 |pages=326–333 |chapter-url=http://dx.doi.org/10.1145/800076.802486 |___location=New York, New York, USA |publisher=ACM Press |doi=10.1145/800076.802486 |s2cid=8410593}}</ref>
{{Math theorem
| name = Theorem
| note =
| math_statement = Given matrices <math>A, B, C</math> of sizes <math>n\times m, m \times k, n\times k</math> respectively, computing <math>AB + C</math> has communication complexity <math>\Omega(\max(mnk/M^{1/2},\ mn + mk + nk))</math>.
This lower bound is achievable by [[Communication-avoiding algorithm#Matrix multiplication example|tiling matrix multiplication]].
}}
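For example, for square matrices (<math>m = n = k</math>) the bound specializes to <math>\Omega(\max(n^3/M^{1/2},\ 3n^2))</math>; when the cache is much smaller than the matrices (<math>M \ll n^2</math>), the <math>n^3/M^{1/2}</math> term dominates.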
More general results for other numerical linear algebra operations can be found in Ballard et al.<ref>{{Cite journal |last1=Ballard |first1=G. |last2=Carson |first2=E. |last3=Demmel |first3=J. |last4=Hoemmen |first4=M. |last5=Knight |first5=N. |last6=Schwartz |first6=O. |date=May 2014 |title=Communication lower bounds and optimal algorithms for numerical linear algebra |url=http://dx.doi.org/10.1017/s0962492914000038 |journal=Acta Numerica |volume=23 |pages=1–155 |doi=10.1017/s0962492914000038 |s2cid=122513943 |issn=0962-4929 |url-access=subscription}}</ref> The following proof is from Demmel and Dinh.<ref>{{Cite arXiv |last1=Demmel |first1=James |last2=Dinh |first2=Grace |date=2018-04-24 |title=Communication-Optimal Convolutional Neural Nets |class=cs.DS |eprint=1802.06905}}</ref>
{{Math proof|title=Proof|proof=
We can draw the computation graph of <math>D = AB + C</math> as a cube of lattice points, where each point is of the form <math>(i,j,k)</math>. Since <math>D[i,k] = \sum_j A[i,j]B[j,k] + C[i,k]</math>, computing <math>AB+C</math> requires the processor to have access to each point within the cube at least once. So the problem becomes covering the <math>mnk</math> lattice points with a minimal amount of communication.
If <math>M</math> is large, then we can simply load all <math>mn + mk + nk</math> input entries into cache, compute, and write back the <math>nk</math> output entries. This case is uninteresting.
If <math>M</math> is small, then we can divide the execution of any algorithm into separate segments, such that during each segment it performs exactly <math>M</math> reads into cache and any number of writes from cache.
During each segment, the processor has access to at most <math>2M</math> different points from <math>A, B, C</math>: the at most <math>M</math> words already in cache at the start of the segment, plus the <math>M</math> words read during it.
Let <math>E</math> be the set of lattice points covered during this segment. Then by the [[Loomis–Whitney inequality]],
<math display="block">|E| \leq \sqrt{|\pi_1(E)||\pi_2(E)||\pi_3(E)|}</math>
with constraint <math>\sum_i |\pi_i(E)| \leq 2M</math>.
By the [[inequality of arithmetic and geometric means]], we have <math>|E| \leq \left(\frac{2M}{3}\right)^{3/2}</math>, with the extremum reached when <math>|\pi_i(E)| = \frac{2M}{3}</math> for each <math>i</math>.
Thus the arithmetic intensity (lattice points covered per word read) is bounded above by <math>CM^{1/2}</math>, where <math>C = (2/3)^{3/2}</math>, and so the communication is bounded below by <math>\frac{mnk}{CM^{1/2}}</math>.
Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound.
}}
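As an illustration that this bound is attained (assuming square <math>n \times n</math> matrices and a block size <math>b</math> small enough that three <math>b \times b</math> blocks fit in cache, for example <math>b = \sqrt{M/3}</math>), the tiled algorithm described below moves about <math>2n^3/b + 2n^2</math> words:
<math display="block">\frac{2n^3}{\sqrt{M/3}} + 2n^2 = 2\sqrt{3}\,\frac{n^3}{M^{1/2}} + 2n^2 = O\!\left(\frac{n^3}{M^{1/2}} + n^2\right),</math>
which matches the <math>\Omega(n^3/M^{1/2})</math> lower bound up to a constant factor.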
== Motivation ==
Consider the following running-time model:
* Measure of computation = Time per [[FLOP]] = γ
* Measure of communication = No. of words of data moved = β
⇒ Total running time = ''γ''·(no. of FLOPs) + ''β''·(no. of words moved)
Since ''β'' >> ''γ'' as measured in time and energy, communication cost dominates computation cost. Technological trends<ref name="DARPA_2008"/> indicate that the relative cost of communication is increasing on a variety of platforms, from [[cloud computing]] to [[supercomputers]] to mobile devices. The report also predicts that the gap between [[DRAM]] access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM.<ref name="Demmel_2012"/>
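To illustrate the running-time model above with purely hypothetical values (not taken from the cited reports): with ''γ'' = 10<sup>−11</sup> seconds per FLOP and ''β'' = 10<sup>−9</sup> seconds per word, an algorithm that performs 10<sup>9</sup> FLOPs while moving 10<sup>8</sup> words spends 0.01 s on arithmetic and 0.1 s on communication, so communication accounts for roughly 90% of its running time.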
[[File:Energy_cost_of_data_movement_in_2010_-_on_chip_vs_off_chip.png|thumb|right|Energy cost of data movement in 2010: On-chip vs Off-chip]]
Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.<ref name="Shalf_2010"/>
United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress.<ref name="Demmel_2012" />

== Objectives ==
=== Matrix multiplication example ===
Consider the simplest example of computing ''C'' = ''C'' + ''A'' * ''B'', where ''A'', ''B'', and ''C'' are ''n''-by-''n'' matrices stored in slow memory.

==== Naive matrix multiplication ====
 for i = 1 to n
     {read row i of A into fast memory}               - n² reads in total
     for j = 1 to n
         {read C(i,j) into fast memory}               - n² reads in total
         {read column j of B into fast memory}        - n³ reads in total
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n² writes in total
Here, fast memory may be defined as the local processor memory ([[CPU cache]]) of size ''M'', and slow memory as the DRAM.
Arithmetic cost (FLOPs): 2''n''<sup>3</sup> or O(''n''<sup>3</sup>)
Communication cost (reads/writes): ''n''<sup>3</sup> + 3''n''<sup>2</sup> or O(''n''<sup>3</sup>)
Since total running time = ''γ''·O(''n''<sup>3</sup>) + ''β''·O(''n''<sup>3</sup>) and ''β'' >> ''γ'', communication cost dominates the total running time.
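The count above can be reproduced with a short sketch (illustrative only; it tallies the reads and writes of the pseudocode rather than simulating an actual cache):

<syntaxhighlight lang="python">
def naive_matmul_traffic(n):
    """Words moved between slow and fast memory by the naive algorithm
    above (no arithmetic is performed, only the traffic is counted)."""
    reads = writes = 0
    for i in range(n):
        reads += n            # read row i of A        (n^2 reads in total)
        for j in range(n):
            reads += 1        # read C(i,j)            (n^2 reads in total)
            reads += n        # read column j of B     (n^3 reads in total)
            # the inner k-loop runs entirely in fast memory
            writes += 1       # write C(i,j) back      (n^2 writes in total)
    return reads + writes     # = n^3 + 3*n^2

assert naive_matmul_traffic(10) == 10**3 + 3 * 10**2
</syntaxhighlight>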
==== Blocked (tiled) matrix multiplication ====
Consider the same computation with each matrix divided into square ''b''-by-''b'' blocks, where ''b'' is chosen small enough that three blocks fit in fast memory simultaneously:
 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}           - n² reads in total
         for k = 1 to n/b
             {read block A(i,k) into fast memory}       - n³/b reads in total
             {read block B(k,j) into fast memory}       - n³/b reads in total
             C(i,j) = C(i,j) + A(i,k) * B(k,j)          - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}       - n² writes in total
Communication cost: 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup> reads/writes << 2''n''<sup>3</sup> arithmetic cost
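A corresponding tally for the blocked version (again illustrative; it assumes ''b'' divides ''n'' and that three ''b''-by-''b'' blocks fit in fast memory):

<syntaxhighlight lang="python">
def blocked_matmul_traffic(n, b):
    """Words moved by the blocked algorithm above, assuming b divides n
    and 3*b*b <= M so that three blocks fit in fast memory at once."""
    assert n % b == 0
    reads = writes = 0
    for i in range(n // b):
        for j in range(n // b):
            reads += b * b        # read block C(i,j)     (n^2 reads in total)
            for k in range(n // b):
                reads += b * b    # read block A(i,k)     (n^3/b reads in total)
                reads += b * b    # read block B(k,j)     (n^3/b reads in total)
                # the block multiply itself runs in fast memory
            writes += b * b       # write block C(i,j)    (n^2 writes in total)
    return reads + writes         # = 2*n**3/b + 2*n**2

assert blocked_matmul_traffic(12, 4) == 2 * 12**3 // 4 + 2 * 12**2
</syntaxhighlight>

Increasing ''b'', up to the largest block size for which three blocks still fit in fast memory, directly shrinks the dominant 2''n''<sup>3</sup>/''b'' term.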
== References ==
{{Reflist|refs=
<ref name="DARPA_2008">Bergman, Keren, et al. "[http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Articles/ExaScale%20Computing/TR-2008-13.pdf Exascale computing study: Technology challenges in exascale computing systems]." [[Defense Advanced Research Projects Agency]] [[Information Processing Techniques Office]] (DARPA IPTO), Tech. Rep 15 (2008).</ref>
<ref name="Shalf_2010">Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.</ref>
<ref name="Frigo_1999">M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran,
<ref name="Toledo_1997">S. Toledo,
<ref name="Gustavson_1997">F. Gustavson,
<ref name="Elmroth_2004">E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom,
<ref name="Grigori_2014">[[Laura Grigori|Grigori, Laura]]. "[http://www.lifl.fr/jncf2014/files/lecture-notes/grigori.pdf Introduction to communication avoiding linear algebra algorithms in high performance computing].</ref>
}}