 
=== Matrix multiplication ===
<ref>{{Cite book |last1=Jia-Wei |first1=Hong |last2=Kung |first2=H. T. |title=Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81 |chapter=I/O complexity: The red-blue pebble game |date=1981 |pages=326–333 |chapter-url=http://dx.doi.org/10.1145/800076.802486 |___location=New York, New York, USA |publisher=ACM Press |doi=10.1145/800076.802486 |s2cid=8410593 }}</ref> Corollary 6.2:
 
{{Math theorem
}}
 
More general results for other numerical linear algebra operations can be found in Ballard et al.<ref>{{Cite journal |last1=Ballard |first1=G. |last2=Carson |first2=E. |last3=Demmel |first3=J. |last4=Hoemmen |first4=M. |last5=Knight |first5=N. |last6=Schwartz |first6=O. |date=May 2014 |title=Communication lower bounds and optimal algorithms for numerical linear algebra |url=http://dx.doi.org/10.1017/s0962492914000038 |journal=Acta Numerica |volume=23 |pages=1–155 |doi=10.1017/s0962492914000038 |s2cid=122513943 |issn=0962-4929 |url-access=subscription }}</ref> The following proof is from Demmel and Dinh.<ref>{{Cite arXiv |last1=Demmel |first1=James |last2=Dinh |first2=Grace |date=2018-04-24 |title=Communication-Optimal Convolutional Neural Nets |class=cs.DS |eprint=1802.06905 }}</ref>
 
{{Math proof|title=Proof|proof=
* Measure of computation = Time per [[FLOP]] = γ
* Measure of communication = No. of words of data moved = β
⇒ Total running time = γ·(no. of [[FLOP]]s) + β·(no. of words)
 
Since ''β'' >> ''γ'' as measured in time and energy, communication cost dominates computation cost. Technological trends<ref name="DARPA_2008"/> indicate that the relative cost of communication is increasing on a variety of platforms, from [[cloud computing]] to [[supercomputers]] to mobile devices. The report also predicts that the gap between [[DRAM]] access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM.<ref name="Demmel_2012"/>
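To make the cost model concrete, here is a small worked example in Python. The values of γ and β are hypothetical, chosen only so that ''β'' >> ''γ''; real values are machine-dependent.

```python
# Cost model: total time = gamma * (#FLOPs) + beta * (#words moved).
# gamma and beta below are illustrative assumptions, not measured values.
gamma = 1e-9          # seconds per FLOP (assumed)
beta = 1e-7           # seconds per word moved (assumed, 100x gamma)

n = 1000
flops = 2 * n**3              # multiply-adds in a dense n-by-n matrix multiply
words = n**3 + 3 * n**2       # slow-memory traffic of a naive triple loop

time_compute = gamma * flops  # 2.0 seconds
time_comm = beta * words      # about 100.3 seconds: communication dominates
```

Even with only a 100× gap between β and γ, the communication term is roughly fifty times larger than the computation term in this example.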
 
 for i = 1 to n
   {read row i of A into fast memory} - n<sup>2</sup> reads
   for j = 1 to n
     {read C(i,j) into fast memory} - n<sup>2</sup> reads
     {read column j of B into fast memory} - n<sup>3</sup> reads
     for k = 1 to n
       C(i,j) = C(i,j) + A(i,k) * B(k,j)
     {write C(i,j) back to slow memory} - n<sup>2</sup> writes
 
Here, fast memory may be defined as the local processor memory ([[CPU cache]]) of size ''M'', and slow memory may be defined as the DRAM.
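The naive algorithm above can be sketched as runnable Python, instrumented to count the words moved between slow and fast memory. The counters are an illustrative addition, not part of the original pseudocode.

```python
def naive_matmul(A, B):
    """Naive triple-loop multiply, counting slow-memory traffic in words."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    reads = writes = 0
    for i in range(n):
        reads += n              # read row i of A: n words
        for j in range(n):
            reads += 1          # read C(i,j): 1 word
            reads += n          # read column j of B: n words
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
            writes += 1         # write C(i,j) back: 1 word
    return C, reads, writes

n = 4
I = [[float(i == j) for j in range(n)] for i in range(n)]
C, reads, writes = naive_matmul(I, I)
# reads + writes = n^3 + 3n^2 = 112 words for n = 4
```

Summing the counters reproduces the ''n''<sup>3</sup> + 3''n''<sup>2</sup> total stated below the pseudocode.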
Communication cost (reads/writes): ''n''<sup>3</sup> + 3''n''<sup>2</sup> or O(''n''<sup>3</sup>)
 
Since total running time = ''γ''·O(''n''<sup>3</sup>) + ''β''·O(''n''<sup>3</sup>) and ''β'' >> ''γ'', the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm<ref name="Demmel_2012"/> reduces this dominant term:
 
==== Blocked (tiled) matrix multiplication ====
 for i = 1 to n/b
   for j = 1 to n/b
     {read block C(i,j) into fast memory} - b<sup>2</sup> × (n/b)<sup>2</sup> = n<sup>2</sup> reads
     for k = 1 to n/b
       {read block A(i,k) into fast memory} - b<sup>2</sup> × (n/b)<sup>3</sup> = n<sup>3</sup>/b reads
       {read block B(k,j) into fast memory} - b<sup>2</sup> × (n/b)<sup>3</sup> = n<sup>3</sup>/b reads
       C(i,j) = C(i,j) + A(i,k) * B(k,j) - {do a matrix multiply on blocks}
     {write block C(i,j) back to slow memory} - b<sup>2</sup> × (n/b)<sup>2</sup> = n<sup>2</sup> writes
 
Communication cost: 2''n''<sup>3</sup>/''b'' + 2''n''<sup>2</sup> reads/writes << 2''n''<sup>3</sup> arithmetic cost
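The blocked algorithm can likewise be sketched in Python, counting each ''b''×''b'' block transfer as ''b''<sup>2</sup> words of slow-memory traffic; the counters are illustrative instrumentation added for this sketch.

```python
def blocked_matmul(A, B, b):
    """Blocked (tiled) multiply with block size b; n must be divisible by b.
    Each b-by-b block transfer is counted as b*b words of traffic."""
    n = len(A)
    assert n % b == 0
    C = [[0.0] * n for _ in range(n)]
    reads = writes = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            reads += b * b                  # read block C(i,j)
            for k in range(0, n, b):
                reads += 2 * b * b          # read blocks A(i,k) and B(k,j)
                for ii in range(i, i + b):  # block multiply, entirely in fast memory
                    for jj in range(j, j + b):
                        for kk in range(k, k + b):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
            writes += b * b                 # write block C(i,j)
    return C, reads, writes

n, b = 8, 2
I = [[float(i == j) for j in range(n)] for i in range(n)]
C, reads, writes = blocked_matmul(I, I, b)
# traffic = 2n^3/b + 2n^2 = 640 words, versus n^3 + 3n^2 = 704 for the naive loop
```

Even at this tiny size the dominant ''n''<sup>3</sup> term is cut by the factor ''b''; choosing ''b'' so that three blocks fit in fast memory of size ''M'' gives the usual ''b'' ≈ √(''M''/3).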
<ref name="DARPA_2008">Bergman, Keren, et al. "[http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Articles/ExaScale%20Computing/TR-2008-13.pdf Exascale computing study: Technology challenges in exascale computing systems]." [[Defense Advanced Research Projects Agency]] [[Information Processing Techniques Office]] (DARPA IPTO), Tech. Rep 15 (2008).</ref>
<ref name="Shalf_2010">Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.</ref>
<ref name="Frigo_1999">M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms", In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999. IEEE Computer Society.</ref>
<ref name="Toledo_1997">S. Toledo, "[https://web.archive.org/web/20180102191420/https://pdfs.semanticscholar.org/d198/43912d46f6a25de815eadb1fb43d5ca6f61c.pdf Locality of reference in LU Decomposition with partial pivoting]," SIAM J. Matrix Anal. Appl., vol. 18, no. 4, 1997.</ref>
<ref name="Gustavson_1997">F. Gustavson, "Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms," IBM Journal of Research and Development, vol. 41, no. 6, pp. 737–755, 1997.</ref>
<ref name="Elmroth_2004">E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom, "[http://www.csc.kth.se/utbildning/kth/kurser/2D1253/matalg06/SIR000003.pdf Recursive blocked algorithms and hybrid data structures for dense matrix library software]," SIAM Review, vol. 46, no. 1, pp. 3–45, 2004.</ref>
<ref name="Grigori_2014">[[Laura Grigori|Grigori, Laura]]. "[http://www.lifl.fr/jncf2014/files/lecture-notes/grigori.pdf Introduction to communication avoiding linear algebra algorithms in high performance computing]."</ref>
}}