On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed-memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
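For illustration, a minimal Python sketch of the naïve three-nested-loop algorithm (the function name and list-of-lists representation are illustrative, not from a cited source):

<syntaxhighlight lang="python">
def naive_matmul(A, B):
    """Naive three-nested-loop multiplication of square matrices.

    When n**2 exceeds the fast-memory size, elements of A and B are
    re-fetched from slow memory on almost every use, so the data moved
    grows as Omega(n**3).
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
</syntaxhighlight>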
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |title=A cellular computer to implement the Kalman Filter Algorithm |date=14 July 1969 |type=Ph.D. |publisher=Montana State University |url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math|''O''(''n''<sup>3</sup>/{{sqrt|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite conference |last1=Hong |first1=Jia-Wei |last2=Kung |first2=H. T. |title=I/O complexity: The red-blue pebble game |book-title=Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81) |year=1981 |pages=326–333}}</ref>
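The blocked scheme can be written down as a tiled loop nest. The sketch below assumes square {{mvar|n}}-by-{{mvar|n}} matrices and a fast memory of {{mvar|M}} words; the block size {{math|''b'' {{=}} {{sqrt|''M''/3}}}} lets one block of each of the three matrices reside in fast memory at once (the function and parameter names are illustrative):

<syntaxhighlight lang="python">
import math

def blocked_matmul(A, B, M):
    """Tiled multiplication with b-by-b blocks, b = sqrt(M/3).

    Each block product runs entirely in a fast memory of size M, so the
    total data moved drops to O(n**3 / sqrt(M)) words.
    """
    n = len(A)
    b = max(1, math.isqrt(M // 3))  # block edge chosen so 3*b*b <= M
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # Multiply one pair of blocks; all operands fit in fast memory.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
</syntaxhighlight>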
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/>

This can be improved by the ''3D algorithm'', which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, it requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so uses a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen's algorithm to further reduce runtime.<ref name="Agarwal"/>

"2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference|last1=Solomonik|first1=Edgar|first2=James |last2=Demmel|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms|book-title=Proceedings of the 17th International Conference on Parallel Processing|year=2011|volume=Part II|pages=90–109 |doi=10.1007/978-3-642-23397-5_10 |isbn=978-3-642-23397-5 |url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web |last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|year=2013|arxiv=1304.1467|bibcode=2013arXiv1304.1467B|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014}}</ref>
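The communication pattern of the 2D algorithm can be made concrete with a single-process simulation, in which each block of a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} grid stands for the data held by one processor and each shift stands for a message to a neighbouring processor (in a real distributed run the shifts would be, e.g., MPI sends; the names and the use of NumPy are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def cannon_matmul(A, B, s):
    """Single-process sketch of Cannon's algorithm on an s-by-s grid.

    Block (i, j) plays the role of processor (i, j).  Each of the s
    rounds ships one b-by-b block of A and of B per processor, so each
    processor communicates O(n**2 / sqrt(p)) words in total (p = s*s).
    """
    n = A.shape[0]
    b = n // s  # assumes s divides n
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
    Cb = [[np.zeros((b, b)) for _ in range(s)] for _ in range(s)]
    # Initial skew: row i of A shifts left by i; column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % s] for j in range(s)] for i in range(s)]
    Bb = [[Bb[(i + j) % s][j] for j in range(s)] for i in range(s)]
    for _ in range(s):
        for i in range(s):
            for j in range(s):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        # Shift all A blocks one step left and all B blocks one step up.
        Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    return np.block(Cb)
</syntaxhighlight>

After the initial skew, processor {{math|(''i'', ''j'')}} holds blocks {{math|''A''<sub>''i'',''k''</sub>}} and {{math|''B''<sub>''k'',''j''</sub>}} with the same index {{math|''k'' {{=}} (''i'' + ''j'') mod {{sqrt|''p''}}}}, so every round contributes one correct term to each result block.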