On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed-memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
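For illustration, a minimal Python sketch of the naïve three-nested-loop algorithm (the function name and list-of-lists representation are illustrative, not from a cited source):

<syntaxhighlight lang="python">
def naive_matmul(A, B):
    """Naive three-nested-loop multiplication of square matrices.

    When n**2 exceeds the fast-memory size, elements of A and B are
    re-fetched from slow memory on almost every use, so the data moved
    grows as Omega(n**3).
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
</syntaxhighlight>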
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |title=A cellular computer to implement the Kalman Filter Algorithm |date=14 July 1969 |type=Ph.D. |publisher=Montana State University |url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math|''O''(''n''<sup>3</sup>/{{sqrt|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite conference |last1=Hong |first1=Jia-Wei |last2=Kung |first2=H. T. |title=I/O complexity: The red-blue pebble game |book-title=Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81) |year=1981 |pages=326–333}}</ref>
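The blocked scheme can be written down as a tiled loop nest. The sketch below assumes square {{mvar|n}}-by-{{mvar|n}} matrices and a fast memory of {{mvar|M}} words; the block size {{math|''b'' {{=}} {{sqrt|''M''/3}}}} lets one block of each of the three matrices reside in fast memory at once (the function and parameter names are illustrative):

<syntaxhighlight lang="python">
import math

def blocked_matmul(A, B, M):
    """Tiled multiplication with b-by-b blocks, b = sqrt(M/3).

    Each block product runs entirely in a fast memory of size M, so the
    total data moved drops to O(n**3 / sqrt(M)) words.
    """
    n = len(A)
    b = max(1, math.isqrt(M // 3))  # block edge chosen so 3*b*b <= M
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # Multiply one pair of blocks; all operands fit in fast memory.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
</syntaxhighlight>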
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/>

This can be improved by the ''3D algorithm'', which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, it requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so uses a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen's algorithm to further reduce runtime.<ref name="Agarwal"/>

"2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference|last1=Solomonik|first1=Edgar|first2=James |last2=Demmel|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms|book-title=Proceedings of the 17th International Conference on Parallel Processing|year=2011|volume=Part II|pages=90–109 |doi=10.1007/978-3-642-23397-5_10 |isbn=978-3-642-23397-5 |url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web |last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|year=2013|arxiv=1304.1467|bibcode=2013arXiv1304.1467B|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014}}</ref>
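The communication pattern of the 2D algorithm can be made concrete with a single-process simulation, in which each block of a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} grid stands for the data held by one processor and each shift stands for a message to a neighbouring processor (in a real distributed run the shifts would be, e.g., MPI sends; the names and the use of NumPy are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def cannon_matmul(A, B, s):
    """Single-process sketch of Cannon's algorithm on an s-by-s grid.

    Block (i, j) plays the role of processor (i, j).  Each of the s
    rounds ships one b-by-b block of A and of B per processor, so each
    processor communicates O(n**2 / sqrt(p)) words in total (p = s*s).
    """
    n = A.shape[0]
    b = n // s  # assumes s divides n
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
    Cb = [[np.zeros((b, b)) for _ in range(s)] for _ in range(s)]
    # Initial skew: row i of A shifts left by i; column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % s] for j in range(s)] for i in range(s)]
    Bb = [[Bb[(i + j) % s][j] for j in range(s)] for i in range(s)]
    for _ in range(s):
        for i in range(s):
            for j in range(s):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        # Shift all A blocks one step left and all B blocks one step up.
        Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    return np.block(Cb)
</syntaxhighlight>

After the initial skew, processor {{math|(''i'', ''j'')}} holds blocks {{math|''A''<sub>''i'',''k''</sub>}} and {{math|''B''<sub>''k'',''j''</sub>}} with the same index {{math|''k'' {{=}} (''i'' + ''j'') mod {{sqrt|''p''}}}}, so every round contributes one correct term to each result block.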