Matrix multiplication algorithm

Because [[matrix multiplication]] is such a central operation in many [[numerical algorithm]]s, much work has been invested in making '''matrix multiplication algorithms''' efficient. Applications of matrix multiplication in computational problems are found in many fields including [[scientific computing]] and [[pattern recognition]] and in seemingly unrelated problems such as counting the paths through a [[Graph (graph theory)|graph]].<ref name="skiena"/> Many different algorithms have been designed for multiplying matrices on different types of hardware, including [[parallel computing|parallel]] and [[distributed computing|distributed]] systems, where the computational work is spread over multiple processors (perhaps over a network).
 
Directly applying the mathematical definition of matrix multiplication gives an algorithm that [[Analysis of algorithms|takes time]] on the order of {{math|''n''<sup>3</sup>}} [[Field (mathematics)|field]] operations to multiply two {{math|''n'' × ''n''}} matrices over that field ({{math|Θ(''n''<sup>3</sup>)}} in [[big O notation]]). Better asymptotic bounds on the time required to multiply matrices have been known since [[Strassen algorithm|Strassen's algorithm]] in the 1960s, but the optimal time (that is, the [[computational complexity of matrix multiplication]]) remains unknown. {{As of|2020|12}}, the matrix multiplication algorithm with the best asymptotic complexity runs in {{math|O(''n''<sup>2.3728596</sup>)}} time, given by Josh Alman and [[Virginia Vassilevska Williams]].<ref name="aw20">{{cite conference |last1=Alman |first1=Josh |last2=Williams |first2=Virginia Vassilevska |title=A Refined Laser Method and Faster Matrix Multiplication |year=2020 |arxiv=2010.05846 |book-title=32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2021) |doi=10.1137/1.9781611976465.32 }}</ref><ref>{{Cite web |last=Hartnett |first=Kevin |title=Matrix Multiplication Inches Closer to Mythic Goal |url=https://www.quantamagazine.org/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323/ |access-date=2021-04-01 |website=Quanta Magazine |date=23 March 2021 |language=en}}</ref> However, this algorithm is a [[galactic algorithm]] because of the large constants and cannot be realized practically.
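To make the operation count concrete, the definition-based method can be sketched as follows (an illustrative Python function, not taken from a cited source; the matrices are assumed square and stored as nested lists):

<syntaxhighlight lang="python">
def naive_multiply(A, B):
    """Multiply two n-by-n matrices directly from the definition."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):          # n iterations ...
        for j in range(n):      # ... times n ...
            for k in range(n):  # ... times n: Theta(n^3) multiply-adds in total
                C[i][j] += A[i][k] * B[k][j]  # c_ij = sum over k of a_ik * b_kj
    return C
</syntaxhighlight>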
 
In 2022, [[DeepMind]] introduced AlphaTensor, a [[neural network]] that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans.<ref>{{Cite web |title=Discovering novel algorithms with AlphaTensor |url=https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor |access-date=2022-11-01 |website=www.deepmind.com |language=en}}</ref> The best practical algorithm it found runs in {{math|O(''n''<sup>2.778</sup>)}} time for matrices over the [[GF(2)|finite field <math>\mathbb Z/2\mathbb Z</math>]].<ref>{{Cite journal |last1=Fawzi |first1=Alhussein |last2=Balog |first2=Matej |last3=Huang |first3=Aja |last4=Hubert |first4=Thomas |last5=Romera-Paredes |first5=Bernardino |last6=Barekatain |first6=Mohammadamin |last7=Novikov |first7=Alexander |last8=R. Ruiz |first8=Francisco J. |last9=Schrittwieser |first9=Julian |last10=Swirszcz |first10=Grzegorz |last11=Silver |first11=David |last12=Hassabis |first12=Demis |last13=Kohli |first13=Pushmeet |date=October 2022 |title=Discovering faster matrix multiplication algorithms with reinforcement learning |journal=Nature |volume=610 |issue=7930 |pages=47–53 |doi=10.1038/s41586-022-05172-4 |pmid=36198780 |pmc=9534758 |issn=1476-4687}}</ref>
Here, ''fork'' is a keyword that signals that a computation may be run in parallel with the rest of the function call, while ''join'' waits for all previously "forked" computations to complete. {{math|partition}} achieves its goal by pointer manipulation only.
 
This algorithm has a [[critical path length]] of {{math|Θ(log<sup>2</sup> ''n'')}} steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible [[speedup]] of {{math|Θ(''n''<sup>3</sup>/log<sup>2</sup> ''n'')}} on any real computer. The algorithm is not practical due to the communication cost inherent in moving data to and from the temporary matrix {{mvar|T}}, but a more practical variant achieves {{math|Θ(''n''<sup>2</sup>)}} speedup, without using a temporary matrix.<ref name="cilk">{{cite thesis |type=Ph.D. |last=Randall |first=Keith H. |title=Cilk: Efficient Multithreaded Computing |publisher=Massachusetts Institute of Technology |year=1998 |pages=54–57 |url=http://supertech.csail.mit.edu/papers/randall-phdthesis.pdf }}</ref>
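The divide-and-conquer structure underlying this algorithm can be sketched sequentially (a NumPy illustration assuming {{mvar|n}} is a power of two; the eight recursive calls marked below are the computations the fork-join version runs in parallel):

<syntaxhighlight lang="python">
import numpy as np

def multiply(A, B):
    """Divide-and-conquer multiply mirroring the fork-join scheme."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    T = np.empty((n, n))
    # "fork": in the parallel version these eight recursive products
    # run concurrently, four into the quadrants of C ...
    C[:h, :h] = multiply(A[:h, :h], B[:h, :h])
    C[:h, h:] = multiply(A[:h, :h], B[:h, h:])
    C[h:, :h] = multiply(A[h:, :h], B[:h, :h])
    C[h:, h:] = multiply(A[h:, :h], B[:h, h:])
    # ... and four into the temporary matrix T.
    T[:h, :h] = multiply(A[:h, h:], B[h:, :h])
    T[:h, h:] = multiply(A[:h, h:], B[h:, h:])
    T[h:, :h] = multiply(A[h:, h:], B[h:, :h])
    T[h:, h:] = multiply(A[h:, h:], B[h:, h:])
    # "join", then add T into C (itself a parallel loop in the original).
    return C + T
</syntaxhighlight>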
 
[[File:Block matrix multiplication.svg|thumb|Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of {{mvar|C}}. In the 3D algorithm, every pair of submatrices from {{mvar|A}} and {{mvar|B}} that is multiplied is assigned to one processor.]]
===Communication-avoiding and distributed algorithms===
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
 
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |title=A cellular computer to implement the Kalman Filter Algorithm |date=14 July 1969 |type=Ph.D. |publisher=Montana State University |url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math|''O''(''n''<sup>3</sup>/{{sqrt|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite conference|last1=Hong|first1=J. W.|first2=H. T. |last2=Kung|title=I/O complexity: The red-blue pebble game|book-title=STOC '81: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing|year=1981|pages=326–333|url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|archive-url=https://web.archive.org/web/20191215182754/https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|url-status=live|archive-date=December 15, 2019 |doi=10.1145/800076.802486|s2cid=8410593 }}</ref><ref name=irony>{{cite journal|last1=Irony|first1=Dror|first2=Sivan |last2=Toledo |first3=Alexander |last3=Tiskin |title=Communication lower bounds for distributed-memory matrix multiplication|journal=J. Parallel Distrib. Comput.|date=September 2004|volume=64|issue=9|pages=1017–26|doi=10.1016/j.jpdc.2004.03.021|citeseerx=10.1.1.20.7034}}</ref>
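The effect of blocking can be illustrated with a serial sketch (hypothetical Python code; here {{mvar|b}} stands in for the block size {{math|{{sqrt|''M''/3}}}}, so that the three blocks touched by the innermost update together fit in fast memory):

<syntaxhighlight lang="python">
import numpy as np

def blocked_multiply(A, B, b):
    """Blocked multiply: every sub-product works on b-by-b blocks."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # Only one block each of A, B and C is needed here, so
                # with b ~ sqrt(M/3) all three fit in M words of fast memory.
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
</syntaxhighlight>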
 
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/> This can be improved by the ''3D algorithm'', which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, this requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so requires a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.<ref name="Agarwal"/> "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference|last1=Solomonik|first1=Edgar|first2=James |last2=Demmel|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms|book-title=Proceedings of the 17th International Conference on Parallel Processing|year=2011|volume=Part II|pages=90–109 |doi=10.1007/978-3-642-23397-5_10 |isbn=978-3-642-23397-5 |url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web |last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|year=2013|arxiv=1304.1467|bibcode=2013arXiv1304.1467B|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014}}</ref>
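The shifting pattern of the 2D algorithm can be simulated in a single process (an illustrative sketch, assuming a {{math|''q'' × ''q''}} grid with {{mvar|q}} dividing {{mvar|n}}; each block index pair plays the role of one processor, and the list rotations stand in for communication on the mesh):

<syntaxhighlight lang="python">
import numpy as np

def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm on a q-by-q grid of 'processors'."""
    n = A.shape[0]
    b = n // q  # each processor owns one b-by-b block of A, B and C
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Initial skew: rotate row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):  # q rounds of local multiply, then shift
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift up
    return np.block(Cb)
</syntaxhighlight>

Each simulated processor exchanges two {{math|''b'' × ''b''}} blocks per round over {{math|''q'' {{=}} {{sqrt|''p''}}}} rounds, matching the {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} communication bound above.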
 
===Algorithms for meshes===