Matrix multiplication algorithm

Because [[matrix multiplication]] is such a central operation in many [[numerical algorithm]]s, much work has been invested in making '''matrix multiplication algorithms''' efficient. Applications of matrix multiplication in computational problems are found in many fields including [[scientific computing]] and [[pattern recognition]] and in seemingly unrelated problems such as counting the paths through a [[Graph (graph theory)|graph]].<ref name="skiena"/> Many different algorithms have been designed for multiplying matrices on different types of hardware, including [[parallel computing|parallel]] and [[distributed computing|distributed]] systems, where the computational work is spread over multiple processors (perhaps over a network).
 
Directly applying the mathematical definition of matrix multiplication gives an algorithm that [[Analysis of algorithms|takes time]] on the order of {{math|''n''<sup>3</sup>}} [[Field (mathematics)|field]] operations to multiply two {{math|''n'' × ''n''}} matrices over that field ({{math|Θ(''n''<sup>3</sup>)}} in [[big O notation]]). Better asymptotic bounds on the time required to multiply matrices have been known since [[Strassen algorithm|Strassen's algorithm]] in the 1960s, but the optimal time (that is, the [[computational complexity of matrix multiplication]]) remains unknown. {{As of|2020|12}}, the matrix multiplication algorithm with the best asymptotic complexity runs in {{math|O(''n''<sup>2.3728596</sup>)}} time, given by Josh Alman and [[Virginia Vassilevska Williams]].<ref name="aw20">{{cite conference |last1=Alman |first1=Josh |last2=Williams |first2=Virginia Vassilevska |title=A Refined Laser Method and Faster Matrix Multiplication |year=2020 |arxiv=2010.05846 |book-title=32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2021) |doi=10.1137/1.9781611976465.32 }}</ref><ref>{{Cite web |last=Hartnett |first=Kevin |title=Matrix Multiplication Inches Closer to Mythic Goal |url=https://www.quantamagazine.org/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323/ |access-date=2021-04-01 |website=Quanta Magazine |date=23 March 2021 |language=en}}</ref> However, this algorithm is a [[galactic algorithm]] because of the large constants and cannot be realized practically.
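To make the operation count concrete, the definition-based method can be sketched as follows (an illustrative Python function, not taken from a cited source; the matrices are assumed square and stored as nested lists):

<syntaxhighlight lang="python">
def naive_multiply(A, B):
    """Multiply two n-by-n matrices directly from the definition."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):          # n iterations ...
        for j in range(n):      # ... times n ...
            for k in range(n):  # ... times n: Theta(n^3) multiply-adds in total
                C[i][j] += A[i][k] * B[k][j]  # c_ij = sum over k of a_ik * b_kj
    return C
</syntaxhighlight>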
 
In 2022, [[DeepMind]] introduced AlphaTensor, a [[neural network]] that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans.<ref>{{Cite web |title=Discovering novel algorithms with AlphaTensor |url=https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor |access-date=2022-11-01 |website=www.deepmind.com |language=en}}</ref> The best practical algorithm it found runs in {{math|O(''n''<sup>2.778</sup>)}} time for matrices over the [[GF(2)|finite field <math>\mathbb Z/2\mathbb Z</math>]].<ref>{{Cite journal |last1=Fawzi |first1=Alhussein |last2=Balog |first2=Matej |last3=Huang |first3=Aja |last4=Hubert |first4=Thomas |last5=Romera-Paredes |first5=Bernardino |last6=Barekatain |first6=Mohammadamin |last7=Novikov |first7=Alexander |last8=R. Ruiz |first8=Francisco J. |last9=Schrittwieser |first9=Julian |last10=Swirszcz |first10=Grzegorz |last11=Silver |first11=David |last12=Hassabis |first12=Demis |last13=Kohli |first13=Pushmeet |date=October 2022 |title=Discovering faster matrix multiplication algorithms with reinforcement learning |journal=Nature |volume=610 |issue=7930 |pages=47–53 |doi=10.1038/s41586-022-05172-4 |pmid=36198780 |pmc=9534758 |issn=1476-4687}}</ref>
Here, ''fork'' is a keyword that signals that a computation may be run in parallel with the rest of the function call, while ''join'' waits for all previously "forked" computations to complete. {{math|partition}} achieves its goal by pointer manipulation only.
 
This algorithm has a [[critical path length]] of {{math|Θ(log<sup>2</sup> ''n'')}} steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible [[speedup]] of {{math|Θ(''n''<sup>3</sup>/log<sup>2</sup> ''n'')}} on any real computer. The algorithm is not practical due to the communication cost inherent in moving data to and from the temporary matrix {{mvar|T}}, but a more practical variant achieves {{math|Θ(''n''<sup>2</sup>)}} speedup, without using a temporary matrix.<ref name="cilk">{{cite thesis |type=Ph.D. |last=Randall |first=Keith H. |title=Cilk: Efficient Multithreaded Computing |publisher=Massachusetts Institute of Technology |year=1998 |pages=54–57 |url=http://supertech.csail.mit.edu/papers/randall-phdthesis.pdf }}</ref>
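The divide-and-conquer structure underlying this algorithm can be sketched sequentially (a NumPy illustration assuming {{mvar|n}} is a power of two; the eight recursive calls marked below are the computations the fork-join version runs in parallel):

<syntaxhighlight lang="python">
import numpy as np

def multiply(A, B):
    """Divide-and-conquer multiply mirroring the fork-join scheme."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    T = np.empty((n, n))
    # "fork": in the parallel version these eight recursive products
    # run concurrently, four into the quadrants of C ...
    C[:h, :h] = multiply(A[:h, :h], B[:h, :h])
    C[:h, h:] = multiply(A[:h, :h], B[:h, h:])
    C[h:, :h] = multiply(A[h:, :h], B[:h, :h])
    C[h:, h:] = multiply(A[h:, :h], B[:h, h:])
    # ... and four into the temporary matrix T.
    T[:h, :h] = multiply(A[:h, h:], B[h:, :h])
    T[:h, h:] = multiply(A[:h, h:], B[h:, h:])
    T[h:, :h] = multiply(A[h:, h:], B[h:, :h])
    T[h:, h:] = multiply(A[h:, h:], B[h:, h:])
    # "join", then add T into C (itself a parallel loop in the original).
    return C + T
</syntaxhighlight>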
 
[[File:Block matrix multiplication.svg|thumb|Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of {{mvar|C}}. In the 3D algorithm, every pair of submatrices from {{mvar|A}} and {{mvar|B}} that is multiplied is assigned to one processor.]]
===Communication-avoiding and distributed algorithms===
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
 
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |title=A cellular computer to implement the Kalman Filter Algorithm |date=14 July 1969 |type=Ph.D. |publisher=Montana State University |url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math|''O''(''n''<sup>3</sup>/{{sqrt|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite conference|last1=Hong|first1=J. W.|first2=H. T. |last2=Kung|title=I/O complexity: The red-blue pebble game|book-title=STOC '81: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing|year=1981|pages=326–333|url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|archive-url=https://web.archive.org/web/20191215182754/https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|url-status=live|archive-date=December 15, 2019 |doi=10.1145/800076.802486|s2cid=8410593 }}</ref><ref name=irony>{{cite journal|last1=Irony|first1=Dror|first2=Sivan |last2=Toledo |first3=Alexander |last3=Tiskin |title=Communication lower bounds for distributed-memory matrix multiplication|journal=J. Parallel Distrib. Comput.|date=September 2004|volume=64|issue=9|pages=1017–26|doi=10.1016/j.jpdc.2004.03.021|citeseerx=10.1.1.20.7034}}</ref>
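The effect of blocking can be illustrated with a serial sketch (hypothetical Python code; here {{mvar|b}} stands in for the block size {{math|{{sqrt|''M''/3}}}}, so that the three blocks touched by the innermost update together fit in fast memory):

<syntaxhighlight lang="python">
import numpy as np

def blocked_multiply(A, B, b):
    """Blocked multiply: every sub-product works on b-by-b blocks."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # Only one block each of A, B and C is needed here, so
                # with b ~ sqrt(M/3) all three fit in M words of fast memory.
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
</syntaxhighlight>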
 
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/> This can be improved by the ''3D algorithm'', which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, this requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so requires a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.<ref name="Agarwal"/> "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference|last1=Solomonik|first1=Edgar|first2=James |last2=Demmel|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms|book-title=Proceedings of the 17th International Conference on Parallel Processing|year=2011|volume=Part II|pages=90–109 |doi=10.1007/978-3-642-23397-5_10 |isbn=978-3-642-23397-5 |url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web |last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|year=2013|arxiv=1304.1467|bibcode=2013arXiv1304.1467B|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014}}</ref>
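The shifting pattern of the 2D algorithm can be simulated in a single process (an illustrative sketch, assuming a {{math|''q'' × ''q''}} grid with {{mvar|q}} dividing {{mvar|n}}; each block index pair plays the role of one processor, and the list rotations stand in for communication on the mesh):

<syntaxhighlight lang="python">
import numpy as np

def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm on a q-by-q grid of 'processors'."""
    n = A.shape[0]
    b = n // q  # each processor owns one b-by-b block of A, B and C
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Initial skew: rotate row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):  # q rounds of local multiply, then shift
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift up
    return np.block(Cb)
</syntaxhighlight>

Each simulated processor exchanges two {{math|''b'' × ''b''}} blocks per round over {{math|''q'' {{=}} {{sqrt|''p''}}}} rounds, matching the {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} communication bound above.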
 
===Algorithms for meshes===