which order is best also depends on whether the matrices are stored in [[row- and column-major order|row-major order, column-major order]], or a mix of both.
In particular, in the idealized case of a [[CPU cache#Associativity|fully associative cache]] consisting of {{mvar|M}} bytes and {{mvar|b}} bytes per cache line (i.e. {{sfrac|''M''|''b''}} cache lines), the above algorithm is sub-optimal for {{mvar|A}} and {{mvar|B}} stored in row-major order. When {{math|''n'' > {{sfrac|''M''|''b''}}}}, every iteration of the inner loop (a simultaneous sweep through a row of {{mvar|A}} and a column of {{mvar|B}}) incurs a cache miss when accessing an element of {{mvar|B}}. This means that the algorithm incurs {{math|Θ(''n''<sup>3</sup>)}} cache misses in the worst case. {{As of|2010}}, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.<ref name="ocw">{{cite web|url=http://aka-ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/video-lectures/lecture-8-cache-efficient-algorithms/|title=6.172 Performance Engineering of Software Systems, Lecture 8|last1=Amarasinghe|first1=Saman|last2=Leiserson|first2=Charles|author2-link=Charles E. Leiserson|year=2010|website=MIT OpenCourseWare|publisher=Massachusetts Institute of Technology|access-date=27 January 2015}}</ref>
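The access pattern behind these misses can be made explicit with a short C sketch of the naïve triple loop (illustrative only: the function and array names are not from any particular library, and the matrices are assumed to be square and stored in row-major order):
<syntaxhighlight lang="c">
/* Naive i-j-k multiplication, C = A * B, with n-by-n row-major matrices.
 * The innermost loop walks along a row of A (consecutive addresses, good
 * spatial locality) but down a column of B, so for large n every access
 * to B touches a different cache line. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];  /* column walk through B */
            C[i * n + j] = sum;
        }
    }
}
</syntaxhighlight>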
The optimal variant of the iterative algorithm for {{mvar|A}} and {{mvar|B}} in row-major layout is a ''[[loop tiling|tiled]]'' version, where the matrix is implicitly divided into square tiles of size {{math|{{radic|''M''}}}} by {{math|{{radic|''M''}}}}:<ref name="ocw"/><ref>{{cite conference |first1=Monica S. |last1=Lam |author1-link=Monica S. Lam|first2=Edward E. |last2=Rothberg |first3=Michael E. |last3=Wolf |title=The Cache Performance and Optimizations of Blocked Algorithms |conference=ASPLOS91: 4th Int'l Conference on Architecture Support for Programming Languages & Operating Systems |isbn=978-0-89791-380-5 |year=1991 |doi=10.1145/106972.106981}}</ref>
{{framebox|blue}}
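A possible C realization of such a tiled loop nest is sketched below (an illustration rather than a reference implementation: the tile size <code>T</code> is assumed to be on the order of {{math|{{radic|''M''}}}} and to divide {{mvar|n}} evenly, and {{mvar|C}} is assumed to be zero-initialized):
<syntaxhighlight lang="c">
/* Tiled (blocked) multiplication, C += A * B, with n-by-n row-major matrices.
 * T is the tile edge, chosen so that the tiles being worked on fit in cache;
 * n is assumed to be a multiple of T to keep the sketch short. */
void matmul_tiled(int n, int T, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += T)
        for (int jj = 0; jj < n; jj += T)
            for (int kk = 0; kk < n; kk += T)
                /* multiply one pair of T-by-T tiles entirely in fast memory */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++) {
                        double sum = C[i * n + j];
                        for (int k = kk; k < kk + T; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
</syntaxhighlight>
Each tile-tile product performs {{math|Θ(''M''<sup>3/2</sup>)}} arithmetic operations while touching only {{math|Θ(''M'')}} matrix elements, which is what reduces the cache-miss count.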
[[File:MatrixMultComplexity svg.svg|thumb|400px|right|Improvement of estimates of exponent {{math|ω}} over time for the computational complexity of matrix multiplication <math>O(n^\omega)</math>.]]
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was [[Strassen algorithm|Strassen's algorithm]], devised by [[Volker Strassen]] in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2×2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of <math>O( n^{\log_{2}7}) \approx O(n^{2.807})</math>. Strassen's algorithm is more complex, and the [[numerical stability]] is reduced compared to the naïve algorithm,<ref>{{Citation | last1=Miller | first1=Webb|author1-link=Webb Miller | title=Computational complexity and numerical stability | citeseerx = 10.1.1.148.9947 | year=1975 | journal=SIAM Journal on Computing | volume=4 | issue=2 | pages=97–107 | doi=10.1137/0204009}}</ref> but it is faster in cases where {{math|''n'' > 100}} or so<ref name="skiena"/> and appears in several libraries, such as [[Basic Linear Algebra Subprograms|BLAS]].<ref>{{cite book |last1=Press |first1=William H. |last2=Flannery |first2=Brian P. |last3=Teukolsky |first3=Saul A. |author3-link=Saul Teukolsky |last4=Vetterling |first4=William T. |title=Numerical Recipes: The Art of Scientific Computing |publisher=[[Cambridge University Press]] |edition=3rd |isbn=978-0-521-88068-8 |year=2007 |page=[https://archive.org/details/numericalrecipes00pres_033/page/n131 108]|title-link=Numerical Recipes }}</ref> It is very useful for large matrices over exact domains such as [[finite field]]s, where numerical stability is not an issue.
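The exponent {{math|log<sub>2</sub> 7}} comes from the recurrence satisfied by one level of the recursion: multiplying two {{math|''n'' × ''n''}} matrices is reduced to 7 multiplications of {{math|{{sfrac|''n''|2}} × {{sfrac|''n''|2}}}} matrices plus {{math|Θ(''n''<sup>2</sup>)}} work for the additions and subtractions, so by the [[master theorem (analysis of algorithms)|master theorem]]
<math display="block">T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + \Theta(n^2) \quad\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right).</math>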
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the [[big O notation|big-O notation]] has its merits. A table that compares key aspects of the improved version based on recursive multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, <math>n</math> gives the dimensions of the matrix and <math>M</math> designates the memory size.
{| class="wikitable"
! Year !! Reference !! #matrix multiplications per step !! #matrix additions per step !! total arithmetic operations !! total I/O-complexity
|-
| 1969 || Strassen<ref>{{cite journal |last=Strassen |first=Volker |author-link=Volker Strassen|title=Gaussian Elimination is not Optimal |journal=Numer. Math. |volume=13 |issue= 4 |pages=354–356 |year=1969 |doi=10.1007/BF02165411 |s2cid=121656251 }}</ref> || 7 || 18 || <math>7n^{\log_2 7}-6n^2</math> || <math>6\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-18n^2 +3M</math>
|-
| 1971 || Winograd<ref>{{cite journal |last=Winograd |first=Shmuel |author-link=Shmuel Winograd|title=On multiplication of 2×2 matrices |journal=Linear Algebra and Its Applications |volume=4 |issue= 4 |pages=381–388 |year=1971 |doi=10.1016/0024-3795(71)90009-7|doi-access=free }}</ref> || 7 || 15 || <math>6n^{\log_2 7}-5n^2</math> || <math>5\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-15n^2 +3M</math>
|-
| 2017 || Karstadt, Schwartz<ref>{{cite conference |url=https://dl.acm.org/doi/10.1145/3087556.3087579 |title=Matrix Multiplication, a Little Faster |last1=Karstadt |first1=Elaye |last2=Schwartz |first2=Oded |date=July 2017 |publisher= |book-title=Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures |pages=101–110 |conference=SPAA '17 |doi=10.1145/3087556.3087579}}</ref> || 7 || 12 || <math>5n^{\log_2 7}-4n^2+3n^2\log_2n</math> || <math>4\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-12n^2 +3n^2\cdot\log_2\left(\frac{\sqrt{2}n}{\sqrt{M}}\right) +5M</math>
|}
It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert<ref>{{cite journal |last=Probert |first=Robert L. |title=On the additive complexity of matrix multiplication |journal=SIAM J. Comput. |volume=5 |issue= 2 |pages=187–203 |year=1976 |doi=10.1137/0205016}}</ref> showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work, Beniamini et al.<ref>{{cite arXiv|eprint=2008.03759 |last1=Beniamini |first1=Gal |last2=Cheng |first2=Nathan |last3=Holtz |first3=Olga |author3-link=Olga Holtz|last4=Karstadt |first4=Elaye |last5=Schwartz |first5=Oded |title=Sparsifying the Operators of Fast Matrix Multiplication Algorithms |date=2020 |class=cs.DS }}</ref> applied this base-change trick to more general decompositions than 2×2-block matrices and improved the leading constant for their run times.
It is an open question in [[theoretical computer science]] how well Strassen's algorithm can be improved in terms of [[Time complexity|asymptotic complexity]]. The ''matrix multiplication exponent'', usually denoted <math>\omega</math>, is the smallest real number for which any two <math>n\times n</math> matrices over a field can be multiplied using <math>n^{\omega + o(1)}</math> field operations. The current best bound on <math>\omega</math> is <math>\omega < 2.371552</math>, by [[Virginia Vassilevska Williams|Williams]], Xu, Xu, and Zhou.<ref name="apr24w"/><ref name="aw20" /> This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by [[Don Coppersmith]] and [[Shmuel Winograd]] in 1990.<ref name="coppersmith">{{Citation|doi=10.1016/S0747-7171(08)80013-2 |title=Matrix multiplication via arithmetic progressions |url=http://www.cs.umd.edu/~gasarch/TOPICS/ramsey/matrixmult.pdf |year=1990 |last1=Coppersmith |first1=Don |author1-link=Don Coppersmith|last2=Winograd |first2=Shmuel|author2-link=Shmuel Winograd |journal=Journal of Symbolic Computation |volume=9|issue=3|pages=251|doi-access=free }}</ref> The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two {{math|''k'' × ''k''}}-matrices with fewer than {{math|''k''<sup>3</sup>}} multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the [[Big O notation|big-O notation]] is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.<ref>{{citation
| last = Iliopoulos
| first = Costas S.
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |title=A cellular computer to implement the Kalman Filter Algorithm |date=14 July 1969 |type=Ph.D. |publisher=Montana State University |url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math|''O''(''n''<sup>3</sup>/{{sqrt|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite book|last1=Hong|first1=J. W.|first2=H. T. |last2=Kung|author2-link=H. T. Kung|chapter=I/O complexity: The red-blue pebble game |title=Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81 |year=1981|pages=326–333|chapter-url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|archive-url=https://web.archive.org/web/20191215182754/https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf|url-status=live|archive-date=December 15, 2019 |doi=10.1145/800076.802486|s2cid=8410593 }}</ref><ref name=irony>{{cite journal|last1=Irony|first1=Dror|first2=Sivan |last2=Toledo |first3=Alexander |last3=Tiskin |title=Communication lower bounds for distributed-memory matrix multiplication|journal=J. Parallel Distrib. Comput.|date=September 2004|volume=64|issue=9|pages=1017–26|doi=10.1016/j.jpdc.2004.03.021|citeseerx=10.1.1.20.7034}}</ref>
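The bound can be recovered by counting block operations: with blocks of side <math>b = \sqrt{M/3}</math> there are <math>(n/b)^3</math> block-block multiplications, each of which transfers <math>O(M)</math> words between slow and fast memory, for a total of
<math display="block">\left(\frac{n}{b}\right)^{3} \cdot O(M) = O\!\left(\frac{n^3}{(M/3)^{3/2}}\cdot M\right) = O\!\left(\frac{n^3}{\sqrt{M}}\right)</math>
words moved.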
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/> This can be improved by the ''3D algorithm,'' which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, this requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so requires a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.<ref name="Agarwal"/> "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference|last1=Solomonik|first1=Edgar|first2=James |last2=Demmel|author2-link=James Demmel|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms|book-title=Proceedings of the 17th International Conference on Parallel Processing|year=2011|volume=Part II|pages=90–109 |doi=10.1007/978-3-642-23397-5_10 |isbn=978-3-642-23397-5 |url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web |last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|year=2013|arxiv=1304.1467|bibcode=2013arXiv1304.1467B|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014}}</ref>
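For the 2D mesh the communication count follows directly from the block structure: each processor stores one <math>\tfrac{n}{\sqrt{p}} \times \tfrac{n}{\sqrt{p}}</math> block of each input, and Cannon's algorithm runs for <math>\sqrt{p}</math> rounds in which every processor passes on one such block of <math>n^2/p</math> words, so each processor transmits
<math display="block">\sqrt{p}\cdot\frac{n^2}{p} = \frac{n^2}{\sqrt{p}}</math>
words in total. In the 3D variant each processor instead handles a single product of blocks of side <math>n/p^{1/3}</math>, which requires receiving and sending only <math>O\!\left((n/p^{1/3})^2\right) = O\!\left(n^2/p^{2/3}\right)</math> words.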
===Algorithms for meshes===