Matrix multiplication algorithm: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 14:21, 18 June 2023 edit Citation bot (talk \| contribs) Bots 5,861,370 edits Add: bibcode. \| Use this bot. Report bugs. \| Suggested by Headbomb \| Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox2 \| #UCB_webform_linked 545/999 ← Previous edit		Latest revision as of 13:22, 24 June 2025 edit undo OAbot (talk \| contribs) Bots 643,717 edits m Open access bot: url-access=subscription updated in citation with #oabot.
(38 intermediate revisions by 23 users not shown)
Line 2: Because [[matrix multiplication]] is such a central operation in many [[numerical algorithm]]s, much work has been invested in making '''matrix multiplication algorithms''' efficient. Applications of matrix multiplication in computational problems are found in many fields including [[scientific computing]] and [[pattern recognition]] and in seemingly unrelated problems such as counting the paths through a [[Graph (graph theory)\|graph]].<ref name="skiena"/> Many different algorithms have been designed for multiplying matrices on different types of hardware, including [[parallel computing\|parallel]] and [[distributed computing\|distributed]] systems, where the computational work is spread over multiple processors (perhaps over a network). Directly applying the mathematical definition of matrix multiplication gives an algorithm that [[Analysis of algorithms\|takes time]] on the order of {{math\|''n''<sup>3</sup>}} [[Field (mathematics)\|field]] operations to multiply two {{math\|''n'' × ''n''}} matrices over that field ({{math\|Θ(''n''<sup>3</sup>)}} in [[big O notation]]). Better asymptotic bounds on the time required to multiply matrices have been known since the [[Strassen algorithm\|Strassen's algorithm]] in the 1960s, but the optimal time (that is, the [[computational complexity of matrix multiplication]]) remains unknown. {{As of\|~~2022~~2024\|1004}}, the best announced bound on the [[Time complexity\|asymptotic complexity]] of a matrix multiplication algorithm is {{math\|O(''n''<sup>2.~~37188~~371552</sup>)}} time, given by ~~Duan~~[[Virginia Vassilevska Williams\|Williams]], WuXu, Xu, and Zhou.<ref name="~~dwz22~~apr24w"> {{Citation \| last1=Williams \| first1=Virginia Vassilevska \| author1-link = Virginia Vassilevska Williams \| last2=Xu \| first2=Yinzhan \| last3=Xu \| first3=Zixuan \| last4=Zhou \| first4=Renfei \| year = ~~2020~~2024▼ \| arxiv=2307.07970 \| title = New Bounds for Matrix Multiplication: from Alpha to Omega }}</ref><ref name="dwz22"> {{Citation \| last1=Duan Line 13 ⟶ 26: \| arxiv=2210.10173 \| title = Faster Matrix Multiplication via Asymmetric Hashing }}</ref> ~~announced in a preprint.~~ This improves on the bound of {{math\|O(''n''<sup>2.3728596</sup>)}} time, given by ~~Josh~~ Alman and ~~[[Virginia Vassilevska~~ Williams]].<ref name="aw20"> {{Citation \| last1=Alman \| first1=Josh \| last2=Williams \| first2=Virginia Vassilevska \| author2-link = Virginia Vassilevska Williams \| year = 2024 ~~\| contribution=A Refined Laser Method and Faster Matrix Multiplication~~ ▲ \| year = 2020 \| arxiv=2010.05846 \| title = ~~32nd~~A ~~Annual~~Refined ~~ACM-SIAM~~Laser ~~Symposium~~Method onand ~~Discrete~~Faster ~~Algorithms~~Matrix ~~(SODA 2021)~~Multiplication \| journal=Theoretics \| volume=3 \| doi=10.46298/theoretics.24.21 \|url=https://www.siam.org/conferences/cm/program/accepted-papers/soda21-accepted-papers }}</ref><ref name="quanta">{{Cite web\|last=Hartnett\|first=Kevin\|title=Matrix Multiplication Inches Closer to Mythic Goal\|url=https://www.quantamagazine.org/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323/\|access-date=2021-04-01\|website=Quanta Magazine\|date=23 March 2021 \|language=en}}</ref> However, this algorithm is a [[galactic algorithm]] because of the large constants and cannot be realized practically. Line 45 ⟶ 60: {{frame-footer}} This algorithm takes [[time complexity\|time]] {{math\|Θ(''nmp'')}} (in [[asymptotic notation]]).<ref name="skiena"/> A common simplification for the purpose of [[analysis of algorithms\|~~algorithms~~algorithm analysis]] is to assume that the inputs are all square matrices of size {{math\|''n'' × ''n''}}, in which case the running time is {{math\|Θ(''n''<sup>3</sup>)}}, i.e., cubic in the size of the dimension.<ref name="clrs">{{Introduction to Algorithms\|3\|pages=75–79}}</ref> ===Cache behavior=== [[File:Row_and_column_major_order.svg\|thumb\|upright\|Illustration of [[row- and column-major order]] ]] The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the [[locality of reference\|memory access patterns]] and [[CPU cache\|cache]] use of the algorithm;<ref name="skiena">{{cite book \|first=Steven \|last=Skiena \|date=2012 \|author-link=Steven Skiena \|title=The Algorithm Design Manual \|url=https://archive.org/details/algorithmdesignm00skie_772 \|url-access=limited \|publisher=Springer ~~\|year=2008~~ \|pages=[https://archive.org/details/algorithmdesignm00skie_772/page/n56 45]–46, 401–3 \|doi=10.1007/978-1-84800-070-4_4\|chapter=Sorting and Searching \|isbn=978-1-84800-069-8 }}</ref> which order is best also depends on whether the matrices are stored in [[row- and column-major order\|row-major order, column-major order]], or a mix of both. In particular, in the idealized case of a [[CPU cache#Associativity\|fully associative cache]] consisting of {{mvar\|M}} bytes and {{mvar\|b}} bytes per cache line (i.e. {{sfrac\|''M''\|''b''}} cache lines), the above algorithm is sub-optimal for {{mvar\|A}} and {{mvar\|B}} stored in row-major order. When {{math\|''n'' > {{sfrac\|''M''\|''b''}}}}, every iteration of the inner loop (a simultaneous sweep through a row of {{mvar\|A}} and a column of {{mvar\|B}}) incurs a cache miss when accessing an element of {{mvar\|B}}. This means that the algorithm incurs {{math\|Θ(''n''<sup>3</sup>)}} cache misses in the worst case. {{As of\|2010}}, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.<ref name="ocw">{{cite web\|url=http://aka-ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/video-lectures/lecture-8-cache-efficient-algorithms/\|title=6.172 Performance Engineering of Software Systems, Lecture 8\|last1=Amarasinghe\|first1=Saman\|last2=Leiserson\|first2=Charles\|author2-link=Charles E. Leiserson\|year=2010\|website=MIT OpenCourseWare\|publisher=Massachusetts Institute of Technology\|access-date=27 January 2015}}</ref> The optimal variant of the iterative algorithm for {{mvar\|A}} and {{mvar\|B}} in row-major layout is a ''[[loop tiling\|tiled]]'' version, where the matrix is implicitly divided into square tiles of size {{math\|{{radic\|''M''}}}} by {{math\|{{radic\|''M''}}}}:<ref name="ocw"/><ref>{{cite conference \|first1=Monica S. \|last1=Lam \|author1-link=Monica S. Lam\|first2=Edward E. \|last2=Rothberg \|first3=Michael E. \|last3=Wolf \|title=The Cache Performance and Optimizations of Blocked Algorithms \|conference=ASPLOS91: 4th Int'l Conference on Architecture Support for Programming Languages & Operating Systems \|isbn=978-0-89791-380-5 \|year=1991 \|doi=10.1145/106972.106981}}</ref> {{framebox\|blue}} Line 156 ⟶ 171: [[File:MatrixMultComplexity svg.svg\|thumb\|400px\|right\|Improvement of estimates of exponent {{math\|ω}} over time for the computational complexity of matrix multiplication <math>O(n^\omega)</math>.]] Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was [[Strassen algorithm\|Strassen's algorithm]], devised by [[Volker Strassen]] in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two ~~{{math\|2~~2×2 ~~× 2}}-~~matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of <math>O( n^{\log_{2}7}) \approx O(n^{2.807})</math>. Strassen's algorithm is more complex, and the [[numerical stability]] is reduced compared to the naïve algorithm,<ref>{{Citation \| last1=Miller \| first1=Webb\|author1-link=Webb Miller \| title=Computational complexity and numerical stability \| citeseerx = 10.1.1.148.9947 \| year=1975 \| journal=SIAM News \| volume=4 \| issue=2 \| pages=97–107 \| doi=10.1137/0204009}}</ref> but it is faster in cases where {{math\|''n'' > 100}} or so<ref name="skiena"/> and appears in several libraries, such as [[Basic Linear Algebra Subprograms\|BLAS]].<ref>{{cite book \|last1=Press \|first1=William H. \|last2=Flannery \|first2=Brian P. \|last3=Teukolsky \|first3=Saul A. \|author3-link=Saul Teukolsky \|last4=Vetterling \|first4=William T. \|title=Numerical Recipes: The Art of Scientific Computing \|publisher=[[Cambridge University Press]] \|edition=3rd \|isbn=978-0-521-88068-8 \|year=2007 \|page=[https://archive.org/details/numericalrecipes00pres_033/page/n131 108]\|title-link=Numerical Recipes }}</ref> It is very useful for large matrices over exact domains such as [[finite field]]s, where numerical stability is not an issue. Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the [[big O notation\|big-O notation]] has its merits. A table that compares key aspects of the improved version based on recursive multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, <math>n</math> gives the dimensions of the matrix and <math>M</math> designates the memory size. It is an open question in [[theoretical computer science]] how well Strassen's algorithm can be improved in terms of [[Time complexity\|asymptotic complexity]]. The ''matrix multiplication exponent'', usually denoted <math>\omega</math>, is the smallest real number for which any <math>n\times n</math> matrix over a field can be multiplied together using <math>n^{\omega + o(1)}</math> field operations. The current best bound on <math>\omega</math> is <math>\omega < 2.3728596</math>, by Josh Alman and [[Virginia Vassilevska Williams]].<ref name="aw20" /> This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by [[Don Coppersmith]] and [[Shmuel Winograd]] in 1990.<ref name="coppersmith">{{Citation\|doi=10.1016/S0747-7171(08)80013-2 \|title=Matrix multiplication via arithmetic progressions \|url=http://www.cs.umd.edu/~gasarch/TOPICS/ramsey/matrixmult.pdf \|year=1990 \|last1=Coppersmith \|first1=Don \|last2=Winograd \|first2=Shmuel \|journal=Journal of Symbolic Computation \|volume=9\|issue=3\|pages=251\|doi-access=free }}</ref> The conceptual idea of these algorithms are similar to Strassen's algorithm: a way is devised for multiplying two {{math\|''k'' × ''k''}}-matrices with fewer than {{math\|''k''<sup>3</sup>}} multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the [[Big O notation]] is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.<ref>{{citation▼ {\| class="wikitable" \|+ Progress for Strassen-like recursive 2x2-block matrix multiplication \|- ! Year !! Reference !! #matrix multiplications per step !! #matrix additions per step !! total arithmetic operations !! total I/O-complexity \|- \| 1969 \|\| Strassen<ref>{{cite journal \|last=Strassen \|first=Volker \|author-link=Volker Strassen\|title=Gaussian Elimination is not Optimal \|journal=Numer. Math. \|volume=13 \|issue= 4 \|pages=354–356 \|year=1969 \|doi=10.1007/BF02165411 \|s2cid=121656251 }}</ref> \|\| 7 \|\| 18 \|\| <math>7n^{\log_2 7}-6n^2</math> \|\| <math>6\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-18n^2 +3M</math> \|- \| 1971 \|\| Winograd<ref>{{cite journal \|last=Winograd \|first=Shmuel \|author-link=Shmuel Winograd\|title=On multiplication of 2×2 matrices \|journal=Linear Algebra and Its Applications \|volume=4 \|issue= 4 \|pages=381–388 \|year=1971 \|doi=10.1016/0024-3795(71)90009-7\|doi-access=free }}</ref> \|\| 7 \|\| 15 \|\| <math>6n^{\log_2 7}-5n^2</math> \|\| <math>5\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-15n^2 +3M</math> \|- \| 2017 \|\| Karstadt, Schwartz<ref>{{cite conference \|url=https://dl.acm.org/doi/10.1145/3087556.3087579 \|title=Matrix Multiplication, a Little Faster \|last1=Karstadt \|first1=Elaye \|last2=Schwartz \|first2=Oded \|date=July 2017 \|publisher= \|book-title=Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures \|pages=101–110 \|conference=SPAA '17 \|doi=10.1145/3087556.3087579\|url-access=subscription }}</ref> \|\| 7 \|\| 12 \|\| <math>5n^{\log_2 7}-4n^2+3n^2\log_2n</math> \|\| <math>4\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-12n^2 +3n^2\cdot\log_2\left(\frac{\sqrt{2}n}{\sqrt{M}}\right) +5M</math> \|- \| 2023 \|\| Schwartz, Vaknin<ref>{{cite conference \|url=https://doi.org/10.1137/22M1502719 \|title=Pebbling Game and Alternative Basis for High Performance Matrix Multiplication \|last1=Schwartz \|first1=Oded \|last2=Vaknin \|first2=Noa \|date=2023 \|publisher= \|book-title=SIAM Journal on Scientific Computing \|pages=C277–C303 \|doi=10.1137/22M1502719\|url-access=subscription }}</ref> \|\| 7 \|\| 12 \|\| <math>5n^{\log_2 7}-4n^2+1.5n^2\log_2n</math> \|\| <math>4\left(\frac{\sqrt{3}n}{\sqrt{M}}\right)^{\log_2 7}\cdot M-12n^2 +1.5n^2\cdot\log_2\left(\frac{\sqrt{2}n}{\sqrt{M}}\right) +5M</math> \|} It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert<ref>{{cite journal \|last=Probert \|first=Robert L. \|title=On the additive complexity of matrix multiplication \|journal=SIAM J. Comput. \|volume=5 \|issue= 2 \|pages=187–203 \|year=1976 \|doi=10.1137/0205016}}</ref> showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work Beniamini et el.<ref>{{cite arXiv\|eprint=2008.03759 \|last1=Beniamini \|first1=Gal \|last2=Cheng \|first2=Nathan \|last3=Holtz \|first3=Olga \|author3-link=Olga Holtz\|last4=Karstadt \|first4=Elaye \|last5=Schwartz \|first5=Oded \|title=Sparsifying the Operators of Fast Matrix Multiplication Algorithms \|date=2020 \|class=cs.DS }}</ref> applied this base-change trick to more general decompositions than 2×2-block matrices and improved the leading constant for their run times. ▲It is an open question in [[theoretical computer science]] how well Strassen's algorithm can be improved in terms of [[Time complexity\|asymptotic complexity]]. The ''matrix multiplication exponent'', usually denoted <math>\omega</math>, is the smallest real number for which any <math>n\times n</math> matrix over a field can be multiplied together using <math>n^{\omega + o(1)}</math> field operations. The current best bound on <math>\omega</math> is <math>\omega < 2.~~3728596~~371552</math>, by ~~Josh Alman and~~ [[Virginia Vassilevska Williams\|Williams]], Xu, Xu, and Zhou.<ref name="apr24w"/><ref name="aw20" /> This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by [[Don Coppersmith]] and [[Shmuel Winograd]] in 1990.<ref name="coppersmith">{{Citation\|doi=10.1016/S0747-7171(08)80013-2 \|title=Matrix multiplication via arithmetic progressions \|url=http://www.cs.umd.edu/~gasarch/TOPICS/ramsey/matrixmult.pdf \|year=1990 \|last1=Coppersmith \|first1=Don \|author1-link=Don Coppersmith\|last2=Winograd \|first2=Shmuel\|author2-link=Shmuel Winograd \|journal=Journal of Symbolic Computation \|volume=9\|issue=3\|pages=251\|doi-access=free }}</ref> The conceptual idea of these algorithms ~~are~~is similar to Strassen's algorithm: a way is devised for multiplying two {{math\|''k'' × ''k''}}-matrices with fewer than {{math\|''k''<sup>3</sup>}} multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the [[Big O notation\|big-O notation]] is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.<ref>{{citation \| last = Iliopoulos \| first = Costas S. Line 176 ⟶ 209: \| archive-date = 2014-03-05 \| url-status = dead }}</ref><ref name="robinson">{{Citation \| last=Robinson \| first=Sara \| title=Toward an Optimal Algorithm for Matrix Multiplication \| url=https://archive.siam.org/pdf/news/174.pdf \| date=November 2005 \| journal=SIAM News \| volume=38 \| issue=9 \| quote=Even if someone manages to prove one of the conjectures—thereby demonstrating that {{math\|1=''ω'' = 2}}—the wreath product approach is unlikely to be applicable to the large matrix problems that arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent.}}</ref> Victor Pan proposed so-called feasible sub-cubic matrix multiplication algorithms with an exponent slightly above 2.77, but in return with a much smaller hidden constant coefficient. <ref>{{Citation \| last1=Laderman \| first1=Julian \| last2=Pan \| first2=Victor \|last3=Sha \| first3=Xuan-He \| title=On practical algorithms for accelerated matrix multiplication \| year=1992 \| journal=Linear Algebra and Its Applications \| volume=162-164 \| pages=557–588 \| doi=10.1016/0024-3795(92)90393-O}}</ref> [[Freivalds' algorithm]] is a simple [[Monte Carlo algorithm]] that, given matrices {{mvar\|A}}, {{mvar\|B}} and {{mvar\|C}}, verifies in {{math\|Θ(''n''<sup>2</sup>)}} time if {{math\|''AB'' {{=}} ''C''}}. === AlphaTensor === In 2022, [[DeepMind]] introduced AlphaTensor, a [[neural network]] that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not.<ref>{{Cite web \|title=Discovering novel algorithms with AlphaTensor \|url=https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor \|access-date=2022-11-01 \|website=www.deepmind.com \|date=5 October 2022 \|language=en}}</ref> Operations were restricted to the {{Clarify\|reason= which field is that?\|text= non-commutative ground field\|date= May 2025}}(normal arithmetic) and [[GF(2)\|finite field <math>\mathbb Z/2\mathbb Z</math>]] (mod 2 arithmetic). The best "practical" (explicit low-rank decomposition of a matrix multiplication tensor) algorithm found ran in O(n<sup>2.778</sup>).<ref name="alphatensor">{{Cite journal \|last1=Fawzi \|first1=Alhussein \|last2=Balog \|first2=Matej \|last3=Huang \|first3=Aja \|last4=Hubert \|first4=Thomas \|last5=Romera-Paredes \|first5=Bernardino \|last6=Barekatain \|first6=Mohammadamin \|last7=Novikov \|first7=Alexander \|last8=R. Ruiz \|first8=Francisco J. \|last9=Schrittwieser \|first9=Julian \|last10=Swirszcz \|first10=Grzegorz \|last11=Silver \|first11=David \|last12=Hassabis \|first12=Demis \|last13=Kohli \|first13=Pushmeet \|date=October 2022 \|title=Discovering faster matrix multiplication algorithms with reinforcement learning \|journal=Nature \|volume=610 \|issue=7930 \|pages=47–53 \|doi=10.1038/s41586-022-05172-4 \|pmid=36198780 \|pmc=9534758 \|bibcode=2022Natur.610...47F \|issn=1476-4687}}</ref> Finding low-rank decompositions of such tensors (and beyond) is NP-hard; optimal multiplication even for ~~3x3~~3×3 matrices [[Computational complexity of matrix multiplication#Minimizing number of multiplications\|remains unknown]], even in commutative field.<ref name="alphatensor"/> On ~~4x4~~4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved ~~5x5~~5×5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent ~~4x4~~4×4 algorithm, and separately tweaked Deepmind's 96-step ~~5x5~~5×5 algorithm down to 95 steps in mod 2 arithmetic and to 97<ref>{{Cite arXiv \|last1=Kauers \|first1=Manuel \|last2=Moosbauer \|first2=Jakob \|date=2022-12-02 \|title=Flip Graphs for Matrix Multiplication \|class=cs.SC \|eprint=2212.01175 }}</ref> in normal arithmetic.<ref>{{cite news \|last1=Brubaker \|first1=Ben \|title=AI Reveals New Possibilities in Matrix Multiplication \|url=https://www.quantamagazine.org/ai-reveals-new-possibilities-in-matrix-multiplication-20221123/ \|access-date=26 November 2022 \|work=Quanta Magazine \|date=23 November 2022 \|language=en}}</ref> Some algorithms were ~~never~~completely ~~discovered~~new: ~~before~~for example, ~~e.g.~~ (4, 5, 5) ~~got~~was improved to 76 steps from a baseline of 80 in both normal and mod 2 ~~arithmetics~~arithmetic. ==Parallel and distributed algorithms== Line 242 ⟶ 275: On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math\|Ω(''n''<sup>3</sup>)}} communication bandwidth. [[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math\|{{sqrt\|''M''/3}}}} by {{math\|{{sqrt\|''M''/3}}}}, where {{math\|''M''}} is the size of fast memory.<ref>{{cite thesis \|first=Lynn Elliot \|last=Cannon \|title=A cellular computer to implement the Kalman Filter Algorithm \|date=14 July 1969 \|type=Ph.D. \|publisher=Montana State University \|url=https://dl.acm.org/doi/abs/10.5555/905686 }}</ref> The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to {{math\|''O''(''n''<sup>3</sup>/{{sqrt\|''M''}})}}, which is asymptotically optimal (for algorithms performing {{math\|Ω(''n''<sup>3</sup>)}} computation).<ref>{{cite ~~journal~~book\|last1=Hong\|first1=J. W.\|first2=H. T. \|last2=Kung\|~~title~~author2-link=H. T. Kung\|chapter=I/O complexity: The red-blue pebble game \|~~journal~~title=~~STOC '81:~~ Proceedings of the ~~Thirteenth~~thirteenth ~~Annual~~annual ACM ~~Symposium~~symposium on Theory of ~~Computing~~computing - STOC '81 \|year=1981\|pages=326–333\|chapter-url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf\|archive-url=https://web.archive.org/web/20191215182754/https://apps.dtic.mil/dtic/tr/fulltext/u2/a104739.pdf\|url-status=live\|archive-date=December 15, 2019 \|doi=10.1145/800076.802486\|s2cid=8410593 }}</ref><ref name=irony>{{cite journal\|last1=Irony\|first1=Dror\|first2=Sivan \|last2=Toledo \|first3=Alexander \|last3=Tiskin \|title=Communication lower bounds for distributed-memory matrix multiplication\|journal=J. Parallel Distrib. Comput.\|date=September 2004\|volume=64\|issue=9\|pages=1017–26\|doi=10.1016/j.jpdc.2004.03.021\|citeseerx=10.1.1.20.7034}}</ref> In a distributed setting with {{mvar\|p}} processors arranged in a {{math\|{{sqrt\|''p''}}}} by {{math\|{{sqrt\|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math\|''O''(''n''<sup>2</sup>/{{sqrt\|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math\|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/> This can be improved by the ''3D algorithm,'' which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal\|last1=Agarwal\|first1=R.C.\|first2=S. M. \|last2=Balle \|first3=F. G. \|last3=Gustavson \|first4=M. \|last4=Joshi \|first5=P. \|last5=Palkar \|title=A three-dimensional approach to parallel matrix multiplication\|journal=IBM J. Res. Dev.\|date=September 1995\|volume=39\|issue=5\|pages=575–582\|doi=10.1147/rd.395.0575\|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math\|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, this requires replicating each input matrix element {{math\|''p''<sup>1/3</sup>}} times, and so requires a factor of {{math\|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.<ref name="Agarwal"/> "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.<ref>{{cite conference\|last1=Solomonik\|first1=Edgar\|first2=James \|last2=Demmel\|author2-link=James Demmel\|title=Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms\|book-title=Proceedings of the 17th International Conference on Parallel Processing\|year=2011\|volume=Part II\|pages=90–109 \|doi=10.1007/978-3-642-23397-5_10 \|isbn=978-3-642-23397-5 \|url=https://solomonik.cs.illinois.edu/talks/europar-sep-2011.pdf}}</ref> On modern distributed computing environments such as [[MapReduce]], specialized multiplication algorithms have been developed.<ref>{{cite web \|last1=Bosagh Zadeh\|first1=Reza\|last2=Carlsson\|first2=Gunnar\|title=Dimension Independent Matrix Square Using MapReduce\|year=2013\|arxiv=1304.1467\|bibcode=2013arXiv1304.1467B\|url=https://stanford.edu/~rezab/papers/dimsum.pdf\|access-date=12 July 2014}}</ref> ===Algorithms for meshes=== Line 252 ⟶ 285: The result is even faster on a two-layered cross-wired mesh, where only 2''n''-1 steps are needed.<ref>{{cite journal \| last1 = Kak \| first1 = S \| year = 1988 \| title = A two-layered mesh array for matrix multiplication \| journal = Parallel Computing \| volume = 6 \| issue = 3\| pages = 383–5 \| doi = 10.1016/0167-8191(88)90078-6 \| citeseerx = 10.1.1.88.8527 }}</ref> The performance improves further for repeated computations leading to 100% efficiency.<ref>{{cite arXiv \|last=Kak \|first=S. \|date=2014 \|title=Efficiency of matrix multiplication on the cross-wired mesh array \|class=cs.DC \|eprint=1411.3273}}</ref> The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure.<ref>{{cite journal \| last1 = Kak \| first1 = S \| year = 1988 \| title = Multilayered array computing \| journal = Information Sciences \| volume = 45 \| issue = 3\| pages = 347–365 \| doi = 10.1016/0020-0255(88)90010-2 \| citeseerx = 10.1.1.90.4753 }}</ref> In a 3D mesh with ''n''<sup>3</sup> processing elements, two matrices can be multiplied in <math>\mathcal{O}(\log n)</math> using the DNS algorithm.<ref>{{cite journal \| last1 = Dekel \| first1 = Eliezer \| last2 = Nassimi \| first2 = David \| last3 = Sahni \| first3 = Sartaj \| year = 1981 \| title = Parallel Matrix and Graph Algorithms \| journal = SIAM Journal on Computing \| volume = 10 \| issue = 4 \| pages=657–675 \| doi = 10.1137/0210049}}</ref> ==See also== Line 258 ⟶ 293: * [[CYK algorithm#Valiant's algorithm\|CYK algorithm § Valiant's algorithm]] * [[Matrix chain multiplication]] * [[Sparse matrix–vector multiplication]]▼ * [[Method of Four Russians]] * [[Multiplication algorithm]] ▲* [[Sparse matrix–vector multiplication]] ==References==