Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was [[Strassen algorithm|Strassen's algorithm]], devised by [[Volker Strassen]] in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two {{math|2 × 2}}-matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of <math>O( n^{\log_{2}7}) \approx O(n^{2.807})</math>. Strassen's algorithm is more complex, and the [[numerical stability]] is reduced compared to the naïve algorithm,<ref>{{Citation | last1=Miller | first1=Webb | title=Computational complexity and numerical stability | citeseerx = 10.1.1.148.9947 | year=1975 | journal=SIAM Journal on Computing | volume=4 | issue=2 | pages=97–107 | doi=10.1137/0204009}}</ref> but it is faster in cases where {{math|''n'' > 100}} or so<ref name="skiena"/> and appears in several libraries, such as [[Basic Linear Algebra Subprograms|BLAS]].<ref>{{cite book |last1=Press |first1=William H. |last2=Flannery |first2=Brian P. |last3=Teukolsky |first3=Saul A. |author3-link=Saul Teukolsky |last4=Vetterling |first4=William T. |title=Numerical Recipes: The Art of Scientific Computing |publisher=[[Cambridge University Press]] |edition=3rd |isbn=978-0-521-88068-8 |year=2007 |page=[https://archive.org/details/numericalrecipes00pres_033/page/n131 108]|title-link=Numerical Recipes }}</ref> It is very useful for large matrices over exact domains such as [[finite field]]s, where numerical stability is not an issue.
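
A minimal sketch of the recursion follows, assuming NumPy, square matrices whose dimension is a power of two, and an arbitrary <code>cutoff</code> below which the ordinary product is used; the function name and the cutoff are illustrative choices, not part of the algorithm itself:

<syntaxhighlight lang="python">
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B (size a power of two) with Strassen's 7-product scheme."""
    n = A.shape[0]
    if n <= cutoff:                       # small blocks: fall back to the ordinary product
        return A @ B
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    # The seven block products that replace the usual eight
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Reassemble the four blocks of the result from the seven products
    C = np.empty_like(A)
    C[:k, :k] = M1 + M4 - M5 + M7
    C[:k, k:] = M3 + M5
    C[k:, :k] = M2 + M4
    C[k:, k:] = M1 - M2 + M3 + M6
    return C
</syntaxhighlight>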
 
It is an open question in [[theoretical computer science]] how well Strassen's algorithm can be improved in terms of [[Time complexity|asymptotic complexity]]. The ''matrix multiplication exponent'', usually denoted <math>\omega</math>, is the smallest real number for which any two <math>n\times n</math> matrices over a field can be multiplied together using <math>n^{\omega + o(1)}</math> field operations. The current best bound on <math>\omega</math> is <math>\omega < 2.3728596</math>, by Josh Alman and [[Virginia Vassilevska Williams]].<ref name="aw20" /> This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by [[Don Coppersmith]] and [[Shmuel Winograd]] in 1990.<ref name="coppersmith">{{Citation|doi=10.1016/S0747-7171(08)80013-2 |title=Matrix multiplication via arithmetic progressions |url=http://www.cs.umd.edu/~gasarch/TOPICS/ramsey/matrixmult.pdf |year=1990 |last1=Coppersmith |first1=Don |last2=Winograd |first2=Shmuel |journal=Journal of Symbolic Computation |volume=9|issue=3|pages=251|doi-access=free }}</ref> The conceptual idea of these algorithms is similar to that of Strassen's algorithm: a way is devised for multiplying two {{math|''k'' × ''k''}}-matrices with fewer than {{math|''k''<sup>3</sup>}} multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the [[Big O notation]] is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.<ref>{{citation
| last = Iliopoulos
| first = Costas S.
| doi = 10.1137/0218045
| issue = 4
| journal = SIAM Journal on Computing
| mr = 1004789
| quote = The Coppersmith–Winograd algorithm is not practical, due to the very large hidden constant in the upper bound on the number of multiplications required.
| pages = 658–669
| title = Worst-case complexity bounds on algorithms for computing the canonical structure of finite abelian groups and the Hermite and Smith normal forms of an integer matrix
| url = http://www.williamstein.org/home/wstein/www/home/pernet/Papers/Hermite/Iliopoulos88.pdf
| volume = 18
| year = 1989
| citeseerx = 10.1.1.531.9309
| access-date = 2015-01-16
| archive-url = https://web.archive.org/web/20140305182030/http://www.williamstein.org/home/wstein/www/home/pernet/Papers/Hermite/Iliopoulos88.pdf
| archive-date = 2014-03-05
| url-status = dead
}}</ref><ref name="robinson">{{Citation | last=Robinson | first=Sara | title=Toward an Optimal Algorithm for Matrix Multiplication | url=https://archive.siam.org/pdf/news/174.pdf | date=November 2005 | journal=SIAM News | volume=38 | issue=9 | quote=Even if someone manages to prove one of the conjectures—thereby demonstrating that {{math|1=''ω'' = 2}}—the wreath product approach is unlikely to be applicable to the large matrix problems that arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent.}}</ref>
 
[[Freivalds' algorithm]] is a simple [[Monte Carlo algorithm]] that, given matrices {{mvar|A}}, {{mvar|B}} and {{mvar|C}}, verifies in {{math|Θ(''n''<sup>2</sup>)}} time if {{math|''AB'' {{=}} ''C''}}.
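
A minimal sketch of the check, assuming NumPy and exact (for example, integer) matrices; the function name and the repetition count {{mvar|k}} are illustrative choices, with each repetition halving the probability of accepting a wrong {{mvar|C}}:

<syntaxhighlight lang="python">
import numpy as np

def freivalds(A, B, C, k=10, rng=None):
    """Return False if AB != C is detected; True means AB = C with probability at least 1 - 2**-k."""
    rng = np.random.default_rng() if rng is None else rng
    n = C.shape[0]
    for _ in range(k):
        r = rng.integers(0, 2, size=n)              # random vector in {0, 1}^n
        if not np.array_equal(A @ (B @ r), C @ r):  # three Θ(n²) matrix–vector products
            return False                            # a discrepancy certifies AB ≠ C
    return True
</syntaxhighlight>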
 
=== AlphaTensor ===
 
In 2022, [[DeepMind]] introduced AlphaTensor, a [[neural network]] that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not.<ref>{{Cite web |title=Discovering novel algorithms with AlphaTensor |url=https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor |access-date=2022-11-01 |website=www.deepmind.com |language=en}}</ref> Operations were restricted to the non-commutative ground field (normal arithmetic) and the [[GF(2)|finite field <math>\mathbb Z/2\mathbb Z</math>]] (mod 2 arithmetic). The best "practical" algorithm found (an explicit low-rank decomposition of a matrix multiplication tensor) runs in {{math|O(''n''<sup>2.778</sup>)}} time.<ref name="alphatensor">{{Cite journal |last1=Fawzi |first1=Alhussein |last2=Balog |first2=Matej |last3=Huang |first3=Aja |last4=Hubert |first4=Thomas |last5=Romera-Paredes |first5=Bernardino |last6=Barekatain |first6=Mohammadamin |last7=Novikov |first7=Alexander |last8=R. Ruiz |first8=Francisco J. |last9=Schrittwieser |first9=Julian |last10=Swirszcz |first10=Grzegorz |last11=Silver |first11=David |last12=Hassabis |first12=Demis |last13=Kohli |first13=Pushmeet |date=October 2022 |title=Discovering faster matrix multiplication algorithms with reinforcement learning |journal=Nature |volume=610 |issue=7930 |pages=47–53 |doi=10.1038/s41586-022-05172-4 |pmid=36198780 |pmc=9534758 |bibcode=2022Natur.610...47F |issn=1476-4687}}</ref> Finding low-rank decompositions of such tensors (and beyond) is NP-hard; the optimal way to multiply even {{math|3 × 3}} matrices [[Computational complexity of matrix multiplication#Minimizing number of multiplications|remains unknown]], even over a commutative field.<ref name="alphatensor"/> On {{math|4 × 4}} matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required by Strassen's 1969 algorithm, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved {{math|5 × 5}} matrices with 96 rather than Strassen's 98 steps. Building on the discovery that such improvements exist, other researchers quickly found a similar independent {{math|4 × 4}} algorithm, and separately tweaked DeepMind's 96-step {{math|5 × 5}} algorithm down to 95 steps in mod 2 arithmetic and to 97<ref>{{Cite arXiv |last1=Kauers |first1=Manuel |last2=Moosbauer |first2=Jakob |date=2022-12-02 |title=Flip Graphs for Matrix Multiplication |class=cs.SC |eprint=2212.01175 }}</ref> in normal arithmetic.<ref>{{cite news |last1=Brubaker |first1=Ben |title=AI Reveals New Possibilities in Matrix Multiplication |url=https://www.quantamagazine.org/ai-reveals-new-possibilities-in-matrix-multiplication-20221123/ |access-date=26 November 2022 |work=Quanta Magazine |date=23 November 2022 |language=en}}</ref> Some of the algorithms had not been discovered before; for example, the (4, 5, 5) case was improved from 80 to 76 multiplications in both normal and mod 2 arithmetic.
 
==Parallel and distributed algorithms==
 
===Shared-memory parallelism===
The [[#Divide-and-conquer algorithm|divide-and-conquer algorithm]] sketched earlier can be [[parallel algorithm|parallelized]] in two ways for [[shared-memory multiprocessor]]s. These are based on the fact that the eight recursive matrix multiplications in
 
:<math>\begin{pmatrix}
A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22}\\
A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\\
\end{pmatrix}
</math>
 
can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in [[fork–join model|fork–join style]] [[pseudocode]]:<ref name="cilk"/>
 
{{framebox|blue}}
'''Procedure''' {{math|multiply(''C'', ''A'', ''B'')}}:
 
* Base case: if {{math|''n'' {{=}} 1}}, set {{math|''c''<sub>11</sub> ← ''a''<sub>11</sub> × ''b''<sub>11</sub>}} (or multiply a small block matrix).
* Otherwise, allocate space for a new matrix {{mvar|T}} of shape {{math|''n'' × ''n''}}, then:
** Partition {{mvar|A}} into {{math|''A''<sub>11</sub>}}, {{math|''A''<sub>12</sub>}}, {{math|''A''<sub>21</sub>}}, {{math|''A''<sub>22</sub>}}.
** Partition {{mvar|B}} into {{math|''B''<sub>11</sub>}}, {{math|''B''<sub>12</sub>}}, {{math|''B''<sub>21</sub>}}, {{math|''B''<sub>22</sub>}}.
** Partition {{mvar|C}} into {{math|''C''<sub>11</sub>}}, {{math|''C''<sub>12</sub>}}, {{math|''C''<sub>21</sub>}}, {{math|''C''<sub>22</sub>}}.
** Partition {{mvar|T}} into {{math|''T''<sub>11</sub>}}, {{math|''T''<sub>12</sub>}}, {{math|''T''<sub>21</sub>}}, {{math|''T''<sub>22</sub>}}.
** Parallel execution:
*** ''Fork'' {{math|multiply(''C''<sub>11</sub>, ''A''<sub>11</sub>, ''B''<sub>11</sub>)}}.
*** ''Fork'' {{math|multiply(''C''<sub>12</sub>, ''A''<sub>11</sub>, ''B''<sub>12</sub>)}}.
*** ''Fork'' {{math|multiply(''C''<sub>21</sub>, ''A''<sub>21</sub>, ''B''<sub>11</sub>)}}.
*** ''Fork'' {{math|multiply(''C''<sub>22</sub>, ''A''<sub>21</sub>, ''B''<sub>12</sub>)}}.
*** ''Fork'' {{math|multiply(''T''<sub>11</sub>, ''A''<sub>12</sub>, ''B''<sub>21</sub>)}}.
*** ''Fork'' {{math|multiply(''T''<sub>12</sub>, ''A''<sub>12</sub>, ''B''<sub>22</sub>)}}.
*** ''Fork'' {{math|multiply(''T''<sub>21</sub>, ''A''<sub>22</sub>, ''B''<sub>21</sub>)}}.
*** ''Fork'' {{math|multiply(''T''<sub>22</sub>, ''A''<sub>22</sub>, ''B''<sub>22</sub>)}}.
** ''Join'' (wait for parallel forks to complete).
** {{math|add(''C'', ''T'')}}.
** Deallocate {{mvar|T}}.
 
'''Procedure''' {{math|add(''C'', ''T'')}} adds {{mvar|T}} into {{mvar|C}}, element-wise:
 
* Base case: if {{math|''n'' {{=}} 1}}, set {{math|''c''<sub>11</sub> ← ''c''<sub>11</sub> + ''t''<sub>11</sub>}} (or do a short loop, perhaps unrolled).
* Otherwise:
** Partition {{mvar|C}} into {{math|''C''<sub>11</sub>}}, {{math|''C''<sub>12</sub>}}, {{math|''C''<sub>21</sub>}}, {{math|''C''<sub>22</sub>}}.
** Partition {{mvar|T}} into {{math|''T''<sub>11</sub>}}, {{math|''T''<sub>12</sub>}}, {{math|''T''<sub>21</sub>}}, {{math|''T''<sub>22</sub>}}.
** In parallel:
*** ''Fork'' {{math|add(''C''<sub>11</sub>, ''T''<sub>11</sub>)}}.
*** ''Fork'' {{math|add(''C''<sub>12</sub>, ''T''<sub>12</sub>)}}.
*** ''Fork'' {{math|add(''C''<sub>21</sub>, ''T''<sub>21</sub>)}}.
*** ''Fork'' {{math|add(''C''<sub>22</sub>, ''T''<sub>22</sub>)}}.
** ''Join''.
{{frame-footer}}
 
Here, ''fork'' is a keyword that signals a computation may be run in parallel with the rest of the function call, while ''join'' waits for all previously "forked" computations to complete. {{math|partition}} achieves its goal by pointer manipulation only.
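
To make the fork–join structure concrete, the following sketch carries out one level of the pseudocode above in Python with NumPy (an illustrative rendering, not the implementation analyzed in the cited thesis): the eight block products are "forked" as thread-pool tasks, the "join" is the wait on their results, and the temporary matrix {{mvar|T}} is folded directly into the four additions.

<syntaxhighlight lang="python">
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_multiply(A, B):
    """One fork-join level: eight independent block products, then four sums after the join."""
    k = A.shape[0] // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    pairs = [(A11, B11), (A11, B12), (A21, B11), (A21, B12),
             (A12, B21), (A12, B22), (A22, B21), (A22, B22)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(np.matmul, X, Y) for X, Y in pairs]  # "fork" the eight products
        P = [f.result() for f in futures]                           # "join": wait for all of them
    C = np.empty_like(A)
    C[:k, :k] = P[0] + P[4]   # A11·B11 + A12·B21
    C[:k, k:] = P[1] + P[5]   # A11·B12 + A12·B22
    C[k:, :k] = P[2] + P[6]   # A21·B11 + A22·B21
    C[k:, k:] = P[3] + P[7]   # A21·B12 + A22·B22
    return C
</syntaxhighlight>

A full fork–join implementation would also recurse inside each task, as the pseudocode above does.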
 
This algorithm has a [[critical path length]] of {{math|Θ(log<sup>2</sup> ''n'')}} steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible [[speedup]] of {{math|Θ(''n''<sup>3</sup>/log<sup>2</sup> ''n'')}} on any real computer. The algorithm is not practical due to the communication cost inherent in moving data to and from the temporary matrix {{mvar|T}}, but a more practical variant achieves {{math|Θ(''n''<sup>2</sup>)}} speedup, without using a temporary matrix.<ref name="cilk">{{cite thesis |type=Ph.D. |last=Randall |first=Keith H. |title=Cilk: Efficient Multithreaded Computing |publisher=Massachusetts Institute of Technology |year=1998 |pages=54–57 |url=http://supertech.csail.mit.edu/papers/randall-phdthesis.pdf |hdl=1721.1/47519}}</ref>
 
[[File:Block matrix multiplication.svg|thumb|Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of {{mvar|C}}. In the 3D algorithm, every pair of submatrices from {{mvar|A}} and {{mvar|B}} that is multiplied is assigned to one processor.]]
 
===Communication-avoiding and distributed algorithms===