Because [[matrix multiplication]] is such a central operation in many [[numerical algorithm]]s, much work has been invested in making '''matrix multiplication algorithms''' efficient. Applications of matrix multiplication in computational problems are found in many fields including [[scientific computing]] and [[pattern recognition]] and in seemingly unrelated problems such as counting the paths through a [[Graph (graph theory)|graph]].<ref name="skiena"/> Many different algorithms have been designed for multiplying matrices on different types of hardware, including [[parallel computing|parallel]] and [[distributed computing|distributed]] systems, where the computational work is spread over multiple processors (perhaps over a network).
Directly applying the mathematical definition of matrix multiplication gives an algorithm that [[Analysis of algorithms|takes time]] on the order of {{math|''n''<sup>3</sup>}} [[Field (mathematics)|field]] operations to multiply two {{math|''n'' × ''n''}} matrices over that field ({{math|Θ(''n''<sup>3</sup>)}} in [[big O notation]]). Better asymptotic bounds on the time required to multiply matrices have been known since [[Strassen algorithm|Strassen's algorithm]] in the 1960s, but the optimal time (that is, the [[computational complexity of matrix multiplication]]) remains unknown. {{As of|2020|12}}, the matrix multiplication algorithm with the best asymptotic complexity runs in {{math|O(''n''<sup>2.3728596</sup>)}} time, due to Josh Alman and [[Virginia Vassilevska Williams]].<ref name="aw20"/>
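For illustration, the definition translates directly into code. The following Python sketch (a minimal, unoptimized transcription; the function name is only illustrative) performs exactly the {{math|Θ(''n''<sup>3</sup>)}} field operations described above for square matrices stored as lists of lists:

<syntaxhighlight lang="python">
def matmul_naive(A, B):
    """Entry-by-entry product of an n x m matrix A and an m x p matrix B,
    directly from the definition C[i][j] = sum over k of A[i][k] * B[k][j].
    For n = m = p this performs Theta(n^3) multiplications and additions."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
</syntaxhighlight>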
In 2022, [[DeepMind]] introduced AlphaTensor, a [[neural network]] that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans.<ref>{{Cite web |title=Discovering novel algorithms with AlphaTensor |url=https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor |access-date=2022-11-01 |website=www.deepmind.com |language=en}}</ref> The best practical algorithm runs in {{math|''O''(''n''<sup>2.778</sup>)}} time for matrices over the [[GF(2)|finite field <math>\mathbb Z/2\mathbb Z</math>]].<ref>{{Cite journal |last1=Fawzi |first1=Alhussein |last2=Balog |first2=Matej |last3=Huang |first3=Aja |last4=Hubert |first4=Thomas |last5=Romera-Paredes |first5=Bernardino |last6=Barekatain |first6=Mohammadamin |last7=Novikov |first7=Alexander |last8=R. Ruiz |first8=Francisco J. |last9=Schrittwieser |first9=Julian |last10=Swirszcz |first10=Grzegorz |last11=Silver |first11=David |last12=Hassabis |first12=Demis |last13=Kohli |first13=Pushmeet |date=October 2022 |title=Discovering faster matrix multiplication algorithms with reinforcement learning |journal=Nature |volume=610 |issue=7930 |pages=47–53 |doi=10.1038/s41586-022-05172-4 |pmid=36198780 |pmc=9534758 |issn=1476-4687}}</ref>
Here, ''fork'' is a keyword that signals that a computation may be run in parallel with the rest of the function call, while ''join'' waits for all previously "forked" computations to complete. {{math|partition}} achieves its goal by pointer manipulation only.
This algorithm has a [[critical path length]] of {{math|Θ(log<sup>2</sup> ''n'')}} steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible [[speedup]] of {{math|Θ(''n''<sup>3</sup>/log<sup>2</sup> ''n'')}} on any real computer. The algorithm is not practical due to the communication cost inherent in moving data to and from the temporary matrix {{mvar|T}}, but a more practical variant achieves {{math|Θ(''n''<sup>2</sup>)}} speedup, without using a temporary matrix.<ref name="cilk">{{cite thesis |type=Ph.D. |last=Randall |first=Keith H. |title=Cilk: Efficient Multithreaded Computing |publisher=Massachusetts Institute of Technology |year=1998 |pages=54–57 |url=http://supertech.csail.mit.edu/papers/randall-phdthesis.pdf }}</ref>
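The fork/join control structure can be sketched in Python as follows. This is only an illustration of the recursive divide-and-conquer pattern (all names such as <code>pmatmul</code> are illustrative, and CPython threads do not actually execute the arithmetic in parallel because of the global interpreter lock); each level forks the eight sub-products of the quadrants and joins them before combining the results:

<syntaxhighlight lang="python">
import threading

def matmul_serial(A, B):
    """Base case: ordinary triple-loop multiply."""
    n, p = len(A), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(len(B)):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

def split(M):
    """Partition an n x n matrix (n even) into four n/2 x n/2 quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])

def add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def pmatmul(A, B, out, cutoff=32):
    """Divide-and-conquer multiply; the result is stored in out[0]."""
    n = len(A)
    if n <= cutoff:                      # small problems: serial multiply
        out[0] = matmul_serial(A, B)
        return
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    pairs = [(A11, B11), (A12, B21), (A11, B12), (A12, B22),
             (A21, B11), (A22, B21), (A21, B12), (A22, B22)]
    results = [[None] for _ in pairs]
    # "fork": start the eight sub-products concurrently ...
    threads = [threading.Thread(target=pmatmul, args=(X, Y, r, cutoff))
               for (X, Y), r in zip(pairs, results)]
    for t in threads:
        t.start()
    # ... "join": wait for all forked computations to complete
    for t in threads:
        t.join()
    C11 = add(results[0][0], results[1][0])
    C12 = add(results[2][0], results[3][0])
    C21 = add(results[4][0], results[5][0])
    C22 = add(results[6][0], results[7][0])
    out[0] = ([r1 + r2 for r1, r2 in zip(C11, C12)] +
              [r1 + r2 for r1, r2 in zip(C21, C22)])

out = [None]
pmatmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], out, cutoff=1)
print(out[0])   # [[19, 22], [43, 50]]
</syntaxhighlight>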
[[File:Block matrix multiplication.svg|thumb|Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of {{mvar|C}}. In the 3D algorithm, every pair of submatrices from {{mvar|A}} and {{mvar|B}} that is multiplied is assigned to one processor.]]
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses {{math|Ω(''n''<sup>3</sup>)}} communication bandwidth.
[[Cannon's algorithm]], also known as the ''2D algorithm'', is a [[communication-avoiding algorithm]] that partitions each input matrix into a block matrix whose elements are submatrices of size {{math|{{sqrt|''M''/3}}}} by {{math|{{sqrt|''M''/3}}}}, where {{math|''M''}} is the size of fast memory.<ref>{{cite thesis |first=Lynn Elliot |last=Cannon |type=Ph.D. |title=A cellular computer to implement the Kalman Filter Algorithm |publisher=Montana State University |year=1969}}</ref>
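The effect of such blocking can be sketched in serial code. The following Python fragment (an illustrative tiling sketch only, not Cannon's distributed algorithm; the block size <code>b</code> is a free parameter, corresponding to roughly {{math|{{sqrt|''M''/3}}}} when a fast memory of size {{mvar|M}} is available) reorders the three nested loops so that only one {{math|''b'' × ''b''}} block of each of {{mvar|A}}, {{mvar|B}} and {{mvar|C}} is in use at any moment:

<syntaxhighlight lang="python">
def matmul_blocked(A, B, b):
    """Multiply n x n matrices by iterating over b x b blocks.
    Each element of A and B is then loaded roughly n/b times
    instead of n times, reducing data movement between slow and
    fast memory."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):          # block row of C
        for jj in range(0, n, b):      # block column of C
            for kk in range(0, n, b):  # block inner dimension
                # multiply the (ii, kk) block of A by the (kk, jj) block of B
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += aik * B[k][j]
    return C
</syntaxhighlight>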
In a distributed setting with {{mvar|p}} processors arranged in a {{math|{{sqrt|''p''}}}} by {{math|{{sqrt|''p''}}}} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting {{math|''O''(''n''<sup>2</sup>/{{sqrt|''p''}})}} words, which is asymptotically optimal assuming that each node stores the minimum {{math|''O''(''n''<sup>2</sup>/''p'')}} elements.<ref name=irony/> This can be improved by the ''3D algorithm,'' which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.<ref name="Agarwal">{{cite journal|last1=Agarwal|first1=R.C.|first2=S. M. |last2=Balle |first3=F. G. |last3=Gustavson |first4=M. |last4=Joshi |first5=P. |last5=Palkar |title=A three-dimensional approach to parallel matrix multiplication|journal=IBM J. Res. Dev.|date=September 1995|volume=39|issue=5|pages=575–582|doi=10.1147/rd.395.0575|citeseerx=10.1.1.44.3404}}</ref> This algorithm transmits {{math|''O''(''n''<sup>2</sup>/''p''<sup>2/3</sup>)}} words per processor, which is asymptotically optimal.<ref name=irony/> However, this requires replicating each input matrix element {{math|''p''<sup>1/3</sup>}} times, and so requires a factor of {{math|''p''<sup>1/3</sup>}} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.<ref name="Agarwal"/> "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.
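The communication pattern of the 2D algorithm can be illustrated by a serial simulation. In the sketch below (illustrative only: the {{math|''q'' × ''q''}} grid of "processors" is modelled as a Python array, the block shifts stand in for the messages that real nodes would exchange, and all names are assumptions of this example), each grid cell multiplies and accumulates only its own pair of blocks in every round:

<syntaxhighlight lang="python">
def cannon_2d(A, B, q):
    """Serial simulation of the 2D (Cannon-style) algorithm on a q x q grid.
    n must be divisible by q; 'processor' (i, j) owns one block each of
    A, B and C, and after q rounds of multiply-and-shift it holds the
    (i, j) block of the product (p = q * q processors in total)."""
    n = len(A)
    b = n // q

    def block(M, i, j):
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # Initial alignment: row i of A blocks is shifted left by i,
    # column j of B blocks is shifted up by j.
    a = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # each 'processor' multiplies its current pair of blocks ...
        for i in range(q):
            for j in range(q):
                for x in range(b):
                    for k in range(b):
                        axk = a[i][j][x][k]
                        for y in range(b):
                            c[i][j][x][y] += axk * bb[i][j][k][y]
        # ... then A blocks shift one step left and B blocks one step up
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # reassemble the full result from the per-processor C blocks
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for x in range(b):
                for y in range(b):
                    C[i * b + x][j * b + y] = c[i][j][x][y]
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
I = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(cannon_2d(A, I, 2) == A)   # True: multiplying by the identity
</syntaxhighlight>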
===Algorithms for meshes===