where {{math|'''''T'''''}} is a triangular matrix, among other functionality.
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,<ref name="Geijn_2008"/> and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, <code>gemm</code> is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of {{math|'''''A'''''}} and {{math|'''''B'''''}} into [[Block matrix|block matrices]], <code>gemm</code> can be [[Matrix multiplication algorithm#Divide-and-conquer algorithm|implemented recursively]]. This is one of the motivations for including the {{math|''β''}} parameter,{{dubious|Reason for beta parameter|date=January 2015}} so the results of previous blocks can be accumulated. Note that this decomposition requires the special case {{math|''β'' {{=}} 1}}, which many implementations optimize for, thereby eliminating one multiplication for each value of {{math|'''''C'''''}}. This decomposition allows for better [[locality of reference]] in both space and time for the data used in the product, which in turn takes advantage of the system's [[CPU cache|cache]].<ref>{{Citation | last1=Golub | first1=Gene H. | author1-link=Gene H. Golub | last2=Van Loan | first2=Charles F. | author2-link=Charles F. Van Loan | title=Matrix Computations | publisher=Johns Hopkins | edition=3rd | isbn=978-0-8018-5414-9 |date=1996}}</ref> For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both levels of optimization are used in implementations such as [[Automatically Tuned Linear Algebra Software|ATLAS]].
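The role of the {{math|''β'' {{=}} 1}} special case in block accumulation can be illustrated with a minimal sketch (plain Python, far from a tuned BLAS kernel; the function names <code>gemm</code> and <code>blocked_gemm</code> here are illustrative, not library routines): the outer routine scales {{math|'''''C'''''}} by {{math|''β''}} once, then each partial block product is folded in with {{math|''β'' {{=}} 1}}.

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: update C in place with alpha*A*B + beta*C."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]

def blocked_gemm(alpha, A, B, beta, C, bs=2):
    """Same result, computed block by block over the shared dimension.
    C is scaled by beta exactly once up front; every partial product is
    then accumulated with beta = 1, the case implementations optimize."""
    k = len(B)
    for row in C:
        for j in range(len(row)):
            row[j] *= beta                 # apply beta exactly once
    for p0 in range(0, k, bs):             # loop over blocks of the k dimension
        Ablk = [row[p0:p0 + bs] for row in A]
        Bblk = B[p0:p0 + bs]
        gemm(alpha, Ablk, Bblk, 1.0, C)    # accumulate: beta = 1
```

In a real implementation the blocks would also be chosen to fit in cache, but the arithmetic structure is the same: after the first block, every subsequent update is a {{math|''β'' {{=}} 1}} accumulation.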
More recently, implementations by [[Kazushige Goto]] have shown that blocking only for the [[L2 cache]], combined with careful [[amortized analysis|amortizing]] of copying to contiguous memory to reduce [[translation lookaside buffer|TLB]] misses, is superior to [[Automatically Tuned Linear Algebra Software|ATLAS]].<ref name="Kazushige_2008"/> Highly tuned implementations based on these ideas are part of [[GotoBLAS]], [[OpenBLAS]] and [[BLIS (software)|BLIS]].
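The "packing" step can be shown schematically (a toy sketch in plain Python; the names <code>pack_block</code> and <code>packed_multiply</code> are illustrative and the real benefit only appears with hardware caches and a tuned microkernel): a block of {{math|'''''A'''''}} is first copied into one flat, contiguous buffer, so that the inner loops then stream through consecutive addresses instead of striding across rows of the original matrix.

```python
def pack_block(A, i0, p0, mb, kb):
    """Copy the mb-by-kb block of A starting at (i0, p0) into a single
    contiguous list, row by row."""
    buf = []
    for i in range(i0, i0 + mb):
        buf.extend(A[i][p0:p0 + kb])
    return buf

def packed_multiply(A, B, C, i0, p0, mb, kb):
    """C += Ablk * Brows, reading the A block only through its packed,
    contiguous copy (as a tuned inner kernel would)."""
    buf = pack_block(A, i0, p0, mb, kb)
    n = len(C[0])
    for i in range(mb):
        for j in range(n):
            s = 0.0
            for p in range(kb):
                s += buf[i * kb + p] * B[p0 + p][j]
            C[i0 + i][j] += s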
A common variation of {{code|gemm}} is {{code|gemm3m}}, which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to the [[Strassen algorithm]], first described by Peter Ungar.<ref>{{cite journal |last1=Van Zee |first1=Field G. |last2=Smith |first2=Tyler M. |title=Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods |journal=ACM Transactions on Mathematical Software |date=24 July 2017 |volume=44 |issue=1 |pages=1–36 |doi=10.1145/3086466|s2cid=25580883 }}</ref>
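The 3m identity itself is short: with {{math|'''''A''''' {{=}} '''''A'''''{{sub|r}} + ''i'''''A'''''{{sub|i}}}} and {{math|'''''B''''' {{=}} '''''B'''''{{sub|r}} + ''i'''''B'''''{{sub|i}}}}, the real part of the product is {{math|'''''A'''''{{sub|r}}'''''B'''''{{sub|r}} − '''''A'''''{{sub|i}}'''''B'''''{{sub|i}}}} and the imaginary part is {{math|('''''A'''''{{sub|r}} + '''''A'''''{{sub|i}})('''''B'''''{{sub|r}} + '''''B'''''{{sub|i}}) − '''''A'''''{{sub|r}}'''''B'''''{{sub|r}} − '''''A'''''{{sub|i}}'''''B'''''{{sub|i}}}}, so the two products from the real part are reused. A minimal sketch in plain Python (the helper names are illustrative, not part of any BLAS interface):

```python
def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def msub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mmul(X, Y):
    return [[sum(X[i][p] * Y[p][j] for p in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gemm3m(Ar, Ai, Br, Bi):
    """Complex product via three real multiplications and five real
    additions:  Cr = Ar*Br - Ai*Bi,
                Ci = (Ar + Ai)(Br + Bi) - Ar*Br - Ai*Bi."""
    P1 = mmul(Ar, Br)
    P2 = mmul(Ai, Bi)
    P3 = mmul(madd(Ar, Ai), madd(Br, Bi))   # additions 1 and 2
    Cr = msub(P1, P2)                        # addition 3: real part
    Ci = msub(msub(P3, P1), P2)              # additions 4 and 5: imaginary part
    return Cr, Ci
```

The three multiplications are {{math|'''''P'''''{{sub|1}}}}, {{math|'''''P'''''{{sub|2}}}}, and {{math|'''''P'''''{{sub|3}}}}; since each real matrix multiplication dominates the cost, trading one multiplication for three extra additions pays off for sufficiently large matrices.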