:<math>\boldsymbol{B} \leftarrow \alpha \boldsymbol{T}^{-1} \boldsymbol{B},</math>
where {{math|'''''T'''''}} is a [[triangular matrix]], among other functionality.
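The triangular-solve operation above can be sketched in a few lines. The following is a minimal illustration for the lower-triangular, non-transposed case only (the function name <code>trsm_lower</code> is hypothetical; the actual BLAS <code>trsm</code> routine also handles upper-triangular, transposed, unit-diagonal, and right-side variants, and is heavily blocked and vectorized in practice):

```python
def trsm_lower(T, B, alpha=1.0):
    """Overwrite B with alpha * T^{-1} * B by forward substitution,
    where T is a nonsingular lower triangular n x n matrix and B is
    n x m, both given as nested lists.  A sketch, not the BLAS API."""
    n = len(T)
    m = len(B[0])
    for j in range(m):          # each column of B is an independent solve
        for i in range(n):
            s = alpha * B[i][j]
            for k in range(i):
                s -= T[i][k] * B[k][j]   # B[k][j] already holds the solution
            B[i][j] = s / T[i][i]
    return B
```

Because each column of {{math|'''''B'''''}} is solved independently, the {{math|''m''}} right-hand sides can be processed in any order, which is what lets tuned implementations block and parallelize this routine.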
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,<ref name="Geijn_2008"/> and because faster algorithms exist beyond the obvious repetition of matrix–vector multiplication, <code>gemm</code> is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of {{math|'''''A'''''}}, {{math|'''''B'''''}} into [[Block matrix|block matrices]], <code>gemm</code> can be [[Matrix multiplication algorithm#Divide-and-conquer algorithm|implemented recursively]]. This is one of the motivations for including the {{math|''β''}} parameter,{{dubious|Reason for beta parameter|date=January 2015}} so the results of previous blocks can be accumulated. Note that this decomposition requires the special case {{math|''β'' {{=}} 1}}, for which many implementations optimize, thereby eliminating one multiplication for each value of {{math|'''''C'''''}}. This decomposition allows for better [[locality of reference]] both in space and time of the data used in the product. This, in turn, takes advantage of the [[CPU cache|cache]] on the system.<ref>{{Citation | last1=Golub | first1=Gene H. | author1-link=Gene H. Golub | last2=Van Loan | first2=Charles F. | author2-link=Charles F. Van Loan | title=Matrix Computations | publisher=Johns Hopkins | edition=3rd | isbn=978-0-8018-5414-9 |date=1996}}</ref> For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as [[Automatically Tuned Linear Algebra Software|ATLAS]].
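The role of {{math|''β''}} in block accumulation can be sketched as follows. The code below (illustrative only; the function names and the simple panel scheme are assumptions, not any particular BLAS implementation) splits the inner dimension into panels and multiplies one panel at a time: the first panel uses the caller's {{math|''β''}}, and every subsequent panel calls <code>gemm</code> with {{math|''β'' {{=}} 1}} so that its partial product accumulates into {{math|'''''C'''''}}:

```python
def gemm(alpha, A, B, beta, C):
    """Reference gemm on nested lists: C <- alpha*A*B + beta*C."""
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]
    return C

def gemm_blocked(alpha, A, B, beta, C, bs=2):
    """Same result, computed panel by panel along the inner dimension.
    Only the first panel applies the caller's beta; later panels pass
    beta = 1 so their partial products accumulate into C."""
    k = len(B)
    for p0 in range(0, k, bs):
        p1 = min(p0 + bs, k)
        Ablk = [row[p0:p1] for row in A]   # n x bs column panel of A
        Bblk = B[p0:p1]                    # bs x m row panel of B
        gemm(alpha, Ablk, Bblk, beta if p0 == 0 else 1.0, C)
    return C
```

A real implementation would additionally choose the panel size to fit the cache and copy each panel into contiguous storage, but the accumulation pattern, scale {{math|'''''C'''''}} once and then add panel products with {{math|''β'' {{=}} 1}}, is the same.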
More recently, implementations by [[Kazushige Goto]] have shown that blocking only for the [[L2 cache]], combined with careful [[amortized analysis|amortizing]] of copying to contiguous memory to reduce [[translation lookaside buffer|TLB]] misses, is superior to [[Automatically Tuned Linear Algebra Software|ATLAS]].<ref name="Kazushige_2008"/> A highly tuned implementation based on these ideas is part of the [[GotoBLAS]], [[OpenBLAS]] and [[BLIS (software)|BLIS]].