:<math>\boldsymbol{B} \leftarrow \alpha \boldsymbol{T}^{-1} \boldsymbol{B},</math>
where {{math|'''''T'''''}} is a [[triangular matrix]], among other functionality.
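The triangular-solve operation above can be sketched in a few lines. The following is a minimal illustration for the lower-triangular, non-transposed case only (the function name <code>trsm_lower</code> is hypothetical; the actual BLAS <code>trsm</code> routine also handles upper-triangular, transposed, unit-diagonal, and right-side variants, and is heavily blocked and vectorized in practice):

```python
def trsm_lower(T, B, alpha=1.0):
    """Overwrite B with alpha * T^{-1} * B by forward substitution,
    where T is a nonsingular lower triangular n x n matrix and B is
    n x m, both given as nested lists.  A sketch, not the BLAS API."""
    n = len(T)
    m = len(B[0])
    for j in range(m):          # each column of B is an independent solve
        for i in range(n):
            s = alpha * B[i][j]
            for k in range(i):
                s -= T[i][k] * B[k][j]   # B[k][j] already holds the solution
            B[i][j] = s / T[i][i]
    return B
```

Because each column of {{math|'''''B'''''}} is solved independently, the {{math|''m''}} right-hand sides can be processed in any order, which is what lets tuned implementations block and parallelize this routine.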
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,<ref name="Geijn_2008"/> and because faster algorithms exist beyond the obvious repetition of matrix–vector multiplication, <code>gemm</code> is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of {{math|'''''A'''''}}, {{math|'''''B'''''}} into [[Block matrix|block matrices]], <code>gemm</code> can be [[Matrix multiplication algorithm#Divide-and-conquer algorithm|implemented recursively]]. This is one of the motivations for including the {{math|''β''}} parameter,{{dubious|Reason for beta parameter|date=January 2015}} so the results of previous blocks can be accumulated. Note that this decomposition requires the special case {{math|''β'' {{=}} 1}}, for which many implementations optimize, thereby eliminating one multiplication for each value of {{math|'''''C'''''}}. This decomposition allows for better [[locality of reference]] both in space and time of the data used in the product. This, in turn, takes advantage of the [[CPU cache|cache]] on the system.<ref>{{Citation | last1=Golub | first1=Gene H. | author1-link=Gene H. Golub | last2=Van Loan | first2=Charles F. | author2-link=Charles F. Van Loan | title=Matrix Computations | publisher=Johns Hopkins | edition=3rd | isbn=978-0-8018-5414-9 |date=1996}}</ref> For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as [[Automatically Tuned Linear Algebra Software|ATLAS]].
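The role of {{math|''β''}} in block accumulation can be sketched as follows. The code below (illustrative only; the function names and the simple panel scheme are assumptions, not any particular BLAS implementation) splits the inner dimension into panels and multiplies one panel at a time: the first panel uses the caller's {{math|''β''}}, and every subsequent panel calls <code>gemm</code> with {{math|''β'' {{=}} 1}} so that its partial product accumulates into {{math|'''''C'''''}}:

```python
def gemm(alpha, A, B, beta, C):
    """Reference gemm on nested lists: C <- alpha*A*B + beta*C."""
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]
    return C

def gemm_blocked(alpha, A, B, beta, C, bs=2):
    """Same result, computed panel by panel along the inner dimension.
    Only the first panel applies the caller's beta; later panels pass
    beta = 1 so their partial products accumulate into C."""
    k = len(B)
    for p0 in range(0, k, bs):
        p1 = min(p0 + bs, k)
        Ablk = [row[p0:p1] for row in A]   # n x bs column panel of A
        Bblk = B[p0:p1]                    # bs x m row panel of B
        gemm(alpha, Ablk, Bblk, beta if p0 == 0 else 1.0, C)
    return C
```

A real implementation would additionally choose the panel size to fit the cache and copy each panel into contiguous storage, but the accumulation pattern, scale {{math|'''''C'''''}} once and then add panel products with {{math|''β'' {{=}} 1}}, is the same.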
More recently, implementations by [[Kazushige Goto]] have shown that blocking only for the [[L2 cache]], combined with careful [[amortized analysis|amortizing]] of copying to contiguous memory to reduce [[translation lookaside buffer|TLB]] misses, is superior to [[Automatically Tuned Linear Algebra Software|ATLAS]].<ref name="Kazushige_2008"/> A highly tuned implementation based on these ideas is part of the [[GotoBLAS]], [[OpenBLAS]] and [[BLIS (software)|BLIS]].