where {{math|'''''T'''''}} is a triangular matrix, among other functionality.
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,<ref name="Geijn_2008"/> and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, <code>gemm</code> is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of {{math|'''''A'''''}} and {{math|'''''B'''''}} into [[Block matrix|block matrices]], <code>gemm</code> can be [[Matrix multiplication algorithm#Divide-and-conquer algorithm|implemented recursively]]. This is one of the motivations for including the {{math|''β''}} parameter,{{dubious|Reason for beta parameter|date=January 2015}} so the results of previous blocks can be accumulated. Note that this decomposition requires the special case {{math|''β'' {{=}} 1}}, which many implementations optimize for, thereby eliminating one multiplication for each value of {{math|'''''C'''''}}. This decomposition allows for better [[locality of reference]] in both space and time for the data used in the product, which in turn takes advantage of the system's [[CPU cache|cache]].<ref>{{Citation | last1=Golub | first1=Gene H. | author1-link=Gene H. Golub | last2=Van Loan | first2=Charles F. | author2-link=Charles F. Van Loan | title=Matrix Computations | publisher=Johns Hopkins | edition=3rd | isbn=978-0-8018-5414-9 |date=1996}}</ref> For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both levels of optimization are used in implementations such as [[Automatically Tuned Linear Algebra Software|ATLAS]].
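The role of the {{math|''β'' {{=}} 1}} special case in block accumulation can be illustrated with a minimal sketch (plain Python, far from a tuned BLAS kernel; the function names <code>gemm</code> and <code>blocked_gemm</code> here are illustrative, not library routines): the outer routine scales {{math|'''''C'''''}} by {{math|''β''}} once, then each partial block product is folded in with {{math|''β'' {{=}} 1}}.

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: update C in place with alpha*A*B + beta*C."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]

def blocked_gemm(alpha, A, B, beta, C, bs=2):
    """Same result, computed block by block over the shared dimension.
    C is scaled by beta exactly once up front; every partial product is
    then accumulated with beta = 1, the case implementations optimize."""
    k = len(B)
    for row in C:
        for j in range(len(row)):
            row[j] *= beta                 # apply beta exactly once
    for p0 in range(0, k, bs):             # loop over blocks of the k dimension
        Ablk = [row[p0:p0 + bs] for row in A]
        Bblk = B[p0:p0 + bs]
        gemm(alpha, Ablk, Bblk, 1.0, C)    # accumulate: beta = 1
```

In a real implementation the blocks would also be chosen to fit in cache, but the arithmetic structure is the same: after the first block, every subsequent update is a {{math|''β'' {{=}} 1}} accumulation.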
More recently, implementations by [[Kazushige Goto]] have shown that blocking only for the [[L2 cache]], combined with careful [[amortized analysis|amortizing]] of copying to contiguous memory to reduce [[translation lookaside buffer|TLB]] misses, is superior to [[Automatically Tuned Linear Algebra Software|ATLAS]].<ref name="Kazushige_2008"/> Highly tuned implementations based on these ideas are part of [[GotoBLAS]], [[OpenBLAS]] and [[BLIS (software)|BLIS]].
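The "packing" step can be shown schematically (a toy sketch in plain Python; the names <code>pack_block</code> and <code>packed_multiply</code> are illustrative and the real benefit only appears with hardware caches and a tuned microkernel): a block of {{math|'''''A'''''}} is first copied into one flat, contiguous buffer, so that the inner loops then stream through consecutive addresses instead of striding across rows of the original matrix.

```python
def pack_block(A, i0, p0, mb, kb):
    """Copy the mb-by-kb block of A starting at (i0, p0) into a single
    contiguous list, row by row."""
    buf = []
    for i in range(i0, i0 + mb):
        buf.extend(A[i][p0:p0 + kb])
    return buf

def packed_multiply(A, B, C, i0, p0, mb, kb):
    """C += Ablk * Brows, reading the A block only through its packed,
    contiguous copy (as a tuned inner kernel would)."""
    buf = pack_block(A, i0, p0, mb, kb)
    n = len(C[0])
    for i in range(mb):
        for j in range(n):
            s = 0.0
            for p in range(kb):
                s += buf[i * kb + p] * B[p0 + p][j]
            C[i0 + i][j] += s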
A common variation of {{code|gemm}} is {{code|gemm3m}}, which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to the [[Strassen algorithm]], first described by Peter Ungar.<ref>{{cite journal |last1=Van Zee |first1=Field G. |last2=Smith |first2=Tyler M. |title=Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods |journal=ACM Transactions on Mathematical Software |date=24 July 2017 |volume=44 |issue=1 |pages=1–36 |doi=10.1145/3086466|s2cid=25580883 }}</ref>
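The 3m identity itself is short: with {{math|'''''A''''' {{=}} '''''A'''''{{sub|r}} + ''i'''''A'''''{{sub|i}}}} and {{math|'''''B''''' {{=}} '''''B'''''{{sub|r}} + ''i'''''B'''''{{sub|i}}}}, the real part of the product is {{math|'''''A'''''{{sub|r}}'''''B'''''{{sub|r}} − '''''A'''''{{sub|i}}'''''B'''''{{sub|i}}}} and the imaginary part is {{math|('''''A'''''{{sub|r}} + '''''A'''''{{sub|i}})('''''B'''''{{sub|r}} + '''''B'''''{{sub|i}}) − '''''A'''''{{sub|r}}'''''B'''''{{sub|r}} − '''''A'''''{{sub|i}}'''''B'''''{{sub|i}}}}, so the two products from the real part are reused. A minimal sketch in plain Python (the helper names are illustrative, not part of any BLAS interface):

```python
def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def msub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mmul(X, Y):
    return [[sum(X[i][p] * Y[p][j] for p in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gemm3m(Ar, Ai, Br, Bi):
    """Complex product via three real multiplications and five real
    additions:  Cr = Ar*Br - Ai*Bi,
                Ci = (Ar + Ai)(Br + Bi) - Ar*Br - Ai*Bi."""
    P1 = mmul(Ar, Br)
    P2 = mmul(Ai, Bi)
    P3 = mmul(madd(Ar, Ai), madd(Br, Bi))   # additions 1 and 2
    Cr = msub(P1, P2)                        # addition 3: real part
    Ci = msub(msub(P3, P1), P2)              # additions 4 and 5: imaginary part
    return Cr, Ci
```

The three multiplications are {{math|'''''P'''''{{sub|1}}}}, {{math|'''''P'''''{{sub|2}}}}, and {{math|'''''P'''''{{sub|3}}}}; since each real matrix multiplication dominates the cost, trading one multiplication for three extra additions pays off for sufficiently large matrices.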