==Sparse BLAS==
Several extensions to BLAS for handling [[Sparse matrix|sparse matrices]] have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.<ref>{{cite journal |first1=Iain S. |last1=Duff |first2=Michael A. |last2=Heroux |first3=Roldan |last3=Pozo |title=An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum |journal= ACM Transactions on Mathematical Software|year=2002 |volume=28 |issue=2 |pages=239–267 |doi=10.1145/567806.567810|s2cid=9411006 }}</ref>
==Batched BLAS==
The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as [[GPUs]]. On these architectures the traditional BLAS functions typically provide good performance for large matrices. However, when computing the matrix–matrix products of many small matrices with the GEMM routine, they show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017.<ref name="dongarra17">{{cite journal |last1=Dongarra |first1=Jack |last2=Hammarling |first2=Sven |last3=Higham |first3=Nicholas J. |last4=Relton |first4=Samuel D. |last5=Valero-Lara |first5=Pedro |last6=Zounon |first6=Mawussi |title=The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems |journal=Procedia Computer Science |volume=108 |pages=495–504 |date=2017 |doi=10.1016/j.procs.2017.05.138}}</ref>
Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:
<math>\boldsymbol{C}[k] \leftarrow \alpha \boldsymbol{A}[k] \boldsymbol{B}[k] + \beta \boldsymbol{C}[k] \quad \forall k </math>
The index <math>k</math> in square brackets indicates that the operation is performed for all matrices <math>k</math> in a stack. Often, this operation is implemented for a strided batched memory layout, in which the matrices of the batch are stored concatenated, each at a fixed stride from the previous one, in the arrays <math>A</math>, <math>B</math> and <math>C</math>; a sketch of this layout is given below.
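As a minimal illustration of these semantics only (not of any standardized or vendor interface), the following C sketch expresses a strided batched single-precision GEMM as a loop of ordinary CBLAS calls; the function name, argument list and stride parameters are chosen here for exposition and are assumptions rather than part of the Batched BLAS specification:

<syntaxhighlight lang="c">
#include <cblas.h>

/* Illustrative strided batched SGEMM:
 *   C[k] <- alpha * A[k] * B[k] + beta * C[k]   for k = 0 .. batch_count-1
 * Matrix k of each operand starts at a fixed stride (strideA, strideB,
 * strideC) after matrix k-1, so the whole batch is stored concatenated
 * in the arrays A, B and C.  The name and signature are hypothetical. */
void sgemm_strided_batched(int m, int n, int k, float alpha,
                           const float *A, int lda, long strideA,
                           const float *B, int ldb, long strideB,
                           float beta,
                           float *C, int ldc, long strideC,
                           int batch_count)
{
    for (int i = 0; i < batch_count; ++i) {
        /* One ordinary GEMM per matrix in the stack. */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha,
                    A + i * strideA, lda,
                    B + i * strideB, ldb,
                    beta,
                    C + i * strideC, ldc);
    }
}
</syntaxhighlight>

An optimized Batched BLAS implementation performs the whole stack of small products within a single call, rather than issuing them one by one as in the loop above, which is what allows it to avoid the per-call overhead and exploit the parallelism of the hardware.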
Batched BLAS functions can be a versatile tool and allow, for example, a fast implementation of [[exponential integrators]] and [[Magnus integrators]] that handle long integration periods with many time steps.<ref name="herb21">{{cite journal |last1=Herb |first1=Konstantin |last2=Welter |first2=Pol |title=Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines |journal=Computer Physics Communications |volume=270 |pages=108181 |date=2022 |doi=10.1016/j.cpc.2021.108181 |arxiv=2108.07126}}</ref> Here, the [[matrix exponentiation]], the computationally expensive part of the integration, can be computed in parallel for all time steps by using Batched BLAS functions.
==See also==