==Sparse BLAS==
Several extensions to BLAS for handling [[Sparse matrix|sparse matrices]] have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.<ref>{{cite journal |first1=Iain S. |last1=Duff |first2=Michael A. |last2=Heroux |first3=Roldan |last3=Pozo |title=An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum |journal= ACM Transactions on Mathematical Software|year=2002 |volume=28 |issue=2 |pages=239–267 |doi=10.1145/567806.567810|s2cid=9411006 }}</ref>
==Batched BLAS==
The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as [[GPUs]]. On these architectures the traditional BLAS functions typically provide good performance for large matrices. However, when computing the matrix–matrix products of many small matrices with the GEMM routine, they show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017.<ref name="dongarra17">{{cite journal |last1=Dongarra |first1=Jack |last2=Hammarling |first2=Sven |last3=Higham |first3=Nicholas J. |last4=Relton |first4=Samuel D. |last5=Valero-Lara |first5=Pedro |last6=Zounon |first6=Mawussi |title=The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems |journal=Procedia Computer Science |volume=108 |pages=495–504 |date=2017 |doi=10.1016/j.procs.2017.05.138}}</ref>
Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:
<math>\boldsymbol{C}[k] \leftarrow \alpha \boldsymbol{A}[k] \boldsymbol{B}[k] + \beta \boldsymbol{C}[k] \quad \forall k </math>
The index <math>k</math> in square brackets indicates that the operation is performed for all matrices <math>k</math> in a stack. Often, this operation is implemented for a strided batched memory layout, in which the matrices of the batch are stored concatenated, each at a fixed stride from the previous one, in the arrays <math>A</math>, <math>B</math> and <math>C</math>; a sketch of this layout is given below.
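As a minimal illustration of these semantics only (not of any standardized or vendor interface), the following C sketch expresses a strided batched single-precision GEMM as a loop of ordinary CBLAS calls; the function name, argument list and stride parameters are chosen here for exposition and are assumptions rather than part of the Batched BLAS specification:

<syntaxhighlight lang="c">
#include <cblas.h>

/* Illustrative strided batched SGEMM:
 *   C[k] <- alpha * A[k] * B[k] + beta * C[k]   for k = 0 .. batch_count-1
 * Matrix k of each operand starts at a fixed stride (strideA, strideB,
 * strideC) after matrix k-1, so the whole batch is stored concatenated
 * in the arrays A, B and C.  The name and signature are hypothetical. */
void sgemm_strided_batched(int m, int n, int k, float alpha,
                           const float *A, int lda, long strideA,
                           const float *B, int ldb, long strideB,
                           float beta,
                           float *C, int ldc, long strideC,
                           int batch_count)
{
    for (int i = 0; i < batch_count; ++i) {
        /* One ordinary GEMM per matrix in the stack. */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha,
                    A + i * strideA, lda,
                    B + i * strideB, ldb,
                    beta,
                    C + i * strideC, ldc);
    }
}
</syntaxhighlight>

An optimized Batched BLAS implementation performs the whole stack of small products within a single call, rather than issuing them one by one as in the loop above, which is what allows it to avoid the per-call overhead and exploit the parallelism of the hardware.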
Batched BLAS functions can be a versatile tool and allow, for example, a fast implementation of [[exponential integrators]] and [[Magnus integrators]] that handle long integration periods with many time steps.<ref name="herb21">{{cite journal |last1=Herb |first1=Konstantin |last2=Welter |first2=Pol |title=Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines |journal=Computer Physics Communications |volume=270 |pages=108181 |date=2022 |doi=10.1016/j.cpc.2021.108181 |arxiv=2108.07126}}</ref> Here, the [[matrix exponentiation]], the computationally expensive part of the integration, can be computed in parallel for all time steps by using Batched BLAS functions.
==See also==