Revision as of 23:11, 2 November 2015 by Sing0512 (no edit summary) → Revision as of 23:31, 2 November 2015 by Sing0512 (section: Matrix Multiplication; tag: Visual edit)
Line 28:

Modern GPU designs are mainly based on the [[SIMD]] computation paradigm. GPU devices of this type are called [[General-purpose computing on graphics processing units|general-purpose GPUs (GPGPUs)]]. GPGPUs can perform one operation on multiple independent data elements concurrently using their vector or SIMD functional units. Because many DSP problems can be solved by [[Divide and conquer algorithms|divide-and-conquer]] algorithms, GPGPUs can readily be employed as DSP accelerators. For example, an {{math|''M'' × ''M''}} [[matrix multiplication]] can be processed by {{math|''M'' × ''M''}} concurrent threads on a GPGPU device, since no output element depends on any other. In theory, GPGPU acceleration can therefore yield a speedup of up to {{math|''M'' × ''M''}} over a traditional CPU or digital signal processor.

== Programming Languages ==

Line 35:

==== Matrix Multiplication ====

Suppose {{math|'''A'''}} and {{math|'''B'''}} are two {{math|''m'' × ''m''}} matrices and we would like to compute {{math|1='''C''' = '''A''' × '''B'''}}.

<math>\mathbf{A}=\begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ A_{21} & A_{22} & \cdots & A_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mm} \\ \end{pmatrix},\quad\mathbf{B}=\begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1m} \\ B_{21} & B_{22} & \cdots & B_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ B_{m1} & B_{m2} & \cdots & B_{mm} \\ \end{pmatrix},\quad\mathbf{C}=\mathbf{A}\times\mathbf{B}=\begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1m} \\ C_{21} & C_{22} & \cdots & C_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ C_{m1} & C_{m2} & \cdots & C_{mm} \\ \end{pmatrix}</math>

<math>C_{ij}=\sum_{k=1}^m A_{ik}B_{kj}\,</math>

Computing each element of {{math|'''C'''}} takes {{math|''m''}} multiplications and {{math|(''m'' − 1)}} additions.
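The per-element operation count can be checked with a short sequential sketch. The Python function below is only an illustration and is not part of the article; the name <code>matmul_with_counts</code> and the list-of-lists matrix representation are assumptions made for the example. It evaluates the sum for each {{math|''C''<sub>''ij''</sub>}} directly and tallies the scalar multiplications and additions it performs.

```python
def matmul_with_counts(a, b):
    """Multiply two m x m matrices (lists of lists) and count scalar operations.

    Hypothetical helper for illustration only.
    Returns (c, mults, adds).
    """
    m = len(a)
    c = [[0] * m for _ in range(m)]
    mults = 0
    adds = 0
    for i in range(m):
        for j in range(m):
            # C[i][j] reads only row i of A and column j of B, so the
            # m * m (i, j) iterations do not depend on one another.
            acc = a[i][0] * b[0][j]
            mults += 1
            for k in range(1, m):
                acc += a[i][k] * b[k][j]
                mults += 1
                adds += 1
            c[i][j] = acc
    return c, mults, adds


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, mults, adds = matmul_with_counts(a, b)
print(c)      # [[19, 22], [43, 50]]
print(mults)  # 8, i.e. m * m elements times m multiplications
print(adds)   # 4, i.e. m * m elements times (m - 1) additions
```

For {{math|1=''m'' = 2}} the sketch performs 8 multiplications and 4 additions in total. The two outer loops are the part that a GPGPU implementation could, in principle, assign to separate threads, because each iteration writes a different output element.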
Since {{math|'''C'''}} has {{math|''m''<sup>2</sup>}} elements and each one requires {{math|Θ(''m'')}} operations, a sequential CPU implementation has time complexity {{math|Θ(''m''<sup>3</sup>)}}.

==== Discrete Cosine Transform ====

Multidimensional DSP with GPU acceleration: Difference between revisions