Multidimensional DSP with GPU acceleration: Difference between revisions

Sing0512 (talk | contribs)
Line 70:
\end{pmatrix},\quad C_{ij}=\sum_{k=1}^m A_{ik}B_{kj}</math>
 
Computing each element in {{math|'''C'''}} takes {{math|''m''}} multiplications and {{math|(''m'' – ''1'')}} additions. Therefore, a CPU implementation has time complexity ''Θ(n''<sup>''3''</sup>'')'', as in the following C example. However, the elements in {{math|'''C'''}} are independent of each other, so the computation can be fully parallelized by SIMD processors, such as GPGPU devices. With a GPGPU implementation, the time complexity is reduced to ''Θ(n)'' by unrolling the for-loop, as shown in the following OpenCL example.<syntaxhighlight lang="c" line="1">
// MxM matrix multiplication in C
void matrixMul(
Line 115:
 
=== Multidimensional Convolution (M-D Convolution) ===
Convolution is a frequently used operation in DSP. Computing one output element of the 2-D convolution of two ''m'' × ''m'' signals requires {{math|''m''<sup>2</sup>}} multiplications and {{math|''m''<sup>2</sup> – 1}} additions. That is, the overall time complexity is ''Θ(n''<sup>''4''</sup>'')'' for the entire output signal. As the following OpenCL example shows, with GPGPU acceleration the total computation time effectively decreases to ''Θ(n''<sup>''2''</sup>'')'', since all output elements are data independent.
 
2-D convolution equation:
 
<math>y(n_1, n_2)=x(n_1,n_2)**h(n_1,n_2)=\sum_{k_1=0}^{m-1}\sum_{k_2=0}^{m-1}x(k_1, k_2)h(n_1-k_1, n_2-k_2)</math><syntaxhighlight lang="c++" line="1">
Line 144 ⟶ 146:
 
Note that, although the example demonstrated above is a 2-D convolution, a similar approach can be adopted for higher-dimensional systems. Overall, for an ''s''-D convolution, a GPGPU implementation has time complexity ''Θ(n''<sup>''s''</sup>'')'', whereas a CPU implementation has time complexity ''Θ(n''<sup>''2s''</sup>'')''.
 
M-D convolution equation:
 
<math>y(n_1,n_2,...,n_s)=x(n_1,n_2,...,n_s)**h(n_1,n_2,...,n_s)=\sum_{k_1=0}^{m_1-1}\sum_{k_2=0}^{m_2-1}...\sum_{k_s=0}^{m_s-1}x(k_1, k_2,...,k_s)h(n_1-k_1,n_2-k_2,...,n_s-k_s)</math>
 
=== Multidimensional Discrete Time Fourier Transform (M-D DTFT) ===
In addition to convolution, the [[Fourier transform|discrete-time Fourier transform (DTFT)]] is another technique often used in system analysis.
 
<math>X(\Omega_1,\Omega_2,...,\Omega_s)=\sum_{n_1=0}^{m_1-1}\sum_{n_2=0}^{m_2-1}...\sum_{n_s=0}^{m_s-1}x(n_1, n_2,...,n_s)e^{-j(\Omega_1n_1+\Omega_2n_2+...+\Omega_sn_s)}</math>
 
Practically, to implement an M-D DTFT, assuming the system is separable, we can perform ''s'' 1-D DTFTs together with matrix transposes, one along each dimension. With a 1-D DTFT operation, GPGPU can conceptually reduce the complexity from ''Θ(n''<sup>''2''</sup>'')'' to ''Θ(n)'', as illustrated by the following OpenCL implementation. That is, an M-D DTFT can be computed on a GPU with a complexity of ''Θ(n''<sup>''2''</sup>'')''. While some GPGPUs are also equipped with internal hardware FFT accelerators, this implementation might also be optimized by invoking the FFT APIs or libraries provided by GPU manufacturers.<ref>{{cite web|title = OpenCL™ Optimization Case Study Fast Fourier Transform – Part II – AMD|url = http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-fast-fourier-transform-part-ii/|website = AMD|accessdate = 2015-11-05|language = en-US}}</ref><syntaxhighlight lang="c++" line="1">
// DTFT in OpenCL
__kernel void convolution(