Multidimensional DSP with GPU acceleration

[[File:SIMD GPGPU.jpg|alt=Figure illustrating a SIMD/vector computation unit in GPGPUs.|thumb|GPGPU/SIMD computation model.]]
 
Modern GPU designs are mainly based on the [[SIMD]] (Single Instruction, Multiple Data) computation paradigm.<ref>{{cite journal|title = NVIDIA Tesla: A Unified Graphics and Computing Architecture|journal = IEEE Micro|date = 2008-03-01|issn = 0272-1732|pages = 39–55|volume = 28|issue = 2|doi = 10.1109/MM.2008.31|first = E.|last = Lindholm|first2 = J.|last2 = Nickolls|first3 = S.|last3 = Oberman|first4 = J.|last4 = Montrym}}</ref><ref>{{cite book|title = Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)|last = Kim|first = Hyesoon|publisher = Morgan & Claypool Publishers|year = 2012|isbn = 978-1-60845-954-4|last2 = Vuduc|first2 = Richard|last3 = Baghsorkhi|first3 = Sara|last4 = Choi|first4 = Jee|last5 = Hwu|first5 = Wen-Mei W.|editor-last = Hill|editor-first = Mark D.|doi = 10.2200/S00451ED1V01Y201209CAC020}}</ref> Such GPU devices are commonly referred to as [[General-purpose computing on graphics processing units|general-purpose GPUs (GPGPUs)]].
 
GPGPUs can apply an operation to many independent data elements concurrently using their vector or SIMD functional units. A modern GPGPU can spawn thousands of concurrent threads and process them in batches. Because many DSP problems can be solved by [[Divide and conquer algorithms|divide-and-conquer]] algorithms, GPGPUs are readily employed as DSP accelerators: a large and complex DSP problem can be divided into many small numeric subproblems that are processed at the same time, so the overall computation time is reduced significantly. For example, multiplying two {{math|''M'' × ''M''}} matrices can be processed by {{math|''M'' × ''M''}} concurrent threads on a GPGPU device without any output data dependency. Therefore, theoretically, GPGPU acceleration can provide up to an {{math|''M'' × ''M''}} speedup compared with a traditional CPU or digital signal processor.
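As a minimal sketch of this idea (the kernel name <code>matmul</code> and the row-major matrix layout are assumptions for illustration), the following OpenCL kernel assigns one work-item to each output element, so an {{math|''M'' × ''M''}} global work size computes the whole product concurrently:
<syntaxhighlight lang="c++">
// Sketch: C = A * B for M x M row-major matrices.
// Each of the M*M work-items computes exactly one output element,
// so there is no dependency between work-items.
__kernel void matmul(__global const float *A,
                     __global const float *B,
                     __global float *C,
                     const int M)
{
    int row = get_global_id(0);   // output row handled by this work-item
    int col = get_global_id(1);   // output column handled by this work-item
    if (row >= M || col >= M) return;

    float acc = 0.0f;
    for (int k = 0; k < M; ++k)
        acc += A[row * M + k] * B[k * M + col];
    C[row * M + col] = acc;
}
</syntaxhighlight>
Each work-item still performs {{math|''M''}} multiply–accumulate operations, which is where the theoretical {{math|''M'' × ''M''}} speedup over a single sequential processor comes from.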
 
=== CUDA ===
[[CUDA]] is the standard interface for programming [[Nvidia|NVIDIA]] GPUs. NVIDIA also provides many CUDA libraries to support DSP acceleration on NVIDIA GPU devices.<ref>{{cite web|title = Parallel Programming and Computing Platform {{!}} CUDA {{!}} NVIDIA {{!}} NVIDIA|url = http://www.nvidia.com/object/cuda_home_new.html|website = www.nvidia.com|access-date = 2015-11-05|archive-url = https://web.archive.org/web/20140106051908/http://www.nvidia.com/object/cuda_home_new.html|archive-date = 2014-01-06|url-status = dead}}</ref>
 
=== OpenCL ===
[[OpenCL]] is an open industry standard that was originally proposed by [[Apple Inc.]] and is now maintained and developed by the [[Khronos Group]].<ref>{{cite web|title = OpenCL – The open standard for parallel programming of heterogeneous systems|url = https://www.khronos.org/opencl/|website = www.khronos.org|access-date = 2015-11-05}}</ref> OpenCL provides [[C++]]-like [[Application programming interface|APIs]] for programming different kinds of devices in a uniform way, including GPGPUs.
[[File:OpenCL execution flow rev.jpg|alt=Illustrating the execution flow of an OpenCL program/kernel|thumb|474x474px|OpenCL program execution flow]]
The figure illustrates the execution flow of launching an OpenCL program on a GPU device. The CPU first detects the available OpenCL devices (a GPU in this case) and then invokes a just-in-time compiler to translate the OpenCL source code into the target binary. The CPU then sends data to the GPU to perform the computation. While the GPU is processing data, the CPU is free to work on its own tasks.
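A minimal host-side sketch of this flow (assuming the OpenCL 1.x C API, a trivial embedded kernel named <code>scale</code>, a buffer of 1024 floats, and no error checking) is:
<syntaxhighlight lang="c++">
// Minimal OpenCL host program following the execution flow described above.
#include <stdio.h>
#include <CL/cl.h>   // typical header on Linux/Windows OpenCL SDKs

int main(void)
{
    // 1. Detect an OpenCL platform and a GPU device.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // 2. Create a context and a command queue for the device.
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    // 3. Just-in-time compile the OpenCL C source into a device binary.
    const char *source =
        "__kernel void scale(__global float *buf, float a)"
        "{ buf[get_global_id(0)] *= a; }";
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "scale", NULL);

    // 4. Send input data from the CPU to the GPU.
    float data[1024];
    for (int i = 0; i < 1024; ++i) data[i] = (float)i;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    // 5. Launch the kernel; the CPU may continue with other work while it runs.
    float a = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &a);
    size_t global = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // 6. Read the results back to host memory (blocking call).
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    printf("data[1] = %f\n", data[1]);

    // 7. Release OpenCL resources.
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
</syntaxhighlight>
Steps 1–4 run on the CPU; the call to <code>clEnqueueNDRangeKernel</code> in step 5 is asynchronous, so the CPU stays free until the blocking read in step 6 synchronizes with the GPU.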
 
=== C++ AMP ===
[[C++ AMP]] is a programming model proposed by [[Microsoft]]. C++ AMP is a [[C++]]-based library designed for programming SIMD processors.<ref>{{cite web|title = C++ AMP (C++ Accelerated Massive Parallelism)|url = https://msdn.microsoft.com/en-us/library/hh265137.aspx|website = msdn.microsoft.com|access-date = 2015-11-05}}</ref>
 
=== OpenACC ===
[[OpenACC]] is a programming standard for parallel computing developed by [[Cray]], CAPS, [[Nvidia|NVIDIA]], and PGI.<ref>{{cite web|title = OpenACC Home {{!}} www.openacc.org|url = http://www.openacc.org/|website = www.openacc.org|access-date = 2015-11-05}}</ref> OpenACC targets heterogeneous CPU/GPU systems through [[C (programming language)|C]], [[C++]], and [[Fortran]] language extensions.
 
== Examples of GPU programming for multidimensional DSP ==
<math>X(\Omega_1,\Omega_2,...,\Omega_s)=\sum_{n_1=0}^{m_1-1}\sum_{n_2=0}^{m_2-1}...\sum_{n_s=0}^{m_s-1}x(n_1, n_2,...,n_s)e^{-j(\Omega_1n_1+\Omega_2n_2+...+\Omega_sn_s)}</math>
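This transform is separable across dimensions. For example, in the two-dimensional case ({{math|1=''s'' = 2}}) the double sum factors into nested 1-D DTFTs, one along each dimension:

<math>X(\Omega_1,\Omega_2)=\sum_{n_1=0}^{m_1-1}e^{-j\Omega_1n_1}\left[\sum_{n_2=0}^{m_2-1}x(n_1,n_2)e^{-j\Omega_2n_2}\right]</math>

The inner sum is a 1-D DTFT along {{math|''n''<sub>2</sub>}} for each fixed {{math|''n''<sub>1</sub>}}, and the outer sum is a 1-D DTFT along {{math|''n''<sub>1</sub>}} of the intermediate result, which justifies the row–column implementation described below.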
 
In practice, an M-D DTFT can be implemented by performing M passes of 1-D DTFTs, one along each dimension, together with a matrix transpose between passes. For a 1-D DTFT of length {{math|''n''}}, a GPGPU can conceptually reduce the time complexity from {{math|Θ(''n''<sup>2</sup>)}} to {{math|Θ(''n'')}} by computing all output samples concurrently, as illustrated by the following OpenCL example, and the same speedup applies to each of the M passes of the M-D transform. Since some GPGPUs are also equipped with internal hardware FFT accelerators, this implementation may be further optimized by invoking the FFT APIs or libraries provided by the GPU manufacturer.<ref>{{cite web|title = OpenCL™ Optimization Case Study Fast Fourier Transform – Part II – AMD|url = http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-fast-fourier-transform-part-ii/|website = AMD|access-date = 2015-11-05|language = en-US}}</ref><syntaxhighlight lang="c++" line="1">
// DTFT in OpenCL
// Sketch: each work-item evaluates one output frequency sample of a
// length-n 1-D DTFT (sampled at n uniformly spaced frequencies), so
// n work-items evaluate the whole transform concurrently.
__kernel void convolution(
    __global const float *x_re,  // real part of the input sequence
    __global const float *x_im,  // imaginary part of the input sequence
    __global float *X_re,        // real part of the output spectrum
    __global float *X_im,        // imaginary part of the output spectrum
    const int n)                 // sequence length
{
    int k = get_global_id(0);    // frequency index handled by this work-item
    if (k >= n) return;

    float sum_re = 0.0f;
    float sum_im = 0.0f;
    for (int m = 0; m < n; ++m) {
        // complex multiply-accumulate: X[k] += x[m] * exp(-j*2*pi*k*m/n)
        float w = -2.0f * M_PI_F * (float)k * (float)m / (float)n;
        float c = cos(w);
        float s = sin(w);
        sum_re += x_re[m] * c - x_im[m] * s;
        sum_im += x_re[m] * s + x_im[m] * c;
    }
    X_re[k] = sum_re;
    X_im[k] = sum_im;
}
</syntaxhighlight>
 
=== Digital filter design ===
Designing multidimensional digital filters, especially [[Infinite impulse response|IIR]] filters, is a significant challenge. It typically relies on computers to solve difference equations and obtain a set of approximate solutions. As GPGPU computation has become more popular, several adaptive algorithms have been proposed for designing multidimensional [[Finite impulse response|FIR]] and/or [[Infinite impulse response|IIR]] filters by means of GPGPUs.<ref>{{cite book|title = GPU-efficient Recursive Filtering and Summed-area Tables|publisher = ACM|journal = Proceedings of the 2011 SIGGRAPH Asia Conference|date = 2011-01-01|___location = New York, NY, USA|isbn = 978-1-4503-0807-6|pages = 176:1–176:12|series = SA '11|doi = 10.1145/2024156.2024210|first = Diego|last = Nehab|first2 = André|last2 = Maximo|first3 = Rodolfo S.|last3 = Lima|first4 = Hugues|last4 = Hoppe}}</ref><ref>{{cite book|title = GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation|last = Pharr|first = Matt|publisher = Pearson Addison Wesley|year = 2005|isbn = 978-0-321-33559-3|last2 = Fernando|first2 = Randima}}</ref><ref>{{cite book|title = GPU Computing Gems Emerald Edition|last = Hwu|first = Wen-mei W.|publisher = Morgan Kaufmann Publishers Inc.|year = 2011|isbn = 978-0-12-385963-1|___location = San Francisco, CA, USA}}</ref>
 
=== Radar signal reconstruction and analysis ===
 
=== Medical image processing ===
To support accurate diagnosis, 2-D and 3-D medical imaging modalities such as [[ultrasound]], [[X-ray]], [[Magnetic resonance imaging|MRI]], and [[CT scan|CT]] often require very high sampling rates and image resolutions to reconstruct images. By applying the computational power of GPGPUs, it has been shown that higher-quality medical images can be acquired.<ref>{{cite web|title = Medical Imaging{{!}}NVIDIA|url = http://www.nvidia.com/object/medical_imaging.html|website = www.nvidia.com|access-date = 2015-11-07}}</ref><ref>{{cite book|title = GPU-based Volume Rendering for Medical Image Visualization|journal = Engineering in Medicine and Biology Society, 2005. IEEE-EMBS 2005. 27th Annual International Conference of the|volume = 5|date = 2005-01-01|pages = 5145–5148|doi = 10.1109/IEMBS.2005.1615635|pmid = 17281405|first = Yang|last = Heng|first2 = Lixu|last2 = Gu|isbn = 978-0-7803-8741-6}}</ref>
 
== References ==