{{Orphan|date=November 2015}}
Multidimensional Digital Signal Processing (MDSP) refers to the extension of [[Digital signal processing]] (DSP) techniques to signals that vary in more than one dimension. While conventional DSP typically deals with one-dimensional data, such as time-varying [[Audio signal|audio signals]], MDSP involves processing signals in two or more dimensions. Many of the principles from one-dimensional DSP, such as [[Fourier transform|Fourier transforms]] and [[filter design]], have analogous counterparts in multidimensional signal processing.
Modern [[general-purpose computing on graphics processing units]] (GPGPUs) offer excellent throughput on vector operations and numeric manipulations through a high degree of parallelism. Because processing digital signals, particularly multidimensional signals, often involves a series of vector operations on massive numbers of independent data samples, GPGPUs are now widely employed to accelerate multidimensional DSP in applications such as [[image processing]], [[Video processing|video codecs]], [[Radar signal characteristics|radar signal analysis]], [[sonar signal processing]], and [[ultrasound scan]]ning. Conceptually, GPGPUs can dramatically reduce computation time compared with [[Central processing unit|central processing units (CPUs)]], [[Digital signal processor]]s (DSPs), or other [[Field-programmable gate array|FPGA]] accelerators.
==Motivation==
==GPGPU computations==
[[File:SIMD GPGPU.jpg|alt=Figure illustrating a SIMD/vector computation unit in GPGPUs.|thumb|GPGPU/SIMD computation model]]
Modern GPU designs are mainly based on the [[Single instruction, multiple data|SIMD]] (Single Instruction Multiple Data) computation paradigm.<ref>{{cite journal|title=NVIDIA Tesla: A Unified Graphics and Computing Architecture|journal=IEEE Micro|date=2008-03-01|issn=0272-1732|pages=39–55|volume=28|issue=2|doi=10.1109/MM.2008.31|first1=E.|last1=Lindholm|first2=J.|last2=Nickolls|first3=S.|last3=Oberman|first4=J.|last4=Montrym}}</ref>
GPGPUs are able to perform an operation on many independent data elements concurrently with their vector or SIMD functional units. A modern GPGPU can spawn thousands of concurrent threads and process all of them in a batch manner. Because many DSP problems can be solved by [[Divide and conquer algorithms|divide-and-conquer]] algorithms, GPGPUs can easily be employed as DSP accelerators: a large and complex DSP problem can be divided into many small numeric subproblems that are processed at the same time, so that the overall time complexity is reduced significantly. For example, multiplying two {{math|''M'' × ''M''}} matrices can be processed by {{math|''M'' × ''M''}} concurrent threads on a GPGPU device without any output data dependency. Therefore, theoretically, by means of GPGPU acceleration, we can gain up to {{math|''M'' × ''M''}} speedup compared with a traditional CPU or digital signal processor.
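As an illustration of this one-thread-per-output mapping, a minimal OpenCL kernel might look as follows; the kernel and argument names and the row-major data layout are assumptions for the sketch rather than a particular vendor implementation.
<syntaxhighlight lang="c">
// One work-item per output element: work-item (i, j) computes C[i][j].
// A, B, and C are M x M matrices stored in row-major order.
__kernel void matrixMulKernel(__global const float *A,
                              __global const float *B,
                              __global float *C,
                              const int M)
{
    int i = get_global_id(0);            // row index of this work-item
    int j = get_global_id(1);            // column index of this work-item
    float sum = 0.0f;
    for (int k = 0; k < M; k++)          // dot product of row i of A and column j of B
        sum += A[i * M + k] * B[k * M + j];
    C[i * M + j] = sum;
}
</syntaxhighlight>
When launched with an {{math|''M'' × ''M''}} global work size, every output element is computed by its own work-item with no data dependency between them, which is exactly the data-parallel structure described above.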
[[OpenCL]] is an industry standard that was originally proposed by [[Apple Inc.]] and is now maintained and developed by the [[Khronos Group]].<ref>{{cite web|title=OpenCL – The open standard for parallel programming of heterogeneous systems|url=https://www.khronos.org/opencl/|website=www.khronos.org|date=21 July 2013|access-date=2015-11-05|language=en}}</ref> OpenCL provides [[C++]]-like [[Application programming interface|APIs]] for programming different devices in a uniform way, including GPGPUs.
[[File:OpenCL execution flow rev.jpg|alt=Illustrating the execution flow of an OpenCL program/kernel|thumb|474x474px|OpenCL program execution flow]]
The following figure illustrates the execution flow of launching an OpenCL program on a GPU device. The CPU first detects the available OpenCL devices (a GPU in this case), compiles the kernel source code for the target device, and copies the input data from host memory to device memory. It then launches the kernel on the GPU and, once the computation has finished, copies the results back to host memory.
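The same flow can be written down in host code. The following sketch is a minimal, error-handling-free outline of the standard OpenCL host API calls; the kernel name <code>my_kernel</code>, the buffer layout, and the variable names are illustrative assumptions.
<syntaxhighlight lang="c">
// Minimal OpenCL host-side flow: discover a device, build the kernel,
// move data to the GPU, launch the kernel, and read the result back.
#include <CL/cl.h>

void run_kernel(const char *src, const float *in, float *out, size_t n)
{
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);                            // 1. find an OpenCL platform
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  //    and a GPU device on it

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);              // 2. compile the kernel source
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);      //    "my_kernel" is illustrative

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n * sizeof(float), in, 0, NULL, NULL); // 3. host -> device

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);             // 4. launch n work-items

    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL); // 5. device -> host
    clFinish(queue);
}
</syntaxhighlight>
Resource-release calls such as <code>clReleaseMemObject</code> and <code>clReleaseKernel</code> are omitted for brevity.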
===C++ AMP===
<math>\mathbf{C}=\mathbf{A}\mathbf{B},\qquad C_{ij}=\sum_{k=1}^m A_{ik}B_{kj}</math>
To compute each element in {{math|'''C'''}} takes {{math|''m''}} multiplications and {{math|(''m'' – 1)}} additions. Therefore, with a CPU implementation, the time complexity of this computation is ''Θ''(''n''<sup>3</sup>), as in the C implementation below.
<syntaxhighlight lang="c">
// M x M matrix multiplication in C: A, B, and C are stored in row-major order.
void matrixMul(const float *A, const float *B, float *C, int M)
{
    for (int i = 0; i < M; i++) {        // row of C
        for (int j = 0; j < M; j++) {    // column of C
            float sum = 0.0f;
            for (int k = 0; k < M; k++)  // dot product of row i and column j
                sum += A[i * M + k] * B[k * M + j];
            C[i * M + j] = sum;
        }
    }
}
</syntaxhighlight>
===Multidimensional convolution===
Convolution is a frequently used operation in DSP. To compute the 2-D convolution of two {{math|''m'' × ''m''}} signals, it requires {{math|''m''<sup>2</sup>}} multiplications and {{math|''m'' × (''m'' – 1)}} additions for each output element. That is, the overall time complexity is ''Θ''(''n''<sup>4</sup>) with a CPU implementation, whereas a GPGPU implementation that assigns one thread to each output element reduces it to ''Θ''(''n''<sup>2</sup>).
2-D convolution equation:
<math>y(n_1,n_2)=\sum_{k_1=0}^{m-1}\sum_{k_2=0}^{m-1}h(k_1,k_2)\,x(n_1-k_1,n_2-k_2)</math>
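A GPGPU implementation can assign one work-item to each output sample. The following OpenCL kernel is a minimal sketch of that idea, assuming an ''n'' × ''n'' input stored in row-major order, an ''m'' × ''m'' filter, zero padding outside the input, and that only the ''n'' × ''n'' core of the output is produced; all names are illustrative.
<syntaxhighlight lang="c">
// 2-D convolution: work-item (i, j) computes output sample y(i, j).
// x is an n x n input, h is an m x m filter, both row-major; samples
// outside the input are treated as zero.
__kernel void convolution2d(__global const float *x,
                            __global const float *h,
                            __global float *y,
                            const int n,
                            const int m)
{
    int i = get_global_id(0);              // output row
    int j = get_global_id(1);              // output column
    float sum = 0.0f;
    for (int k1 = 0; k1 < m; k1++) {
        for (int k2 = 0; k2 < m; k2++) {
            int r = i - k1;                // input row index
            int c = j - k2;                // input column index
            if (r >= 0 && r < n && c >= 0 && c < n)
                sum += h[k1 * m + k2] * x[r * n + c];
        }
    }
    y[i * n + j] = sum;
}
</syntaxhighlight>
Each work-item performs the ''m''<sup>2</sup> multiply–accumulate operations for its own output sample, so all output samples are computed in parallel.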
Note that, although the example demonstrated above is a 2-D convolution, a similar approach can be adopted for higher-dimensional systems. Overall, for an ''s''-D convolution, a GPGPU implementation has time complexity ''Θ''(''n''<sup>''s''</sup>), since each work-item performs the ''m''<sup>''s''</sup> multiply–accumulate operations of a single output sample, whereas a sequential CPU implementation takes ''Θ''(''n''<sup>2''s''</sup>).
M-D convolution equation:
<math>y(n_1,n_2,\ldots,n_s)=\sum_{k_1}\sum_{k_2}\cdots\sum_{k_s}h(k_1,k_2,\ldots,k_s)\,x(n_1-k_1,n_2-k_2,\ldots,n_s-k_s)</math>
===Multidimensional discrete-time Fourier transform===
The multidimensional discrete-time Fourier transform (DTFT) of an ''s''-D signal of size {{math|''m''<sub>1</sub> × ''m''<sub>2</sub> × ... × ''m''<sub>''s''</sub>}} is
<math>X(\Omega_1,\Omega_2,\ldots,\Omega_s)=\sum_{n_1=0}^{m_1-1}\sum_{n_2=0}^{m_2-1}\cdots\sum_{n_s=0}^{m_s-1}x(n_1, n_2,\ldots,n_s)e^{-j(\Omega_1n_1+\Omega_2n_2+\cdots+\Omega_sn_s)}</math>
Practically, to implement an M-D DTFT, we can perform ''M'' passes of 1-D DTFTs together with matrix transposes, one pass for each dimension. With a 1-D DTFT operation, a GPGPU can conceptually reduce the complexity from ''Θ''(''n''<sup>2</sup>) to ''Θ''(''n''), since each frequency sample can be computed by an independent thread.
<syntaxhighlight lang="c">
// 1-D DTFT in OpenCL: each work-item computes one frequency sample.
__kernel void dtft(__global const float *x,   // input samples
                   __global float *re,        // real part of the spectrum
                   __global float *im,        // imaginary part of the spectrum
                   const int n)               // number of samples
{
    int k = get_global_id(0);                 // frequency index of this work-item
    float omega = 2.0f * M_PI_F * k / n;      // sampled frequency
    float sum_re = 0.0f, sum_im = 0.0f;
    for (int t = 0; t < n; t++) {             // accumulate x(t) * e^{-j*omega*t}
        sum_re += x[t] * cos(omega * t);
        sum_im -= x[t] * sin(omega * t);
    }
    re[k] = sum_re;
    im[k] = sum_im;
}
</syntaxhighlight>
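To illustrate the transform-and-transpose (row–column) procedure described above, the following CPU reference sketch computes a 2-D transform by applying a 1-D DFT along each dimension in turn; sampling the DTFT at ''n'' uniformly spaced frequencies per dimension and the function names are illustrative assumptions, and the transpose is realized implicitly through strided indexing.
<syntaxhighlight lang="c">
// Row-column decomposition of a 2-D DFT on the CPU: a 1-D DFT is applied
// along every row and then along every column, which realizes the
// "1-D transform per dimension" procedure described in the text.
#include <complex.h>
#include <math.h>

// 1-D DFT of n complex samples read/written with the given stride.
static void dft_1d(const double complex *in, double complex *out,
                   int n, int stride)
{
    const double pi = acos(-1.0);
    for (int k = 0; k < n; k++) {
        double complex acc = 0;
        for (int t = 0; t < n; t++)
            acc += in[t * stride] * cexp(-I * 2.0 * pi * k * t / n);
        out[k * stride] = acc;
    }
}

// 2-D DFT of an n x n signal stored in row-major order in `data`;
// `tmp` is an n x n scratch buffer. The result is returned in `data`.
void dft_2d(double complex *data, double complex *tmp, int n)
{
    for (int r = 0; r < n; r++)        // transform every row
        dft_1d(data + r * n, tmp + r * n, n, 1);
    for (int c = 0; c < n; c++)        // transform every column of the row result
        dft_1d(tmp + c, data + c, n, n);
}
</syntaxhighlight>
On a GPGPU, the frequency samples of each 1-D transform can be computed by independent work-items, as in the kernel above, which is where the per-transform reduction from ''Θ''(''n''<sup>2</sup>) to ''Θ''(''n'') comes from.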
===Radar signal reconstruction and analysis===
Radar systems usually need to reconstruct numerous 3-D or 4-D data samples in real time. Traditionally, particularly in the military, this required the support of supercomputers. Nowadays, GPGPUs are also employed in place of supercomputers to process radar signals. For example, processing [[Synthetic aperture radar|synthetic aperture radar (SAR)]] signals usually involves multidimensional [[Fast Fourier transform|FFT]] computations.<ref>{{cite book|date=2009-10-01|pages=309–314|doi=10.1109/SIPS.2009.5336272|first1=C.|last1=Clemente|first2=M.|last2=Di Bisceglie|first3=M.|last3=Di Santo|first4=N.|last4=Ranaldo|first5=M.|last5=Spinelli|title=2009 IEEE Workshop on Signal Processing Systems|chapter=Processing of synthetic Aperture Radar data with GPGPU|isbn=978-1-4244-4335-2|s2cid=18932083}}</ref>
===Self-driving cars===