Multidimensional DSP with GPU acceleration

Digital signal processing (DSP) is a ubiquitous methodology in scientific and engineering computations. In practice, however, DSP problems are often not limited to 1-D signals. For instance, image data are 2-D signals and radar signals are 3-D signals. As the number of dimensions increases, the time and/or storage complexity of processing digital signals grows dramatically. Therefore, solving multidimensional DSP problems in real time is extremely difficult.

Modern general-purpose graphics processing units (GPGPUs) are considered to have excellent throughput on vector operations and numeric manipulations thanks to a high degree of parallel computation. Because processing digital signals, particularly multidimensional signals, often involves a series of vector operations on a massive number of independent data samples, GPGPUs are now widely employed to accelerate multidimensional DSP, such as image processing, video codecs, radar signal analysis, sonar signal processing, and ultrasound scanning. Conceptually, using GPGPU devices to perform multidimensional DSP can dramatically reduce the computation time compared with central processing units (CPUs), digital signal processors (DSPs), or FPGA accelerators.

Motivation

Processing multidimensional signals is a common problem in scientific research and engineering computations. Because of its high time and storage complexity, however, it is extremely difficult to process multidimensional signals in real time. In general, the computation complexity of multidimensional DSP grows exponentially with the number of dimensions. Therefore, it is still hard to obtain the computation results in time with digital signal processors (DSPs) alone. Hence, a better software algorithm or hardware architecture for accelerating multidimensional DSP computations is strongly needed.

Existing Approaches

Several approaches to accelerating multidimensional DSP have been proposed and developed in the past decades.

Lower Sampling Rate

Using a lower sampling rate reduces the number of samples to be processed at one time and thereby decreases the computation complexity. However, if the sampling rate falls below the limit given by the sampling theorem, aliasing occurs and the quality of the output degrades. Some applications, such as military radars, require highly precise and accurate results. In such cases, using a lower sampling rate in multidimensional DSP is not acceptable.
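The aliasing effect can be illustrated with a minimal C sketch. The helper name `decimate` is hypothetical and chosen only for this illustration; it simply keeps every `factor`-th sample and applies no anti-aliasing filter, which is an assumption made to keep the effect visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: keep every `factor`-th sample of `in`.
 * No anti-aliasing filter is applied before dropping samples.
 * `out` must have room for at least (n + factor - 1) / factor samples.
 * Returns the number of samples written to `out`. */
size_t decimate(const int *in, size_t n, size_t factor, int *out)
{
    assert(factor > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i += factor) {
        out[count++] = in[i];
    }
    return count;
}
```

If the input is the fastest-varying discrete signal 1, -1, 1, -1, ... and `factor` is 2, every retained sample equals 1, so the decimated signal cannot be distinguished from a constant one. The high-frequency content has been aliased to zero frequency.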

Digital Signal Processors (DSPs)

Digital signal processors are designed specifically to process vector operations, and they are widely used in DSP computations. However, most digital signal processors are only capable of executing two operations in parallel. This kind of design is sufficient to accelerate audio processing (1-D signals) and image processing (2-D signals). With the large number of data samples in higher-dimensional signals, however, it is still not efficient enough to obtain computation results in real time.

Adopting Supercomputers

In some situations, such as weather forecasting, dedicated supercomputers or cluster computers are used to accelerate multidimensional DSP computations. However, dedicating supercomputers to DSP operations alone incurs considerable cost and energy consumption, so this approach is not suitable for all multidimensional DSP applications.

GPU Acceleration

GPUs were originally designed to accelerate image processing and video rendering. Because modern GPUs can perform numeric computations in parallel at relatively low cost and with better energy efficiency, they are becoming a popular alternative to supercomputers for multidimensional DSP.

GPGPU Computations

Modern GPU designs are mainly based on the SIMD (single instruction, multiple data) computation paradigm. GPU devices of this type are called general-purpose GPUs (GPGPUs).

GPGPUs are able to perform an operation on multiple independent data elements concurrently with their vector or SIMD functional units. Because many DSP problems can be solved by divide-and-conquer algorithms, GPGPUs can readily be employed as DSP accelerators. For example, multiplying two $M \times M$ matrices can be processed by $M \times M$ concurrent threads on a GPGPU device without any output data dependency. Therefore, theoretically, GPGPU acceleration can yield up to an $M \times M$ speedup compared with a traditional CPU or digital signal processor.
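The data-parallel pattern described above can be sketched in plain C. This is only an illustration under stated assumptions: the function names `scale_add_item` and `scale_add` are hypothetical, and the loop is a sequential stand-in for launching one GPU thread per index. On a real device, each call to the per-item function would run concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one hypothetical GPU thread: it reads and writes only
 * element `i`, so no thread depends on the output of another. */
static void scale_add_item(size_t i, float a,
                           const float *x, const float *y, float *out)
{
    out[i] = a * x[i] + y[i];
}

/* Sequential stand-in for launching `n` independent threads.
 * Because iterations are independent, they could run in any order
 * or all at once. */
void scale_add(size_t n, float a,
               const float *x, const float *y, float *out)
{
    assert(x != NULL && y != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        scale_add_item(i, a, x, y, out);
    }
}
```

The design choice worth noting is that the per-item function receives its own index and touches no shared output location, which is the property that allows the work to be distributed across SIMD lanes or threads.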

Programming Languages

Several programming languages currently support GPGPU programming.

CUDA

CUDA is the standard interface to program NVIDIA GPUs. NVIDIA also provides many CUDA libraries to support DSP acceleration on NVIDIA GPU devices.

OpenCL

OpenCL is an industry standard that was originally proposed by Apple Inc. and is now maintained and developed by the Khronos Group. OpenCL provides C-like APIs for programming different devices, including GPGPUs, in a uniform way.

Examples

Matrix Multiplication

Suppose $A$ and $B$ are two $m \times m$ matrices and we would like to compute $C = A \times B$.

$\mathbf {A} ={\begin{pmatrix}A_{11}&A_{12}&\cdots &A_{1m}\\A_{21}&A_{22}&\cdots &A_{2m}\\\vdots &\vdots &\ddots &\vdots \\A_{m1}&A_{m2}&\cdots &A_{mm}\\\end{pmatrix}},\quad \mathbf {B} ={\begin{pmatrix}B_{11}&B_{12}&\cdots &B_{1m}\\B_{21}&B_{22}&\cdots &B_{2m}\\\vdots &\vdots &\ddots &\vdots \\B_{m1}&B_{m2}&\cdots &B_{mm}\\\end{pmatrix}}$

$\mathbf {C} =\mathbf {A} \times \mathbf {B} ={\begin{pmatrix}C_{11}&C_{12}&\cdots &C_{1m}\\C_{21}&C_{22}&\cdots &C_{2m}\\\vdots &\vdots &\ddots &\vdots \\C_{m1}&C_{m2}&\cdots &C_{mm}\\\end{pmatrix}},\quad C_{ij}=\sum _{k=1}^{m}A_{ik}B_{kj}$

Computing each element of $C$ takes $m$ multiplications and $(m - 1)$ additions. Therefore, a sequential CPU implementation that computes all $m^2$ elements one after another has time complexity $\Theta(m^3)$. However, the elements of $C$ are independent of each other, so the computation can be fully parallelized on SIMD processors such as GPGPU devices. With a GPGPU implementation that assigns one thread to each output element, and assuming enough parallel threads are available, the time complexity reduces to $\Theta(m)$.
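A minimal C sketch of this decomposition follows. It is an illustration rather than a reference implementation: the names `matmul_element` and `matmul` are hypothetical, and the matrices are assumed to be stored in row-major order as flat arrays. The inner function is the $\Theta(m)$ work that a single GPGPU thread would perform. The two outer loops are the part a GPGPU would replace with $m \times m$ concurrent threads.

```c
#include <assert.h>
#include <stddef.h>

/* Work for one output element C[i][j]:
 * m multiplications and m - 1 additions. */
static float matmul_element(const float *A, const float *B,
                            size_t m, size_t i, size_t j)
{
    float sum = A[i * m] * B[j];            /* k = 0 term */
    for (size_t k = 1; k < m; k++) {
        sum += A[i * m + k] * B[k * m + j];
    }
    return sum;
}

/* Sequential CPU version: m * m independent elements, Theta(m^3) total.
 * A, B, and C are m-by-m matrices in row-major order (an assumption). */
void matmul(const float *A, const float *B, float *C, size_t m)
{
    assert(m > 0);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < m; j++) {
            C[i * m + j] = matmul_element(A, B, m, i, j);
        }
    }
}
```

For example, with $A = \begin{pmatrix}1&2\\3&4\end{pmatrix}$ and $B = \begin{pmatrix}5&6\\7&8\end{pmatrix}$, the result is $C = \begin{pmatrix}19&22\\43&50\end{pmatrix}$.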

Multidimensional Convolution (M-D Convolution)

Convolution is a frequently used operation in DSP. Computing the convolution of two $m \times m$ signals requires $m^2$ multiplications and $m^2 - 1$ additions for each output element. That is, the overall time complexity is $\Theta(m^4)$ for the entire $m \times m$ output signal. With GPGPU acceleration, the total computation time effectively decreases to $\Theta(m^2)$, since all output elements are data independent and can be computed concurrently.

$y(n_{1},n_{2})=\sum _{k_{1}=0}^{m-1}\sum _{k_{2}=0}^{m-1}x(k_{1},k_{2})h(n_{1}-k_{1},n_{2}-k_{2})$
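The formula above can be sketched in C as follows. Two assumptions are made that the text does not state: indices of $h$ that fall outside $0 \ldots m-1$ wrap around (circular convolution), so that every output element uses exactly $m^2$ products, and both signals are stored in row-major order. The names `conv2d_element` and `conv2d` are hypothetical. The inner function is the $\Theta(m^2)$ work for one output element, which is what a single GPGPU thread would execute.

```c
#include <assert.h>
#include <stddef.h>

/* One output element y(n1, n2): m * m multiply-accumulate steps.
 * Indices into h wrap modulo m (circular convolution, an assumption). */
static float conv2d_element(const float *x, const float *h,
                            size_t m, size_t n1, size_t n2)
{
    float sum = 0.0f;
    for (size_t k1 = 0; k1 < m; k1++) {
        size_t r = (n1 + m - k1) % m;       /* (n1 - k1) mod m */
        for (size_t k2 = 0; k2 < m; k2++) {
            size_t c = (n2 + m - k2) % m;   /* (n2 - k2) mod m */
            sum += x[k1 * m + k2] * h[r * m + c];
        }
    }
    return sum;
}

/* Sequential version: m * m independent outputs, Theta(m^4) total.
 * x, h, and y are m-by-m arrays in row-major order. */
void conv2d(const float *x, const float *h, float *y, size_t m)
{
    assert(m > 0);
    for (size_t n1 = 0; n1 < m; n1++) {
        for (size_t n2 = 0; n2 < m; n2++) {
            y[n1 * m + n2] = conv2d_element(x, h, m, n1, n2);
        }
    }
}
```

As a quick check, convolving any $x$ with a unit impulse at $h(0,0)$ returns $x$ unchanged, and an impulse at $h(0,1)$ shifts each row of $x$ circularly by one position.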

Multidimensional Fast Fourier Transform (M-D FFT)

In addition to convolution, the fast Fourier transform (FFT) is another technique that is often used in signal analysis.

Real Applications
