Content deleted Content added
→Vectorization: taking out the reference to SIMD, it's more nuanced than that. Tags: Mobile edit Mobile web edit Advanced mobile edit |
KylieTastic (talk | contribs) no longer true |
||
(26 intermediate revisions by 9 users not shown) | |||
Line 1:
{{Short description|Use of a GPU for computations typically assigned to CPUs}}
{{Use dmy dates|date=January 2015}}
{{More citations needed|date=February 2022}}
Line 5 ⟶ 6:
'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference |last1=Fung |first1=James |last2=Tang |first2=Felix |last3=Mann |first3=Steve |date=7–10 October 2002 |title=Mediated Reality Using Computer Graphics Hardware for Computer Vision |url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf |conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) |___location=Seattle, Washington, USA |pages=83–89 |archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf |archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal | url=https://link.springer.com/article/10.1007/s00779-003-0239-6 | doi=10.1007/s00779-003-0239-6 | title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality | year=2003 | last1=Aimone | first1=Chris | last2=Fung | first2=James | last3=Mann | first3=Steve | journal=Personal and Ubiquitous Computing | volume=7 | issue=5 | pages=236–248 | s2cid=25168728 | url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf |date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf |date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf |date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>
Essentially, a GPGPU [[graphics pipeline|pipeline]] is a kind of [[Parallel computing|parallel processing]] between one or more GPUs and CPUs,
GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s).
The best-known GPGPUs are [[Nvidia Tesla]] that are used for [[Nvidia DGX]], alongside [[AMD Instinct]] and Intel Gaudi.
Line 22 ⟶ 23:
==Implementations==
===Software libraries and APIs===
Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values, can create a GPGPU framework. Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]].
Line 68 ⟶ 70:
===Vectorization===
{{See also|Vector_processor#GPU_vector_processing_features|SIMD|SWAR|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section|date=July 2017}}
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline|date=July 2025}} For example, if one color {{angbr|R1, G1, B1}} is to be modulated by another color {{angbr|R2, G2, B2}}, the GPU can produce the resulting color {{angbr|R1*R2, G1*G2, B1*B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates.
Line 83 ⟶ 85:
A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.
GPGPU
===Caches===
Line 90 ⟶ 92:
===Register file===
GPUs have very large [[Register file|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs are 6 MiB, 14 MiB and 20 MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ |date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ |date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register|register file on CPUs]] is small, typically tens or hundreds of kilobytes.
In essence: almost all GPU workloads are inherently massively-parallel LOAD-COMPUTE-STORE in nature, such as [[Tiled rendering]]. Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall|Memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book | last1=Li | first1=Jie | last2=Michelogiannakis | first2=George | last3=Cook | first3=Brandon | last4=Cooray | first4=Dulanya | last5=Chen | first5=Yong | title=High Performance Computing | chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter | series=Lecture Notes in Computer Science | date=2023 | volume=13948 | pages=297–316 | doi=10.1007/978-3-031-32041-5_16 | isbn=978-3-031-32040-8 | chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)|caches]] (a [[D-cache]]) to solve this problem, however these are relativrly so large that they are impractical to introduce in GPUs which would need one per Processing Element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per Processing Element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor|Aspex ASP]].
===Energy efficiency===
Line 164 ⟶ 168:
====Flow control====
For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches|"branching"]] (the term "predicate mask" did not exist in 1967).
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
Line 395 ⟶ 401:
** [[Advanced Simulation Library]]
** [[Physics processing unit]] (PPU)
* {{Annotated link|Vector processor}}
* {{Annotated link|Single instruction, multiple threads}}
==References==
|