General-purpose computing on graphics processing units: Difference between revisions

Content deleted Content added
Implementations: all the "implementations" are actually software libraries and APIs
Tags: Mobile edit Mobile web edit Advanced mobile edit
no longer true
 
(22 intermediate revisions by 9 users not shown)
Line 1:
{{Short description|Use of a GPU for computations typically assigned to CPUs}}
 
{{Use dmy dates|date=January 2015}}
{{More citations needed|date=February 2022}}
Line 5 ⟶ 6:
'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference |last1=Fung |first1=James |last2=Tang |first2=Felix |last3=Mann |first3=Steve |date=7–10 October 2002 |title=Mediated Reality Using Computer Graphics Hardware for Computer Vision |url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf |conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) |location=Seattle, Washington, USA |pages=83–89 |archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf |archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal | url=https://link.springer.com/article/10.1007/s00779-003-0239-6 | doi=10.1007/s00779-003-0239-6 | title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality | year=2003 | last1=Aimone | first1=Chris | last2=Fung | first2=James | last3=Mann | first3=Steve | journal=Personal and Ubiquitous Computing | volume=7 | issue=5 | pages=236–248 | s2cid=25168728 | url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf |date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). 
[https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf |date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf |date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>
 
Essentially, a GPGPU [[graphics pipeline|pipeline]] is a kind of [[Parallel computing|parallel processing]] between one or more GPUs and CPUs, with special accelerated instructions for processing image or other graphic forms of data. While GPUs operate at lower frequencies, they typically have many times the number of [[Single instruction, multiple threads|processing elements]]. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into parallel form and then using the GPU to process it can (theoretically) create a large [[speedup]].
 
GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s). From the [[history of supercomputing]] it is well known that [[scientific computing]] drives the largest concentrations of computing power in history, listed in the [[TOP500]]: the majority today utilize [[GPU]]s.
 
The best-known GPGPUs are the [[Nvidia Tesla]] series, used in [[Nvidia DGX]] systems, alongside [[AMD Instinct]] and Intel Gaudi.
Line 69 ⟶ 70:
 
===Vectorization===
{{See also|Vector_processor#GPU_vector_processing_features|SIMD|SWAR|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section|date=July 2017}}
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline|date=July 2025}} For example, if one color {{angbr|R1, G1, B1}} is to be modulated by another color {{angbr|R2, G2, B2}}, the GPU can produce the resulting color {{angbr|R1*R2, G1*G2, B1*B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates.
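The color modulation described above can be sketched on the CPU as follows. This is only an illustration of the arithmetic, not GPU code: the function name <code>modulate</code> and the sample values are assumptions made for the example, and a GPU would issue the three multiplications as one vector operation rather than a loop.

```python
def modulate(c1, c2):
    """Return the component-wise product of two colors of equal length."""
    if len(c1) != len(c2):
        raise ValueError("colors must have the same number of components")
    # On a GPU, this element-by-element loop would be one vector multiply.
    return tuple(a * b for a, b in zip(c1, c2))


base = (0.5, 1.0, 0.25)  # hypothetical <R1, G1, B1>
tint = (0.5, 0.5, 1.0)   # hypothetical <R2, G2, B2>
result = modulate(base, tint)  # (0.25, 0.5, 0.25)
```

The same function works unchanged for 2- and 4-component values such as texture coordinates or RGBA colors.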
Line 84 ⟶ 85:
A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.
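Both examples can be illustrated with a small CPU-side sketch, assuming a grayscale image stored as a list of rows. The function names and the test image are hypothetical. On a GPU, each output pixel would typically be computed by its own thread instead of by nested loops.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def mean_brightness(image):
    """Average of all pixel values (the 'average lighting' example)."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def sobel_magnitude(image):
    """Gradient magnitude for every interior pixel (borders are skipped)."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = gy = 0
            # Each output pixel depends only on its own 3x3 neighborhood,
            # which is what makes the filter easy to run in parallel.
            for ky in range(3):
                for kx in range(3):
                    p = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * p
                    gy += SOBEL_Y[ky][kx] * p
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
average = mean_brightness(image)
edges = sobel_magnitude(image)
```

Because no output pixel depends on another output pixel, the work can be split across as many processing elements as there are pixels.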
 
GPGPU is a software concept, not a hardware concept; it is a type of [[algorithm]], not a piece of equipment. Specialized equipment designs may, however, further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a ''rack''), which adds a third layer{{snd}} many computing units each using many CPUs to correspond to many GPUs. Some [[Bitcoin]] "miners" used such setups for high-quantity processing. The largest such systems in the world are listed in the [[TOP500]] supercomputer list.
 
===Caches===
Line 92 ⟶ 93:
GPUs have very large [[Register file|register files]], which allow them to reduce context-switching latency. Register file size has also increased over successive GPU generations: the total register file sizes on Maxwell (GM200), Pascal and Volta GPUs are 6&nbsp;MiB, 14&nbsp;MiB and 20&nbsp;MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ |date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ |date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register|register file on CPUs]] is small, typically tens or hundreds of kilobytes.
 
In essence: almost all GPU workloads are inherently massively parallel LOAD-COMPUTE-STORE in nature, such as [[tiled rendering]]. Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall|memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book | last1=Li | first1=Jie | last2=Michelogiannakis | first2=George | last3=Cook | first3=Brandon | last4=Cooray | first4=Dulanya | last5=Chen | first5=Yong | title=High Performance Computing | chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter | series=Lecture Notes in Computer Science | date=2023 | volume=13948 | pages=297–316 | doi=10.1007/978-3-031-32041-5_16 | isbn=978-3-031-32040-8 | chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)|caches]] (a [[D-cache]]) to solve this problem; however, these are relatively so large that they are impractical to introduce in GPUs, which would need one per processing element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per processing element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor|Aspex ASP]].
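The cost of storing a temporary vector can be made concrete with a toy model that simply counts simulated memory accesses. The sketch below is an assumption-laden illustration, not a measurement: the function names, the computation <code>(scale * x + offset)²</code> and the one-access-per-element accounting are all invented for the example.

```python
def unfused(xs, scale, offset):
    """Two passes with a temporary vector kept in (simulated) main memory."""
    accesses = 0
    tmp = [0] * len(xs)
    for i in range(len(xs)):
        x = xs[i]                    # LOAD
        accesses += 1
        tmp[i] = scale * x + offset  # COMPUTE, then STORE the temporary
        accesses += 1
    out = [0] * len(xs)
    for i in range(len(xs)):
        t = tmp[i]                   # LOAD the temporary back
        accesses += 1
        out[i] = t * t               # COMPUTE, then STORE
        accesses += 1
    return out, accesses


def fused(xs, scale, offset):
    """Single pass: the intermediate value stays in a local 'register'."""
    accesses = 0
    out = [0] * len(xs)
    for i in range(len(xs)):
        x = xs[i]                    # LOAD
        accesses += 1
        t = scale * x + offset       # COMPUTE, kept local
        out[i] = t * t               # COMPUTE, then STORE
        accesses += 1
    return out, accesses


data = [1, 2, 3]
slow_result, slow_accesses = unfused(data, 2, 1)  # 12 simulated accesses
fast_result, fast_accesses = fused(data, 2, 1)    # 6 simulated accesses
```

Both versions produce the same output, but the fused version halves the simulated memory traffic because the intermediate never leaves the processing element. Holding such intermediates locally is what requires a large register file.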
 
===Energy efficiency===
Line 167 ⟶ 168:
 
====Flow control====
For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches|"branching"]] (the term "predicate mask" did not exist in 1967).
 
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
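A conditional write built from bit operations can be sketched as follows. The helper names <code>select</code>, <code>branchless_abs</code> and <code>clamp_negatives</code> are hypothetical, and the sketch relies on Python integers behaving like two's-complement values under <code>&</code>, <code>|</code> and <code>~</code>. Both candidate values are always computed, and a mask decides which one is kept.

```python
def select(cond, if_true, if_false):
    """Pick one of two integers without an if-statement."""
    mask = -int(bool(cond))  # -1 (all bits set) when true, 0 when false
    return (if_true & mask) | (if_false & ~mask)


def branchless_abs(x):
    """Absolute value: both x and -x are computed, the mask keeps one."""
    return select(x < 0, -x, x)


def clamp_negatives(values):
    """Replace negative entries by zero using the same masking idiom."""
    return [select(v < 0, 0, v) for v in values]


clamped = clamp_negatives([-2, 0, 5])  # [0, 0, 5]
```

The trade-off is that work is done for both outcomes on every element, so the idiom suits short conditionals rather than long divergent code paths.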