General-purpose computing on graphics processing units: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Added work. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 165/492
no longer true
 
(31 intermediate revisions by 11 users not shown)
Line 1:
{{Short description|Use of a GPU for computations typically assigned to CPUs}}
 
{{Use dmy dates|date=January 2015}}
{{More citations needed|date=February 2022}}
 
'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference |last1=Fung |first1=James |last2=Tang |first2=Felix |last3=Mann |first3=Steve |date=7–10 October 2002 |title=Mediated Reality Using Computer Graphics Hardware for Computer Vision |url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf |conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) |___location=Seattle, Washington, USA |pages=83–89 |archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf |archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal | url=https://link.springer.com/article/10.1007/s00779-003-0239-6 | doi=10.1007/s00779-003-0239-6 | title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality | year=2003 | last1=Aimone | first1=Chris | last2=Fung | first2=James | last3=Mann | first3=Steve | journal=Personal and Ubiquitous Computing | volume=7 | issue=5 | pages=236–248 | s2cid=25168728 | url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf |date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf |date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf |date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>
 
Essentially, a GPGPU [[graphics pipeline|pipeline]] is a kind of [[Parallel computing|parallel processing]] between one or more GPUs and CPUs, thatwith analyzesspecial dataaccelerated as if itinstructions werefor inprocessing image or other graphic formforms of data. While GPUs operate at lower frequencies, they typically have many times the number of [[Multi-coreSingle processorinstruction, multiple threads|coresProcessing elements]]. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into graphicalparallel form and then using the GPU to scan and analyzeprocess it can (theoretically) create a large [[speedup]].
 
GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s). TheseFrom pipelinesthe were[[history foundof tosupercomputing]] fitit is well-known that [[scientific computing]] needsdrives well,the andlargest haveconcentrations sinceof beenComputing developedpower in thishistory, directionlisted in the [[TOP500]]: the majority today utilize [[GPU]]s.
 
The best-known GPGPUs are [[Nvidia Tesla]] that are used for [[Nvidia DGX]], alongside [[AMD Instinct]] and Intel Gaudi.
Line 15 ⟶ 16:
In principle, any arbitrary [[Boolean function]], including addition, multiplication, and other mathematical functions, can be built up from a [[functional completeness|functionally complete]] set of logic operators. In 1987, [[Conway's Game of Life]] became one of the first examples of general-purpose computing using an early [[stream processing|stream processor]] called a [[blitter]] to invoke a special sequence of [[bit blit|logical operations]] on bit vectors.<ref>{{cite journal|last=Hull|first=Gerald|title=LIFE|journal=Amazing Computing|volume=2|issue=12|pages=81–84|date=December 1987|url=https://archive.org/stream/amazing-computing-magazine-1987-12/Amazing_Computing_Vol_02_12_1987_Dec#page/n81/mode/2up}}</ref>
 
General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable [[shader]]s and [[floating point]] support on graphics processors. Notably, problems involving [[matrix (mathematics)|matrices]] and/or [[vector (mathematics and physics)|vector]]s{{snd}} especially two-, three-, or four-dimensional vectors{{snd}} were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.<ref>{{Cite journal |last1=Krüger |first1=Jens |last2=Westermann |first2=Rüdiger |date=July 2003 |title=Linear algebra operators for GPU implementation of numerical algorithms |url=https://dl.acm.org/doi/10.1145/882262.882363 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=908–916 |doi=10.1145/882262.882363 |issn=0730-0301|url-access=subscription }}</ref><ref>{{Cite journal |last1=Bolz |first1=Jeff |last2=Farmer |first2=Ian |last3=Grinspun |first3=Eitan |last4=Schröder |first4=Peter |date=July 2003 |title=Sparse matrix solvers on the GPU: conjugate gradients and multigrid |url=https://dl.acm.org/doi/10.1145/882262.882364 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=917–924 |doi=10.1145/882262.882364 |issn=0730-0301|url-access=subscription }}</ref> These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, [[OpenGL]] and [[DirectX]]. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as [[Lib Sh|Sh]]/[[RapidMind]], [[BrookGPU|Brook]] and Accelerator.<ref>{{cite journal |last1=Tarditi |first1=David |first2=Sidd |last2=Puri |first3=Jose |last3=Oglesby |title=Accelerator: using data parallelism to program GPUs for general-purpose uses |journal=ACM SIGARCH Computer Architecture News |volume=34 |issue=5 |date=2006|url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f07/public/discussion-papers/26-tarditi-asplos06.pdf|doi=10.1145/1168919.1168898 }}</ref><ref>{{cite journal |last1=Che |first1=Shuai |last2=Boyer |first2=Michael |last3=Meng |first3=Jiayuan |last4=Tarjan |first4=D. |last5=Sheaffer |first5=Jeremy W. |last6=Skadron |first6=Kevin |title=A performance study of general-purpose applications on graphics processors using CUDA |journal=J. Parallel and Distributed Computing |volume=68 |issue=10 |date=2008 |pages=1370–1380 |doi=10.1016/j.jpdc.2008.05.014 |df=dmy-all |citeseerx=10.1.1.143.4849 }}</ref><ref>{{cite journal |last1=Glaser |first1=J. |last2=Nguyen |first2=T. D. |last3=Anderson |first3=J. A. |last4=Lui |first4=P. |last5=Spiga |first5=F. |last6=Millan |first6=J. A. |last7=Morse |first7=D. C. |last8=Glotzer |first8=S. C. |date=2015 |title=Strong scaling of general-purpose molecular dynamics simulations on GPUs |journal=Computer Physics Communications |volume=192 |pages=97–107 | doi=10.1016/j.cpc.2015.02.028|arxiv=1412.3387 |bibcode=2015CoPhC.192...97G | doi-access=free}}</ref>
 
These were followed by Nvidia's [[CUDA]], which allowed programmers to ignore the underlying graphical concepts in favor of more common [[high-performance computing]] concepts.<ref name="du">{{Cite journal |doi= 10.1016/j.parco.2011.10.002 |title= From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming |journal= Parallel Computing |volume= 38 |issue= 8 |pages= 391–407 |year= 2012 |last1= Du |first1= Peng |last2= Weber |first2= Rick |last3= Luszczek |first3= Piotr |last4= Tomov |first4= Stanimire |last5= Peterson |first5= Gregory |last6= Dongarra |first6= Jack |author-link6= Jack Dongarra |df= dmy-all |citeseerx= 10.1.1.193.7712 }}</ref> Newer, hardware-vendor-independent offerings include Microsoft's [[DirectCompute]] and Apple/Khronos Group's [[OpenCL]].<ref name="du"/> This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.
Line 22 ⟶ 23:
 
==Implementations==
===Software libraries and APIs===
Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values, can create a GPGPU framework. Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]].
 
Line 68 ⟶ 70:
 
===Vectorization===
{{See also|Vector_processor#GPU_vector_processing_features|SIMD|SWAR|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section|date=July 2017}}
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline|date=July 2025}} For example, if one color {{angbr|R1, G1, B1}} is to be modulated by another color {{angbr|R2, G2, B2}}, the GPU can produce the resulting color {{angbr|R1*R2, G1*G2, B1*B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions, termed single instruction, multiple data ([[Single instruction, multiple data|SIMD]]), have long been available on CPUs.{{citation needed|date=July 2017}}
 
==GPU vs. CPU==
Line 82 ⟶ 85:
A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.
 
GPGPU is fundamentallyas a software concept, not a hardware concept; it is a type of [[algorithm]], not a piece of equipment. Specialized equipment designs may, however, even further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a ''rack''), which adds a third layer{{snd}} many computing units each using many CPUs to correspond to many GPUs. Some [[Bitcoin]] "miners" used such setups for high-quantity processing. Insights into the largest such systems in the world has been maintained at the [[TOP500]] supercomputer list.
 
===Caches===
Line 89 ⟶ 92:
===Register file===
GPUs have very large [[Register file|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs are 6&nbsp;MiB, 14&nbsp;MiB and 20&nbsp;MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ |date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ |date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register|register file on CPUs]] is small, typically tens or hundreds of kilobytes.
 
In essence: almost all GPU workloads are inherently massively-parallel LOAD-COMPUTE-STORE in nature, such as [[Tiled rendering]]. Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall|Memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book | last1=Li | first1=Jie | last2=Michelogiannakis | first2=George | last3=Cook | first3=Brandon | last4=Cooray | first4=Dulanya | last5=Chen | first5=Yong | title=High Performance Computing | chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter | series=Lecture Notes in Computer Science | date=2023 | volume=13948 | pages=297–316 | doi=10.1007/978-3-031-32041-5_16 | isbn=978-3-031-32040-8 | chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)|caches]] (a [[D-cache]]) to solve this problem, however these are relativrly so large that they are impractical to introduce in GPUs which would need one per Processing Element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per Processing Element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor|Aspex ASP]].
 
===Energy efficiency===
Line 163 ⟶ 168:
 
====Flow control====
For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches|"branching"]] (the term "predicate mask" did not exist in 1967).
 
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
 
Line 218 ⟶ 225:
* [[Level set methods]]
* [[Computed tomography|CT]] reconstruction<ref>Jimenez, Edward S., and Laurel J. Orr. "[https://www.osti.gov/servlets/purl/1106909 Rethinking the union of computed tomography reconstruction and GPGPU computing]." Penetrating Radiation Systems and Applications XIV. Vol. 8854. International Society for Optics and Photonics, 2013.</ref>
* [[Fast Fourier transform]]<ref>{{Cite journal |url=https://www.researchgate.net/publication/5462925 |doi=10.1109/TMI.2007.909834 |title=Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware |year=2008 |last1=Sorensen |first1=T.S. |last2=Schaeffter |first2=T. |last3=Noe |first3=K.O. |last4=Hansen |first4=M.S. |journal=IEEE Transactions on Medical Imaging |volume=27 |issue=4 |pages=538–547 |pmid=18390350 |bibcode=2008ITMI...27..538S |s2cid=206747049 }}</ref>
* GPU learning{{snd}} [[machine learning]] and [[data mining]] computations, e.g., with software BIDMach
* [[k-nearest neighbor algorithm]]<ref>{{cite arXiv | eprint=0804.1448 | last1=Garcia | first1=Vincent | last2=Debreuve | first2=Eric | last3=Barlaud | first3=Michel | title=Fast k Nearest Neighbor Search using GPU | year=2008 | class=cs.CV }}</ref>
Line 265 ⟶ 272:
* [[Number theory]]
** [[Primality test]]ing and [[integer factorization]]<ref>{{cite web|url=https://mersenne.org/various/works.php|title=How GIMPS Works|work=Great Internet Mersenne Prime Search|access-date=6 March 2025}}</ref>
* [[Bioinformatics]]<ref>{{cite journal|doi=10.1186/1471-2105-8-474|pmid=18070356|pmc=2222658|title=High-throughput sequence alignment using Graphics Processing Units|journal=BMC Bioinformatics|volume=8|pagesarticle-number=474|year=2007|last1=Schatz|first1=Michael C|last2=Trapnell|first2=Cole|last3=Delcher|first3=Arthur L|last4=Varshney|first4=Amitabh |doi-access=free }}</ref><ref name=Manavski2008>{{cite journal |author=Svetlin A. Manavski |author2=Giorgio Valle |title=CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment |journal=BMC Bioinformatics |volume=9 |issue=Suppl. 2 |page=S10 |date=2008 |doi=10.1186/1471-2105-9-s2-s10 |pmid=18387198 |pmc=2323659 |df=dmy-all |doi-access=free }}</ref>
* [[Medical imaging]]
* [[Clinical decision support system]] (CDSS)<ref>{{cite journal|last1=Olejnik|first1=M|last2=Steuwer|first2=M|last3=Gorlatch|first3=S|last4=Heider|first4=D|title=gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.|journal=Bioinformatics|date=15 November 2014|volume=30|issue=22|pages=3272–3|pmid=25123901|doi=10.1093/bioinformatics/btu535|doi-access=free}}</ref>
Line 394 ⟶ 401:
** [[Advanced Simulation Library]]
** [[Physics processing unit]] (PPU)
* {{Annotated link|Vector processor}}
* {{Annotated link|Single instruction, multiple threads}}
 
==References==
Line 399 ⟶ 408:
 
== Further reading ==
* {{Cite journal |last1=Owens |first1=J.D. |last2=Houston |first2=M. |last3=Luebke |first3=D. |last4=Green |first4=S. |last5=Stone |first5=J.E. |last6=Phillips |first6=J.C. |date=May 2008 |title=GPU Computing |url=https://ieeexplore.ieee.org/document/4490127 |journal=Proceedings of the IEEE |volume=96 |issue=5 |pages=879–899 |doi=10.1109/JPROC.2008.917757 |s2cid=17091128 |issn=0018-9219}}
* {{Cite journal |last1=Brodtkorb |first1=André R. |last2=Hagen |first2=Trond R. |last3=Sætra |first3=Martin L. |date=2013-01-01 |title=Graphics processing unit (GPU) programming strategies and trends in GPU computing |url=https://www.sciencedirect.com/science/article/pii/S0743731512000998 |journal=Journal of Parallel and Distributed Computing |series=Metaheuristics on GPUs |volume=73 |issue=1 |pages=4–13 |doi=10.1016/j.jpdc.2012.04.003 |issn=0743-7315|hdl=10852/40283 |hdl-access=free }}