Content deleted Content added
Citation bot (talk | contribs) Added work. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 165/492 |
KylieTastic (talk | contribs) no longer true |
||
(31 intermediate revisions by 11 users not shown) | |||
Line 1:
{{Short description|Use of a GPU for computations typically assigned to CPUs}}
{{Use dmy dates|date=January 2015}}
{{More citations needed|date=February 2022}}
'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference |last1=Fung |first1=James |last2=Tang |first2=Felix |last3=Mann |first3=Steve |date=7–10 October 2002 |title=Mediated Reality Using Computer Graphics Hardware for Computer Vision |url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf |conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) |___location=Seattle, Washington, USA |pages=83–89 |archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf |archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal | url=https://link.springer.com/article/10.1007/s00779-003-0239-6 | doi=10.1007/s00779-003-0239-6 | title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality | year=2003 | last1=Aimone | first1=Chris | last2=Fung | first2=James | last3=Mann | first3=Steve | journal=Personal and Ubiquitous Computing | volume=7 | issue=5 | pages=236–248 | s2cid=25168728 | url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf |date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf |date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf |date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>
Essentially, a GPGPU [[graphics pipeline|pipeline]] is a kind of [[Parallel computing|parallel processing]] between one or more GPUs and CPUs,
GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s).
The best-known GPGPUs are [[Nvidia Tesla]] that are used for [[Nvidia DGX]], alongside [[AMD Instinct]] and Intel Gaudi.
Line 15 ⟶ 16:
In principle, any arbitrary [[Boolean function]], including addition, multiplication, and other mathematical functions, can be built up from a [[functional completeness|functionally complete]] set of logic operators. In 1987, [[Conway's Game of Life]] became one of the first examples of general-purpose computing using an early [[stream processing|stream processor]] called a [[blitter]] to invoke a special sequence of [[bit blit|logical operations]] on bit vectors.<ref>{{cite journal|last=Hull|first=Gerald|title=LIFE|journal=Amazing Computing|volume=2|issue=12|pages=81–84|date=December 1987|url=https://archive.org/stream/amazing-computing-magazine-1987-12/Amazing_Computing_Vol_02_12_1987_Dec#page/n81/mode/2up}}</ref>
General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable [[shader]]s and [[floating point]] support on graphics processors. Notably, problems involving [[matrix (mathematics)|matrices]] and/or [[vector (mathematics and physics)|vector]]s{{snd}} especially two-, three-, or four-dimensional vectors{{snd}} were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.<ref>{{Cite journal |last1=Krüger |first1=Jens |last2=Westermann |first2=Rüdiger |date=July 2003 |title=Linear algebra operators for GPU implementation of numerical algorithms |url=https://dl.acm.org/doi/10.1145/882262.882363 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=908–916 |doi=10.1145/882262.882363 |issn=0730-0301|url-access=subscription }}</ref><ref>{{Cite journal |last1=Bolz |first1=Jeff |last2=Farmer |first2=Ian |last3=Grinspun |first3=Eitan |last4=Schröder |first4=Peter |date=July 2003 |title=Sparse matrix solvers on the GPU: conjugate gradients and multigrid |url=https://dl.acm.org/doi/10.1145/882262.882364 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=917–924 |doi=10.1145/882262.882364 |issn=0730-0301|url-access=subscription }}</ref> These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, [[OpenGL]] and [[DirectX]]. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as [[Lib Sh|Sh]]/[[RapidMind]], [[BrookGPU|Brook]] and Accelerator.<ref>{{cite journal |last1=Tarditi |first1=David |first2=Sidd |last2=Puri |first3=Jose |last3=Oglesby |title=Accelerator: using data parallelism to program GPUs for general-purpose uses |journal=ACM SIGARCH Computer Architecture News |volume=34 |issue=5 |date=2006|url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f07/public/discussion-papers/26-tarditi-asplos06.pdf|doi=10.1145/1168919.1168898 }}</ref><ref>{{cite journal |last1=Che |first1=Shuai |last2=Boyer |first2=Michael |last3=Meng |first3=Jiayuan |last4=Tarjan |first4=D. |last5=Sheaffer |first5=Jeremy W. |last6=Skadron |first6=Kevin |title=A performance study of general-purpose applications on graphics processors using CUDA |journal=J. Parallel and Distributed Computing |volume=68 |issue=10 |date=2008 |pages=1370–1380 |doi=10.1016/j.jpdc.2008.05.014 |df=dmy-all |citeseerx=10.1.1.143.4849 }}</ref><ref>{{cite journal |last1=Glaser |first1=J. |last2=Nguyen |first2=T. D. |last3=Anderson |first3=J. A. |last4=Lui |first4=P. |last5=Spiga |first5=F. |last6=Millan |first6=J. A. |last7=Morse |first7=D. C. |last8=Glotzer |first8=S. C. |date=2015 |title=Strong scaling of general-purpose molecular dynamics simulations on GPUs |journal=Computer Physics Communications |volume=192 |pages=97–107 | doi=10.1016/j.cpc.2015.02.028|arxiv=1412.3387 |bibcode=2015CoPhC.192...97G | doi-access=free}}</ref>
These were followed by Nvidia's [[CUDA]], which allowed programmers to ignore the underlying graphical concepts in favor of more common [[high-performance computing]] concepts.<ref name="du">{{Cite journal |doi= 10.1016/j.parco.2011.10.002 |title= From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming |journal= Parallel Computing |volume= 38 |issue= 8 |pages= 391–407 |year= 2012 |last1= Du |first1= Peng |last2= Weber |first2= Rick |last3= Luszczek |first3= Piotr |last4= Tomov |first4= Stanimire |last5= Peterson |first5= Gregory |last6= Dongarra |first6= Jack |author-link6= Jack Dongarra |df= dmy-all |citeseerx= 10.1.1.193.7712 }}</ref> Newer, hardware-vendor-independent offerings include Microsoft's [[DirectCompute]] and Apple/Khronos Group's [[OpenCL]].<ref name="du"/> This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.
Line 22 ⟶ 23:
==Implementations==
===Software libraries and APIs===
Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values, can create a GPGPU framework. Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]].
Line 68 ⟶ 70:
===Vectorization===
{{See also|Vector_processor#GPU_vector_processing_features|SIMD|SWAR|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section|date=July 2017}}
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline|date=July 2025}} For example, if one color {{angbr|R1, G1, B1}} is to be modulated by another color {{angbr|R2, G2, B2}}, the GPU can produce the resulting color {{angbr|R1*R2, G1*G2, B1*B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates.
==GPU vs. CPU==
Line 82 ⟶ 85:
A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.
GPGPU
===Caches===
Line 89 ⟶ 92:
===Register file===
GPUs have very large [[Register file|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs are 6 MiB, 14 MiB and 20 MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ |date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ |date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register|register file on CPUs]] is small, typically tens or hundreds of kilobytes.
In essence: almost all GPU workloads are inherently massively-parallel LOAD-COMPUTE-STORE in nature, such as [[Tiled rendering]]. Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall|Memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book | last1=Li | first1=Jie | last2=Michelogiannakis | first2=George | last3=Cook | first3=Brandon | last4=Cooray | first4=Dulanya | last5=Chen | first5=Yong | title=High Performance Computing | chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter | series=Lecture Notes in Computer Science | date=2023 | volume=13948 | pages=297–316 | doi=10.1007/978-3-031-32041-5_16 | isbn=978-3-031-32040-8 | chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)|caches]] (a [[D-cache]]) to solve this problem, however these are relativrly so large that they are impractical to introduce in GPUs which would need one per Processing Element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per Processing Element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor|Aspex ASP]].
===Energy efficiency===
Line 163 ⟶ 168:
====Flow control====
For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches|"branching"]] (the term "predicate mask" did not exist in 1967).
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
Line 218 ⟶ 225:
* [[Level set methods]]
* [[Computed tomography|CT]] reconstruction<ref>Jimenez, Edward S., and Laurel J. Orr. "[https://www.osti.gov/servlets/purl/1106909 Rethinking the union of computed tomography reconstruction and GPGPU computing]." Penetrating Radiation Systems and Applications XIV. Vol. 8854. International Society for Optics and Photonics, 2013.</ref>
* [[Fast Fourier transform]]<ref>{{Cite journal |url=https://www.researchgate.net/publication/5462925 |doi=10.1109/TMI.2007.909834 |title=Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware |year=2008 |last1=Sorensen |first1=T.S. |last2=Schaeffter |first2=T. |last3=Noe |first3=K.O. |last4=Hansen |first4=M.S. |journal=IEEE Transactions on Medical Imaging |volume=27 |issue=4 |pages=538–547 |pmid=18390350 |bibcode=2008ITMI...27..538S |s2cid=206747049 }}</ref>
* GPU learning{{snd}} [[machine learning]] and [[data mining]] computations, e.g., with software BIDMach
* [[k-nearest neighbor algorithm]]<ref>{{cite arXiv | eprint=0804.1448 | last1=Garcia | first1=Vincent | last2=Debreuve | first2=Eric | last3=Barlaud | first3=Michel | title=Fast k Nearest Neighbor Search using GPU | year=2008 | class=cs.CV }}</ref>
Line 265 ⟶ 272:
* [[Number theory]]
** [[Primality test]]ing and [[integer factorization]]<ref>{{cite web|url=https://mersenne.org/various/works.php|title=How GIMPS Works|work=Great Internet Mersenne Prime Search|access-date=6 March 2025}}</ref>
* [[Bioinformatics]]<ref>{{cite journal|doi=10.1186/1471-2105-8-474|pmid=18070356|pmc=2222658|title=High-throughput sequence alignment using Graphics Processing Units|journal=BMC Bioinformatics|volume=8|
* [[Medical imaging]]
* [[Clinical decision support system]] (CDSS)<ref>{{cite journal|last1=Olejnik|first1=M|last2=Steuwer|first2=M|last3=Gorlatch|first3=S|last4=Heider|first4=D|title=gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.|journal=Bioinformatics|date=15 November 2014|volume=30|issue=22|pages=3272–3|pmid=25123901|doi=10.1093/bioinformatics/btu535|doi-access=free}}</ref>
Line 394 ⟶ 401:
** [[Advanced Simulation Library]]
** [[Physics processing unit]] (PPU)
* {{Annotated link|Vector processor}}
* {{Annotated link|Single instruction, multiple threads}}
==References==
Line 399 ⟶ 408:
== Further reading ==
* {{Cite journal |last1=Owens |first1=J.D. |last2=Houston |first2=M. |last3=Luebke |first3=D. |last4=Green |first4=S. |last5=Stone |first5=J.E. |last6=Phillips |first6=J.C. |date=May 2008 |title=GPU Computing
* {{Cite journal |last1=Brodtkorb |first1=André R. |last2=Hagen |first2=Trond R. |last3=Sætra |first3=Martin L. |date=2013-01-01 |title=Graphics processing unit (GPU) programming strategies and trends in GPU computing |url=https://www.sciencedirect.com/science/article/pii/S0743731512000998 |journal=Journal of Parallel and Distributed Computing |series=Metaheuristics on GPUs |volume=73 |issue=1 |pages=4–13 |doi=10.1016/j.jpdc.2012.04.003 |issn=0743-7315|hdl=10852/40283 |hdl-access=free }}
|