{{Short description|Use of a GPU for computations typically assigned to CPUs}}
{{Use dmy dates|date=January 2015}}
{{More citations needed|date=February 2022}}
'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference |last1=Fung |first1=James |last2=Tang |first2=Felix |last3=Mann |first3=Steve |date=7–10 October 2002 |title=Mediated Reality Using Computer Graphics Hardware for Computer Vision |url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf |conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) |___location=Seattle, Washington, USA |pages=83–89 |archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf |archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal | url=https://link.springer.com/article/10.1007/s00779-003-0239-6 | doi=10.1007/s00779-003-0239-6 | title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality | year=2003 | last1=Aimone | first1=Chris | last2=Fung | first2=James | last3=Mann | first3=Steve | journal=Personal and Ubiquitous Computing | volume=7 | issue=5 | pages=236–248 | s2cid=25168728 | url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf |date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf |date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf |date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>
Essentially, a GPGPU [[graphics pipeline|pipeline]] is a kind of [[Parallel computing|parallel processing]] between one or more GPUs and CPUs that analyzes data as if it were in image or other graphic form.
GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s).
These pipelines were found to fit [[scientific computing]] needs well, and have since been developed in this direction.
==History==
In principle, any arbitrary [[Boolean function]], including addition, multiplication, and other mathematical functions, can be built up from a [[functional completeness|functionally complete]] set of logic operators. In 1987, [[Conway's Game of Life]] became one of the first examples of general-purpose computing using an early [[stream processing|stream processor]] called a [[blitter]] to invoke a special sequence of [[bit blit|logical operations]] on bit vectors.<ref>{{cite journal|last=Hull|first=Gerald|title=LIFE|journal=Amazing Computing|volume=2|issue=12|pages=81–84|date=December 1987|url=https://archive.org/stream/amazing-computing-magazine-1987-12/Amazing_Computing_Vol_02_12_1987_Dec#page/n81/mode/2up}}</ref>
General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable [[shader]]s and [[floating point]] support on graphics processors. Notably, problems involving [[matrix (mathematics)|matrices]] and/or [[vector (mathematics and physics)|vector]]s{{snd}} especially two-, three-, or four-dimensional vectors{{snd}} were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.<ref>{{Cite journal |last1=Krüger |first1=Jens |last2=Westermann |first2=Rüdiger |date=July 2003 |title=Linear algebra operators for GPU implementation of numerical algorithms |url=https://dl.acm.org/doi/10.1145/882262.882363 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=908–916 |doi=10.1145/882262.882363 |issn=0730-0301|url-access=subscription }}</ref><ref>{{Cite journal |last1=Bolz |first1=Jeff |last2=Farmer |first2=Ian |last3=Grinspun |first3=Eitan |last4=Schröder |first4=Peter |date=July 2003 |title=Sparse matrix solvers on the GPU: conjugate gradients and multigrid |url=https://dl.acm.org/doi/10.1145/882262.882364 |journal=ACM Transactions on Graphics |language=en |volume=22 |issue=3 |pages=917–924 |doi=10.1145/882262.882364 |issn=0730-0301|url-access=subscription }}</ref> These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, [[OpenGL]] and [[DirectX]]. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as [[Lib Sh|Sh]]/[[RapidMind]], [[BrookGPU|Brook]] and Accelerator.<ref>{{cite journal |last1=Tarditi |first1=David |first2=Sidd |last2=Puri |first3=Jose |last3=Oglesby |title=Accelerator: using data parallelism to program GPUs for general-purpose uses |journal=ACM SIGARCH Computer Architecture News |volume=34 |issue=5 |date=2006|url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f07/public/discussion-papers/26-tarditi-asplos06.pdf|doi=10.1145/1168919.1168898 }}</ref><ref>{{cite journal |last1=Che |first1=Shuai |last2=Boyer |first2=Michael |last3=Meng |first3=Jiayuan |last4=Tarjan |first4=D. |last5=Sheaffer |first5=Jeremy W. |last6=Skadron |first6=Kevin |title=A performance study of general-purpose applications on graphics processors using CUDA |journal=J. Parallel and Distributed Computing |volume=68 |issue=10 |date=2008 |pages=1370–1380 |doi=10.1016/j.jpdc.2008.05.014 |df=dmy-all |citeseerx=10.1.1.143.4849 }}</ref><ref>{{cite journal |last1=Glaser |first1=J. |last2=Nguyen |first2=T. D. |last3=Anderson |first3=J. A. |last4=Lui |first4=P. |last5=Spiga |first5=F. |last6=Millan |first6=J. A. |last7=Morse |first7=D. C. |last8=Glotzer |first8=S. C. |date=2015 |title=Strong scaling of general-purpose molecular dynamics simulations on GPUs |journal=Computer Physics Communications |volume=192 |pages=97–107 | doi=10.1016/j.cpc.2015.02.028|arxiv=1412.3387 |bibcode=2015CoPhC.192...97G | doi-access=free}}</ref>
These were followed by Nvidia's [[CUDA]], which allowed programmers to ignore the underlying graphical concepts in favor of more common [[high-performance computing]] concepts.<ref name="du">{{Cite journal |doi= 10.1016/j.parco.2011.10.002 |title= From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming |journal= Parallel Computing |volume= 38 |issue= 8 |pages= 391–407 |year= 2012 |last1= Du |first1= Peng |last2= Weber |first2= Rick |last3= Luszczek |first3= Piotr |last4= Tomov |first4= Stanimire |last5= Peterson |first5= Gregory |last6= Dongarra |first6= Jack |author-link6= Jack Dongarra |df= dmy-all |citeseerx= 10.1.1.193.7712 }}</ref> Newer, hardware-vendor-independent offerings include Microsoft's [[DirectCompute]] and Apple/Khronos Group's [[OpenCL]].<ref name="du"/> This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.
==Implementations==
===Software libraries and APIs===
Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values can create a GPGPU framework. Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]].
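As a concrete illustration, the following minimal [[CUDA]] sketch (all names are illustrative) shows the round trip described above: the CPU allocates GPU memory, dispatches a kernel, and then reads the computed values back.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>
#include <cstdio>

// GPU kernel: square each element of the array in place.
__global__ void square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    square<<<(n + 127) / 128, 128>>>(dev, n);  // CPU dispatches work to the GPU

    // CPU retrieves the return values computed on the GPU.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[3] = %f\n", host[3]);  // prints 9.0
    return 0;
}
</syntaxhighlight>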
[[ROCm]], launched in 2016, is AMD's open-source response to CUDA. As of 2022, it is roughly on par with CUDA in terms of features,{{citation needed|date=December 2024}} but still lags in consumer support.{{citation needed|date=December 2024}}
OpenVIDIA was developed at [[University of Toronto]] between 2003 and 2005,<ref name="Fung"/> in collaboration with Nvidia.
Altimesh Hybridizer, created by [[Altimesh]], compiles [[Common Intermediate Language]] to CUDA binaries.<ref>{{cite web|title=Hybridizer|url=http://www.altimesh.com/hybridizer-essentials/|website=Hybridizer|url-status=live|archive-url=https://web.archive.org/web/20171017150337/http://www.altimesh.com/hybridizer-essentials/|archive-date=17 October 2017|df=dmy-all}}</ref><ref>{{cite web|title=Home page|url=http://www.altimesh.com/|website=Altimesh|url-status=live|archive-url=https://web.archive.org/web/20171017145518/http://www.altimesh.com/|archive-date=17 October 2017|df=dmy-all}}</ref> It supports generics and virtual functions.<ref>{{cite web|title=Hybridizer generics and inheritance|url=http://www.altimesh.com/generics-and-inheritance/|url-status=live|archive-url=https://web.archive.org/web/20171017145927/http://www.altimesh.com/generics-and-inheritance/|archive-date=17 October 2017|df=dmy-all|date=2017-07-27}}</ref> Debugging and profiling are integrated with [[Visual Studio]] and Nsight.<ref>{{cite web|title=Debugging and Profiling with Hybridizer|url=http://www.altimesh.com/debugging-and-profiling/|url-status=live|archive-url=https://web.archive.org/web/20171017201449/http://www.altimesh.com/debugging-and-profiling/|archive-date=17 October 2017|df=dmy-all|date=2017-06-05}}</ref> It is available as a Visual Studio extension on the Visual Studio Marketplace.
Due to the increasing power of mobile GPUs, general-purpose programming has also become available on mobile devices running major [[mobile operating system]]s.
[[Google]] [[Android (operating system)|Android]] 4.2 enabled running [[RenderScript]] code on the mobile device GPU.<ref>{{cite web|url=http://developer.android.com/about/versions/android-4.2.html|title=Android 4.2 APIs - Android Developers|website=developer.android.com|url-status=live|archive-url=https://web.archive.org/web/20130826191621/http://developer.android.com/about/versions/android-4.2.html|archive-date=26 August 2013|df=dmy-all}}</ref> RenderScript has since been deprecated in favour of first OpenGL compute shaders<ref>{{cite web | url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-gles | title=Migrate scripts to OpenGL ES 3.1 }}</ref> and later Vulkan Compute.<ref>{{cite web | url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-vulkan | title=Migrate scripts to Vulkan }}</ref> OpenCL is available on many Android devices, but is not officially supported by Android.<ref>{{cite web|url=https://khronos.org/blog/catching-up-with-khronos-experts-qa-on-opencl-3.0-and-sycl-2020|title=Catching Up with Khronos: Experts Q&A on OpenCL 3.0 and SYCL 2020|website=Khronos Group}}</ref>
==Hardware support==
===Vectorization===
{{See also|Vector_processor#GPU_vector_processing_features|SIMD|SWAR|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section|date=July 2017}}
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline|date=July 2025}} For example, if one color {{angbr|R1, G1, B1}} is to be modulated by another color {{angbr|R2, G2, B2}}, the GPU can produce the resulting color {{angbr|R1*R2, G1*G2, B1*B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates.
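For illustration, a minimal CUDA sketch (names are illustrative) of the color-modulation example above, using the built-in <code>float4</code> vector type so that each thread handles all four channels of one pixel at once:

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Each thread modulates one RGBA pixel:
// <R1*R2, G1*G2, B1*B2, A1*A2>.
__global__ void modulate(const float4* c1, const float4* c2,
                         float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 a = c1[i];
        float4 b = c2[i];
        out[i] = make_float4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
    }
}
</syntaxhighlight>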
==GPU vs. CPU==
A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.
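A minimal CUDA sketch of the first example (names are illustrative): each thread folds its pixel's luminance into one accumulator, which the CPU then reads back and divides by the pixel count to obtain the average lighting value.

<syntaxhighlight lang="cuda">
// Sums the luminance of every pixel into *sum; the CPU reads the
// result back and divides by n to get the average brightness.
__global__ void sum_luminance(const float4* pixels, float* sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pixels[i];
        // Rec. 709 luma weights for the R, G and B channels.
        float luma = 0.2126f * p.x + 0.7152f * p.y + 0.0722f * p.z;
        // A production kernel would use a tree reduction instead of
        // one global atomic per pixel.
        atomicAdd(sum, luma);
    }
}
</syntaxhighlight>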
GPGPU is fundamentally a software concept, not a hardware concept; it is a type of [[algorithm]], not a piece of equipment. Specialized equipment designs may, however, further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data.
===Caches===
===Register file===
GPUs have very large [[Register file|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations; e.g., the total register file sizes on Maxwell (GM200), Pascal and Volta GPUs are 6 MiB, 14 MiB and 20 MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ |date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ |date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register|register file on CPUs]] is small, typically tens or hundreds of kilobytes.
In essence, almost all GPU workloads, such as [[tiled rendering]], are inherently massively parallel and LOAD-COMPUTE-STORE in nature. Even storing one temporary vector for later recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive, due to the [[Random-access_memory#Memory_wall|memory wall]] problem, that it is to be avoided wherever possible.<ref>{{cite book | last1=Li | first1=Jie | last2=Michelogiannakis | first2=George | last3=Cook | first3=Brandon | last4=Cooray | first4=Dulanya | last5=Chen | first5=Yong | title=High Performance Computing | chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter | series=Lecture Notes in Computer Science | date=2023 | volume=13948 | pages=297–316 | doi=10.1007/978-3-031-32041-5_16 | isbn=978-3-031-32040-8 | chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs this problem can be mitigated with [[Cache (computing)|caches]] (a [[D-cache]]), but these are relatively large and thus impractical in GPUs, which would need one per processing element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per processing element (a PEM), a strategy later copied by the [[Flynn%27s_taxonomy#Associative_processor|Aspex ASP]].
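A minimal CUDA sketch (illustrative names) of why such workloads are written this way: fusing two passes into one kernel keeps the intermediate value in a register, so the temporary vector never round-trips through global memory.

<syntaxhighlight lang="cuda">
// Two-pass version: tmp[] round-trips through global memory
// (LOAD-COMPUTE-STORE, then LOAD-COMPUTE-STORE again).
__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}
__global__ void add(const float* tmp, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];
}

// Fused version: the intermediate a*x[i] lives only in a register,
// avoiding the memory-wall cost of storing and reloading it.
__global__ void scale_add(const float* x, const float* y,
                          float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
</syntaxhighlight>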
===Energy efficiency===
=== Linear algebra ===
Using GPUs for numerical linear algebra began at least in 2001.<ref>{{Cite conference |last1=Larsen |first1=E. Scott |last2=McAllister |first2=David |year=2001 |title=Fast matrix multiplies using graphics hardware |conference=Proceedings of the 2001 ACM/IEEE Conference on Supercomputing |publisher=ACM}}</ref>
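As an illustration of the kind of workload involved, a minimal CUDA sketch (illustrative names) of a dense matrix multiply, where each thread computes one output element; production code would instead call a tuned library such as Nvidia's cuBLAS.

<syntaxhighlight lang="cuda">
// Naive dense matrix multiply C = A × B for row-major n×n matrices.
__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}
</syntaxhighlight>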
==Stream processing==
====Flow control====
For technical background on this topic, see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]] and ILLIAC IV [[ILLIAC IV#Branches|"branching"]] (the term "predicate mask" did not yet exist in 1967).
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
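A minimal CUDA sketch (illustrative names) contrasting a divergent branch with the branch-free conditional write described above:

<syntaxhighlight lang="cuda">
__global__ void threshold(const float* in, float* out, float t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent branch: threads within one warp may take different
    // paths, which the hardware serializes or predicates away.
    if (in[i] > t)
        out[i] = 1.0f;
    else
        out[i] = 0.0f;

    // Equivalent branch-free form, in the older style of crafting
    // conditionals out of arithmetic; typically compiles to a
    // predicated select with no actual branch:
    // out[i] = (in[i] > t) * 1.0f;  // comparison yields 0 or 1
}
</syntaxhighlight>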
* [[Automatic parallelization]]<ref>Leung, Alan, Ondřej Lhoták, and Ghulam Lashari. "[https://cormack.uwaterloo.ca/~olhotak/pubs/pppj09.pdf Automatic parallelization for graphics processing units]." Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. ACM, 2009.</ref><ref>Henriksen, Troels, Martin Elsman, and Cosmin E. Oancea. "[https://futhark-lang.org/publications/fhpc14.pdf Size slicing: a hybrid approach to size inference in futhark]." Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing. ACM, 2014.</ref><ref>{{Cite book |chapter-url=https://www.researchgate.net/publication/221235428 |doi=10.1145/1375527.1375562|chapter=A compiler framework for optimization of affine loop nests for gpgpus |title=Proceedings of the 22nd annual international conference on Supercomputing - ICS '08 |year=2008 |last1=Baskaran |first1=Muthu Manikandan |last2=Bondhugula |first2=Uday |last3=Krishnamoorthy |first3=Sriram |last4=Ramanujam |first4=J. |last5=Rountev |first5=Atanas |last6=Sadayappan |first6=P. |page=225 |isbn=9781605581583 |s2cid=6137960 }}</ref>
* [[Computational physics|Physical based simulation]] and [[physics engine]]s<ref name="Joselli">
** [[Conway's Game of Life]], [[cloth simulation]], fluid [[incompressible flow]] by solution of [[Euler equations (fluid dynamics)]]<ref>{{cite web|url=https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-30-real-time-simulation-and-rendering-3d-fluids|title=K. Crane, I. Llamas, S. Tariq, 2008. Real-Time Simulation and Rendering of 3D Fluids. In Nvidia: GPU Gems 3, Chapter 30.}}</ref> or [[Navier–Stokes equations]]<ref>{{cite web|url=http://developer.nvidia.com/GPUGems/gpugems_ch38.html|title=M. Harris, 2004. Fast Fluid Dynamics Simulation on the GPU. In Nvidia: GPU Gems, Chapter 38.|work=NVIDIA Developer |url-status=live|archive-url=https://web.archive.org/web/20171007170306/https://developer.nvidia.com/GPUGems/gpugems_ch38.html|archive-date=7 October 2017|df=dmy-all}}</ref>
* [[Statistical physics]]
** [[Ising model]]<ref>{{cite journal | arxiv=1007.3726 | doi=10.1016/j.cpc.2010.05.005 | title=Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model | year=2010 | last1=Block | first1=Benjamin | last2=Virnau | first2=Peter | last3=Preis | first3=Tobias | journal=Computer Physics Communications | volume=181 | issue=9 | pages=1549–1556 | bibcode=2010CoPhC.181.1549B | s2cid=14828005 }}</ref>
* [[Level set methods]]
* [[Computed tomography|CT]] reconstruction<ref>Jimenez, Edward S., and Laurel J. Orr. "[https://www.osti.gov/servlets/purl/1106909 Rethinking the union of computed tomography reconstruction and GPGPU computing]." Penetrating Radiation Systems and Applications XIV. Vol. 8854. International Society for Optics and Photonics, 2013.</ref>
* [[Fast Fourier transform]]<ref>{{Cite journal |url=https://www.researchgate.net/publication/5462925 |doi=10.1109/TMI.2007.909834 |title=Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware |year=2008 |last1=Sorensen |first1=T.S. |last2=Schaeffter |first2=T. |last3=Noe |first3=K.O. |last4=Hansen |first4=M.S. |journal=IEEE Transactions on Medical Imaging |volume=27 |issue=4 |pages=538–547 |pmid=18390350 |bibcode=2008ITMI...27..538S |s2cid=206747049 }}</ref>
* GPU learning{{snd}} [[machine learning]] and [[data mining]] computations, e.g., with the BIDMach software
* [[k-nearest neighbor algorithm]]<ref>{{cite arXiv | eprint=0804.1448 | last1=Garcia | first1=Vincent | last2=Debreuve | first2=Eric | last3=Barlaud | first3=Michel | title=Fast k Nearest Neighbor Search using GPU | year=2008 | class=cs.CV }}</ref>
* [[Number theory]]
** [[Primality test]]ing and [[integer factorization]]<ref>{{cite web|url=https://mersenne.org/various/works.php|title=How GIMPS Works|work=Great Internet Mersenne Prime Search|access-date=6 March 2025}}</ref>
* [[Bioinformatics]]<ref>{{cite journal|doi=10.1186/1471-2105-8-474|pmid=18070356|pmc=2222658|title=High-throughput sequence alignment using Graphics Processing Units|journal=BMC Bioinformatics|volume=8|pages=474|year=2007|last1=Schatz|first1=Michael C.|last2=Trapnell|first2=Cole|last3=Delcher|first3=Arthur L.|last4=Varshney|first4=Amitabh|doi-access=free}}</ref>
* [[Medical imaging]]
* [[Clinical decision support system]] (CDSS)<ref>{{cite journal|last1=Olejnik|first1=M|last2=Steuwer|first2=M|last3=Gorlatch|first3=S|last4=Heider|first4=D|title=gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.|journal=Bioinformatics|date=15 November 2014|volume=30|issue=22|pages=3272–3|pmid=25123901|doi=10.1093/bioinformatics/btu535|doi-access=free}}</ref>
** [[Advanced Simulation Library]]
** [[Physics processing unit]] (PPU)
* {{Annotated link|Vector processor}}
* {{Annotated link|Single instruction, multiple threads}}
==References==
== Further reading ==
* {{Cite journal |last1=Owens |first1=J.D. |last2=Houston |first2=M. |last3=Luebke |first3=D. |last4=Green |first4=S. |last5=Stone |first5=J.E. |last6=Phillips |first6=J.C. |date=May 2008 |title=GPU Computing |journal=Proceedings of the IEEE |volume=96 |issue=5 |pages=879–899 |doi=10.1109/JPROC.2008.917757}}
* {{Cite journal |last1=Brodtkorb |first1=André R. |last2=Hagen |first2=Trond R. |last3=Sætra |first3=Martin L. |date=2013-01-01 |title=Graphics processing unit (GPU) programming strategies and trends in GPU computing |url=https://www.sciencedirect.com/science/article/pii/S0743731512000998 |journal=Journal of Parallel and Distributed Computing |series=Metaheuristics on GPUs |volume=73 |issue=1 |pages=4–13 |doi=10.1016/j.jpdc.2012.04.003 |issn=0743-7315|hdl=10852/40283 |hdl-access=free }}