General-purpose computing on graphics processing units: Difference between revisions

Revision as of 06:18, 16 February 2025 by Ixfd64 (→Mobile computers: cite); latest revision as of 10:11, 22 August 2025 by KylieTastic (no longer true)
(43 intermediate revisions by 17 users not shown)
Line 1: {{Short description\|Use of a GPU for computations typically assigned to CPUs}} {{Use dmy dates\|date=January 2015}} {{More citations needed\|date=February 2022}} '''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference \|last1=Fung \|first1=James \|last2=Tang \|first2=Felix \|last3=Mann \|first3=Steve \|date=7–10 October 2002 \|title=Mediated Reality Using Computer Graphics Hardware for Computer Vision \|url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf \|conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) \|location=Seattle, Washington, USA \|pages=83–89 \|archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf \|archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal \| url=https://link.springer.com/article/10.1007/s00779-003-0239-6 \| doi=10.1007/s00779-003-0239-6 \| title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality \| year=2003 \| last1=Aimone \| first1=Chris \| last2=Fung \| first2=James \| last3=Mann \| first3=Steve \| journal=Personal and Ubiquitous Computing \| volume=7 \| issue=5 \| pages=236–248 \| s2cid=25168728 \| url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive\|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf \|date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. 
V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive\|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf \|date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive\|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf \|date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref> Essentially, a GPGPU [[graphics pipeline\|pipeline]] is a kind of [[Parallel computing\|parallel processing]] between one or more GPUs and CPUs, with special accelerated instructions for processing image or other graphic forms of data. While GPUs operate at lower frequencies, they typically have many times the number of [[Single instruction, multiple threads\|Processing elements]]. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into parallel form and then using the GPU to process it can (theoretically) create a large [[speedup]]. GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s). 
From the [[history of supercomputing]] it is well-known that [[scientific computing]] drives the largest concentrations of Computing power in history, listed in the [[TOP500]]: the majority today utilize [[GPU]]s. The best-known GPGPUs are [[Nvidia Tesla]] that are used for [[Nvidia DGX]], alongside [[AMD Instinct]] and Intel Gaudi. ==History== Line 15 ⟶ 16: In principle, any arbitrary [[Boolean function]], including addition, multiplication, and other mathematical functions, can be built up from a [[functional completeness\|functionally complete]] set of logic operators. In 1987, [[Conway's Game of Life]] became one of the first examples of general-purpose computing using an early [[stream processing\|stream processor]] called a [[blitter]] to invoke a special sequence of [[bit blit\|logical operations]] on bit vectors.<ref>{{cite journal\|last=Hull\|first=Gerald\|title=LIFE\|journal=Amazing Computing\|volume=2\|issue=12\|pages=81–84\|date=December 1987\|url=https://archive.org/stream/amazing-computing-magazine-1987-12/Amazing_Computing_Vol_02_12_1987_Dec#page/n81/mode/2up}}</ref> General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable [[shader]]s and [[floating point]] support on graphics processors. Notably, problems involving [[matrix (mathematics)\|matrices]] and/or [[vector (mathematics and physics)\|vector]]s{{snd}} especially two-, three-, or four-dimensional vectors{{snd}} were easy to translate to a GPU, which acts with native speed and support on those types. 
A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.<ref>{{Cite journal \|last1=Krüger \|first1=Jens \|last2=Westermann \|first2=Rüdiger \|date=July 2003 \|title=Linear algebra operators for GPU implementation of numerical algorithms \|url=https://dl.acm.org/doi/10.1145/882262.882363 \|journal=ACM Transactions on Graphics \|language=en \|volume=22 \|issue=3 \|pages=908–916 \|doi=10.1145/882262.882363 \|issn=0730-0301\|url-access=subscription }}</ref><ref>{{Cite journal \|last1=Bolz \|first1=Jeff \|last2=Farmer \|first2=Ian \|last3=Grinspun \|first3=Eitan \|last4=Schröder \|first4=Peter \|date=July 2003 \|title=Sparse matrix solvers on the GPU: conjugate gradients and multigrid \|url=https://dl.acm.org/doi/10.1145/882262.882364 \|journal=ACM Transactions on Graphics \|language=en \|volume=22 \|issue=3 \|pages=917–924 \|doi=10.1145/882262.882364 \|issn=0730-0301\|url-access=subscription }}</ref> These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, [[OpenGL]] and [[DirectX]]. 
This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as [[Lib Sh\|Sh]]/[[RapidMind]], [[BrookGPU\|Brook]] and Accelerator.<ref>{{cite journal \|last1=Tarditi \|first1=David \|first2=Sidd \|last2=Puri \|first3=Jose \|last3=Oglesby \|title=Accelerator: using data parallelism to program GPUs for general-purpose uses \|journal=ACM SIGARCH Computer Architecture News \|volume=34 \|issue=5 \|date=2006\|url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f07/public/discussion-papers/26-tarditi-asplos06.pdf\|doi=10.1145/1168919.1168898 }}</ref><ref>{{cite journal \|last1=Che \|first1=Shuai \|last2=Boyer \|first2=Michael \|last3=Meng \|first3=Jiayuan \|last4=Tarjan \|first4=D. \|last5=Sheaffer \|first5=Jeremy W. \|last6=Skadron \|first6=Kevin \|title=A performance study of general-purpose applications on graphics processors using CUDA \|journal=J. Parallel and Distributed Computing \|volume=68 \|issue=10 \|date=2008 \|pages=1370–1380 \|doi=10.1016/j.jpdc.2008.05.014 \|df=dmy-all \|citeseerx=10.1.1.143.4849 }}</ref><ref>{{cite journal \|last1=Glaser \|first1=J. \|last2=Nguyen \|first2=T. D. \|last3=Anderson \|first3=J. A. \|last4=Lui \|first4=P. \|last5=Spiga \|first5=F. \|last6=Millan \|first6=J. A. \|last7=Morse \|first7=D. C. \|last8=Glotzer \|first8=S. C. 
\|date=2015 \|title=Strong scaling of general-purpose molecular dynamics simulations on GPUs \|journal=Computer Physics Communications \|volume=192 \|pages=97–107 \| doi=10.1016/j.cpc.2015.02.028\|arxiv=1412.3387 \|bibcode=2015CoPhC.192...97G \| doi-access=free}}</ref> These were followed by Nvidia's [[CUDA]], which allowed programmers to ignore the underlying graphical concepts in favor of more common [[high-performance computing]] concepts.<ref name="du">{{Cite journal \|doi= 10.1016/j.parco.2011.10.002 \|title= From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming \|journal= Parallel Computing \|volume= 38 \|issue= 8 \|pages= 391–407 \|year= 2012 \|last1= Du \|first1= Peng \|last2= Weber \|first2= Rick \|last3= Luszczek \|first3= Piotr \|last4= Tomov \|first4= Stanimire \|last5= Peterson \|first5= Gregory \|last6= Dongarra \|first6= Jack \|author-link6= Jack Dongarra \|df= dmy-all \|citeseerx= 10.1.1.193.7712 }}</ref> Newer, hardware-vendor-independent offerings include Microsoft's [[DirectCompute]] and Apple/Khronos Group's [[OpenCL]].<ref name="du"/> This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form. Line 22 ⟶ 23: ==Implementations== ===Software libraries and APIs=== Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values, can create a GPGPU framework. Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]]. Line 30 ⟶ 32: [[ROCm]], launched in 2016, is AMD's open-source response to CUDA. 
As of 2022, it is on par with CUDA with regard to features,{{source?\|date=December 2024}} but still lacks consumer support.{{source?\|date=December 2024}} OpenVIDIA was developed at [[University of Toronto]] between 2003 and 2005,<ref name="Fung">{{cite book \| last1 = Fung \| first1 = James \| last2 = Mann \| first2 = Steve \| author-link2 = Steve Mann (inventor) \| last3 = Aimone \| first3 = Chris \| chapter = OpenVIDIA: Parallel GPU computer vision \| title = Proceedings of the 13th annual ACM international conference on Multimedia \| publication-date = 6 November 2005 \| date = 6–11 November 2005 \| isbn = 1595930442 \| publisher = [[Association for Computing Machinery]] \| location = Singapore \| doi = 10.1145/1101149.1101334 \| pages = 849–852 \| accessdate = 18 March 2025 \| chapter-url = http://www.eyetap.org/papers/docs/oss1-fung.pdf \| archive-url = https://web.archive.org/web/20191223164955/http://www.eyetap.org/papers/docs/oss1-fung.pdf \| archive-date = 23 December 2019 }}</ref> in collaboration with Nvidia. 
Altimesh Hybridizer created by [[Altimesh]] compiles [[Common Intermediate Language]] to CUDA binaries.<ref>{{cite web\|title=Hybridizer\|url=http://www.altimesh.com/hybridizer-essentials/\|website=Hybridizer\|url-status=live\|archive-url=https://web.archive.org/web/20171017150337/http://www.altimesh.com/hybridizer-essentials/\|archive-date=17 October 2017\|df=dmy-all}}</ref><ref>{{cite web\|title=Home page\|url=http://www.altimesh.com/\|website=Altimesh\|url-status=live\|archive-url=https://web.archive.org/web/20171017145518/http://www.altimesh.com/\|archive-date=17 October 2017\|df=dmy-all}}</ref> It supports generics and virtual functions.<ref>{{cite web\|title=Hybridizer generics and inheritance\|url=http://www.altimesh.com/generics-and-inheritance/\|url-status=live\|archive-url=https://web.archive.org/web/20171017145927/http://www.altimesh.com/generics-and-inheritance/\|archive-date=17 October 2017\|df=dmy-all\|date=2017-07-27}}</ref> Debugging and profiling is integrated with [[Visual Studio]] and Nsight.<ref>{{cite web\|title=Debugging and Profiling with Hybridizer\|url=http://www.altimesh.com/debugging-and-profiling/\|url-status=live\|archive-url=https://web.archive.org/web/20171017201449/http://www.altimesh.com/debugging-and-profiling/\|archive-date=17 October 2017\|df=dmy-all\|date=2017-06-05}}</ref> It is available as a Visual Studio extension on Visual Studio Marketplace. Line 47 ⟶ 49: Due to a trend of increasing power of mobile GPUs, general-purpose programming became available also on the mobile devices running major [[mobile operating system]]s. 
[[Google]] [[Android (operating system)\|Android]] 4.2 enabled running [[RenderScript]] code on the mobile device GPU.<ref>{{cite web\|url=http://developer.android.com/about/versions/android-4.2.html\|title=Android 4.2 APIs - Android Developers\|website=developer.android.com\|url-status=live\|archive-url=https://web.archive.org/web/20130826191621/http://developer.android.com/about/versions/android-4.2.html\|archive-date=26 August 2013\|df=dmy-all}}</ref> Renderscript has since been deprecated in favour of first OpenGL compute shaders<ref>{{cite web \| url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-gles \| title=Migrate scripts to OpenGL ES 3.1 }}</ref> and later Vulkan Compute.<ref>{{cite web \| url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-vulkan \| title=Migrate scripts to Vulkan }}</ref> OpenCL is available on many Android devices, but is not officially supported by Android.<ref>{{cite web\|url=https://khronos.org/blog/catching-up-with-khronos-experts-qa-on-opencl-3.0-and-sycl-2020\|title=Catching Up with Khronos: Experts' Q&A on OpenCL 3.0 and SYCL 2020\|last=McIntosh-Smith\|first=Simon\|date=2020-07-15\|publisher=The Khronos Group\|access-date=16 February 2025}}</ref> [[Apple Inc.\|Apple]] introduced the proprietary [[Metal (API)\|Metal]] API for [[iOS]] applications, able to execute arbitrary code through Apple's GPU compute shaders.{{fact\|date=June 2024}} ==Hardware support== Line 68 ⟶ 70: ===Vectorization=== {{See also\|Vector_processor#GPU_vector_processing_features\|SIMD\|SWAR\|Single instruction, multiple threads{{!}}SIMT}} {{Unreferenced section\|date=July 2017}} Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline\|date=July 2025}} For example, if one color {{angbr\|R1, G1, B1}} is to be modulated by another color {{angbr\|R2, G2, B2}}, the GPU can produce the resulting color {{angbr\|R1R2, G1G2, 
B1B2}} in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed\|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions, termed single instruction, multiple data ([[Single instruction, multiple data\|SIMD]]), have long been available on CPUs.{{citation needed\|date=July 2017}} ==GPU vs. CPU== Line 82 ⟶ 85: A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator\|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question. GPGPU as a software concept is a type of [[algorithm]], not a piece of equipment. Specialized equipment designs may, however, even further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. 
Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a ''rack''), which adds a third layer{{snd}} many computing units each using many CPUs to correspond to many GPUs. Some [[Bitcoin]] "miners" used such setups for high-quantity processing. Insights into the largest such systems in the world have been maintained at the [[TOP500]] supercomputer list. ===Caches=== Line 89 ⟶ 92: ===Register file=== GPUs have very large [[Register file\|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file sizes on Maxwell (GM200), Pascal and Volta GPUs are 6 MiB, 14 MiB and 20 MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive\|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ \|date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive\|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ \|date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register\|register file on CPUs]] is small, typically tens or hundreds of kilobytes. In essence: almost all GPU workloads are inherently massively-parallel LOAD-COMPUTE-STORE in nature, such as [[Tiled rendering]]. 
Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall\|Memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book \| last1=Li \| first1=Jie \| last2=Michelogiannakis \| first2=George \| last3=Cook \| first3=Brandon \| last4=Cooray \| first4=Dulanya \| last5=Chen \| first5=Yong \| title=High Performance Computing \| chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter \| series=Lecture Notes in Computer Science \| date=2023 \| volume=13948 \| pages=297–316 \| doi=10.1007/978-3-031-32041-5_16 \| isbn=978-3-031-32040-8 \| chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)\|caches]] (a [[D-cache]]) to solve this problem; however, these are relatively so large that they are impractical to introduce in GPUs, which would need one per Processing Element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per Processing Element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor\|Aspex ASP]]. ===Energy efficiency=== Line 105 ⟶ 110: === Linear algebra === The use of GPUs for numerical linear algebra began at least as early as 2001.<ref>{{Cite book \|last1=Larsen \|first1=E. Scott \|last2=McAllister \|first2=David \|chapter=Fast matrix multiplies using graphics hardware \|date=2001-11-10 \|title=Proceedings of the 2001 ACM/IEEE conference on Supercomputing \|chapter-url=https://dl.acm.org/doi/10.1145/582034.582089 \|language=en \|publisher=ACM \|pages=55 \|doi=10.1145/582034.582089 \|isbn=978-1-58113-293-9}}</ref> GPUs have been used for Gauss-Seidel solvers, conjugate gradients, etc.<ref>{{Cite book \|last1=Krüger \|first1=Jens \|last2=Westermann \|first2=Rüdiger \|title=ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05 \|date=2005 \|chapter=Linear algebra operators for GPU implementation of numerical algorithms \|chapter-url=http://portal.acm.org/citation.cfm?doid=1198555.1198795 \|language=en \|publisher=ACM Press \|pages=234 \|doi=10.1145/1198555.1198795}}</ref> ==Stream processing== Line 163 ⟶ 168: ====Flow control==== For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches\|"branching"]] (the term "predicate mask" did not exist in 1967). In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web\|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html\|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible. 
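The idea of a conditional write built from arithmetic or bit operations, as described above, can be illustrated with a minimal sketch. It is written in plain Python rather than shader code, and the function names `select_arithmetic`, `select_bitwise` and `clamp_negative_to_zero` are hypothetical, chosen only for this illustration. The sketch assumes integer (or finite numeric) operands; it is not a description of any particular GPU's instruction set.

```python
def select_arithmetic(cond, a, b):
    """Return a if cond is true, else b, using only arithmetic on a 0/1 mask."""
    m = int(bool(cond))            # 1 when the condition holds, otherwise 0
    return m * a + (1 - m) * b     # exactly one of the two terms survives


def select_bitwise(cond, a, b):
    """Same selection for integers, using an all-ones / all-zeros bit mask."""
    mask = -int(bool(cond))        # -1 (all bits set) or 0
    return (a & mask) | (b & ~mask)


def clamp_negative_to_zero(values):
    """Per-element 'if v >= 0 then v else 0' written without an if statement."""
    return [select_arithmetic(v >= 0, v, 0) for v in values]


if __name__ == "__main__":
    print(select_arithmetic(True, 3, 7))
    print(select_bitwise(False, 3, 7))
    print(clamp_negative_to_zero([-2, 5, 0, -1]))
```

Both branches are always evaluated and the mask decides which result is kept, which is why this style works on hardware that cannot branch per element. The arithmetic form is unsuitable when the discarded operand may be infinite or NaN, since multiplying such a value by zero does not yield zero.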
Line 181 ⟶ 188: ====Scan==== The scan operation, also termed ''[[prefix sum#Parallel algorithm\|parallel prefix sum]]'', takes in a vector (stream) of data elements and an [[monoid\|(arbitrary) associative binary function '+' with an identity element 'i']]. If the input is [a0, a1, a2, a3, ...], an ''exclusive scan'' produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an ''inclusive scan'' produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and [[semigroup\|does not require an identity]] to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication.<ref name=goddeke2010 /><ref>{{cite web\|url=http://www.idav.ucdavis.edu/func/return_pdf?pub_id=915\|title=S. Sengupta, M. Harris, Y. Zhang, J. D. Owens, 2007. Scan primitives for GPU computing. In T. Aila and M. Segal (eds.): Graphics Hardware (2007).\|url-status=dead\|archive-url=https://web.archive.org/web/20150605081020/http://www.idav.ucdavis.edu/func/return_pdf?pub_id=915\|archive-date=5 June 2015\|df=dmy-all\|access-date=16 December 2014}}</ref><ref>{{cite journal \| last1 = Blelloch \| first1 = G. E. \| year = 1989 \| title = Scans as primitive parallel operations \| url = http://www.cs.berkeley.edu/~knight/cs267/papers/scan_primitive.pdf \| journal = IEEE Transactions on Computers \| volume = 38 \| issue = 11 \| pages = 1526–1538 \| doi = 10.1109/12.42122 \| url-status = dead \| archive-url = https://web.archive.org/web/20150923211604/http://www.cs.berkeley.edu/~knight/cs267/papers/scan_primitive.pdf \| archive-date = 23 September 2015 \| df = dmy-all \| access-date = 16 December 2014 }}</ref><ref>{{cite web\|url=https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda\|title=M. Harris, S. 
Sengupta, J. D. Owens. Parallel Prefix Sum (Scan) with CUDA. In Nvidia: GPU Gems 3, Chapter 39.}}</ref> ====Scatter==== Line 210 ⟶ 217: [[Automatic parallelization]]<ref>Leung, Alan, Ondřej Lhoták, and Ghulam Lashari. "[https://cormack.uwaterloo.ca/~olhotak/pubs/pppj09.pdf Automatic parallelization for graphics processing units]." Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. ACM, 2009.</ref><ref>Henriksen, Troels, Martin Elsman, and Cosmin E. Oancea. "[https://futhark-lang.org/publications/fhpc14.pdf Size slicing: a hybrid approach to size inference in futhark]." Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing. ACM, 2014.</ref><ref>{{Cite book \|chapter-url=https://www.researchgate.net/publication/221235428 \|doi=10.1145/1375527.1375562\|chapter=A compiler framework for optimization of affine loop nests for gpgpus \|title=Proceedings of the 22nd annual international conference on Supercomputing - ICS '08 \|year=2008 \|last1=Baskaran \|first1=Muthu Manikandan \|last2=Bondhugula \|first2=Uday \|last3=Krishnamoorthy \|first3=Sriram \|last4=Ramanujam \|first4=J. \|last5=Rountev \|first5=Atanas \|last6=Sadayappan \|first6=P. \|page=225 \|isbn=9781605581583 \|s2cid=6137960 }}</ref> * [[Computational physics\|Physical based simulation]] and [[physics engine]]s<ref name="Joselli">{{cite book \| chapter-url=https://dl.acm.org/doi/10.1145/1401843.1401871 \| doi=10.1145/1401843.1401871 \| chapter=A new physics engine with automatic process distribution between CPU-GPU \| title=Proceedings of the 2008 ACM SIGGRAPH symposium on Video games \| date=2008 
\| last1=Joselli \| first1=Mark \| last2=Clua \| first2=Esteban \| last3=Montenegro \| first3=Anselmo \| last4=Conci \| first4=Aura \| last5=Pagliosa \| first5=Paulo \| pages=149–156 \| isbn=978-1-60558-173-6 }}</ref> (usually based on [[Newtonian physics]] models) ** [[Conway's Game of Life]], [[cloth simulation]], fluid [[incompressible flow]] by solution of [[Euler equations (fluid dynamics)]]<ref>{{cite web\|url=https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-30-real-time-simulation-and-rendering-3d-fluids\|title=K. Crane, I. Llamas, S. Tariq, 2008. Real-Time Simulation and Rendering of 3D Fluids. In Nvidia: GPU Gems 3, Chapter 30.}}</ref> or [[Navier–Stokes equations]]<ref>{{cite web\|url=http://developer.nvidia.com/GPUGems/gpugems_ch38.html\|title=M. Harris, 2004. Fast Fluid Dynamics Simulation on the GPU. In Nvidia: GPU Gems, Chapter 38.\|work=NVIDIA Developer \|url-status=live\|archive-url=https://web.archive.org/web/20171007170306/https://developer.nvidia.com/GPUGems/gpugems_ch38.html\|archive-date=7 October 2017\|df=dmy-all}}</ref> * [[Statistical physics]] ** [[Ising model]]<ref>{{cite journal \| arxiv=1007.3726 \| doi=10.1016/j.cpc.2010.05.005 \| title=Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model \| year=2010 \| last1=Block \| first1=Benjamin \| last2=Virnau \| first2=Peter \| last3=Preis \| first3=Tobias \| journal=Computer Physics Communications \| volume=181 \| issue=9 \| pages=1549–1556 \| bibcode=2010CoPhC.181.1549B \| s2cid=14828005 }}</ref> * [[Lattice gauge theory]]<ref>{{cite web\|url=https://indico.fnal.gov/event/22303/contributions/245806/attachments/157699/206544/SnowmassTalk.pdf\|title=New Computational Trends in Lattice Gauge Theory\|last=Boyle\|first=Peter\|publisher=Lawrence Berkeley National Laboratory\|access-date=16 February 2025}}</ref> * [[Lattice 
gauge theory]]{{citation needed\|date=May 2019}} * [[Segmentation (image processing)\|Segmentation]]{{snd}} 2D and 3D<ref>{{cite journal \| pmc=3657761 \| year=2011 \| last1=Sun \| first1=S. \| last2=Bauer \| first2=C. \| last3=Beichel \| first3=R. \| title=Automated 3-D Segmentation of Lungs with Lung Cancer in CT Data Using a Novel Robust Active Shape Model Approach \| journal=IEEE Transactions on Medical Imaging \| volume=31 \| issue=2 \| pages=449–460 \| doi=10.1109/TMI.2011.2171357 \| pmid=21997248 }}</ref> * [[Level set methods]] * [[Computed tomography\|CT]] reconstruction<ref>Jimenez, Edward S., and Laurel J. Orr. "[https://www.osti.gov/servlets/purl/1106909 Rethinking the union of computed tomography reconstruction and GPGPU computing]." Penetrating Radiation Systems and Applications XIV. Vol. 8854. International Society for Optics and Photonics, 2013.</ref> * [[Fast Fourier transform]]<ref>{{Cite journal \|url=https://www.researchgate.net/publication/5462925 \|doi=10.1109/TMI.2007.909834 \|title=Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware \|year=2008 \|last1=Sorensen \|first1=T.S. \|last2=Schaeffter \|first2=T. \|last3=Noe \|first3=K.O. \|last4=Hansen \|first4=M.S. 
\|journal=IEEE Transactions on Medical Imaging \|volume=27 \|issue=4 \|pages=538–547 \|pmid=18390350 \|bibcode=2008ITMI...27..538S \|s2cid=206747049 }}</ref> * GPU learning{{snd}} [[machine learning]] and [[data mining]] computations, e.g., with software BIDMach * [[k-nearest neighbor algorithm]]<ref>{{cite arXiv \| eprint=0804.1448 \| last1=Garcia \| first1=Vincent \| last2=Debreuve \| first2=Eric \| last3=Barlaud \| first3=Michel \| title=Fast k Nearest Neighbor Search using GPU \| year=2008 \| class=cs.CV }}</ref> Line 263 ⟶ 270: [[Quantum mechanical]] physics [[Astrophysics]]<ref>{{cite web\|url=http://www.astro.lu.se/compugpu2010/\|title=Computational Physics with GPUs: Lund Observatory\|website=www.astro.lu.se\|url-status=live\|archive-url=https://web.archive.org/web/20100712062316/http://www.astro.lu.se/compugpu2010/\|archive-date=12 July 2010\|df=dmy-all}}</ref> * [[Number theory]] ** [[Primality test]]ing and [[integer factorization]]<ref>{{cite web\|url=https://mersenne.org/various/works.php\|title=How GIMPS Works\|work=Great Internet Mersenne Prime Search\|access-date=6 March 2025}}</ref> * [[Computational finance]] * [[Bioinformatics]]<ref>{{cite journal\|doi=10.1186/1471-2105-8-474\|pmid=18070356\|pmc=2222658\|title=High-throughput sequence alignment using Graphics Processing Units\|journal=BMC Bioinformatics\|volume=8\|article-number=474\|year=2007\|last1=Schatz\|first1=Michael C\|last2=Trapnell\|first2=Cole\|last3=Delcher\|first3=Arthur L\|last4=Varshney\|first4=Amitabh \|doi-access=free }}</ref><ref name=Manavski2008>{{cite journal \|author=Svetlin A. Manavski \|author2=Giorgio Valle \|title=CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment \|journal=BMC Bioinformatics \|volume=9 \|issue=Suppl. 2 \|page=S10 \|date=2008 \|doi=10.1186/1471-2105-9-s2-s10 \|pmid=18387198 \|pmc=2323659 \|df=dmy-all \|doi-access=free }}</ref> * [[Medical imaging]] * [[Clinical decision support system]] (CDSS)<ref>{{cite journal\|last1=Olejnik\|first1=M\|last2=Steuwer\|first2=M\|last3=Gorlatch\|first3=S\|last4=Heider\|first4=D\|title=gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.\|journal=Bioinformatics\|date=15 November 2014\|volume=30\|issue=22\|pages=3272–3\|pmid=25123901\|doi=10.1093/bioinformatics/btu535\|doi-access=free}}</ref> Line 393 ⟶ 401: [[Advanced Simulation Library]] [[Physics processing unit]] (PPU) * {{Annotated link\|Vector processor}} * {{Annotated link\|Single instruction, multiple threads}} ==References== Line 398 ⟶ 408: == Further reading == * {{Cite journal \|last1=Owens \|first1=J.D. \|last2=Houston \|first2=M. \|last3=Luebke \|first3=D. \|last4=Green \|first4=S. \|last5=Stone \|first5=J.E. 
\|last6=Phillips \|first6=J.C. \|date=May 2008 \|title=GPU Computing \|journal=Proceedings of the IEEE \|volume=96 \|issue=5 \|pages=879–899 \|doi=10.1109/JPROC.2008.917757 \|s2cid=17091128 \|issn=0018-9219}} * {{Cite journal \|last1=Brodtkorb \|first1=André R. \|last2=Hagen \|first2=Trond R. \|last3=Sætra \|first3=Martin L. \|date=2013-01-01 \|title=Graphics processing unit (GPU) programming strategies and trends in GPU computing \|url=https://www.sciencedirect.com/science/article/pii/S0743731512000998 \|journal=Journal of Parallel and Distributed Computing \|series=Metaheuristics on GPUs \|volume=73 \|issue=1 \|pages=4–13 \|doi=10.1016/j.jpdc.2012.04.003 \|issn=0743-7315\|hdl=10852/40283 \|hdl-access=free }}