General-purpose computing on graphics processing units

{{Short description\|Use of a GPU for computations typically assigned to CPUs}}
{{Use dmy dates\|date=January 2015}}
{{More citations needed\|date=February 2022}}

'''General-purpose computing on graphics processing units''' ('''GPGPU''', or less often '''GPGP''') is the use of a [[graphics processing unit]] (GPU), which typically handles computation only for [[computer graphics]], to perform computation in applications traditionally handled by the [[central processing unit]] (CPU).<ref>{{Cite conference \|last1=Fung \|first1=James \|last2=Tang \|first2=Felix \|last3=Mann \|first3=Steve \|date=7–10 October 2002 \|title=Mediated Reality Using Computer Graphics Hardware for Computer Vision \|url=http://www.eyetap.org/papers/docs/iswc02-fung.pdf \|conference=Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002) \|location=Seattle, Washington, USA \|pages=83–89 \|archive-url=https://web.archive.org/web/20120402173637/http://www.eyetap.org/~fungja/glorbits_final.pdf \|archive-date=2 April 2012}}</ref><ref name="Aimone">{{cite journal \| url=https://link.springer.com/article/10.1007/s00779-003-0239-6 \| doi=10.1007/s00779-003-0239-6 \| title=An Eye ''Tap'' video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality \| year=2003 \| last1=Aimone \| first1=Chris \| last2=Fung \| first2=James \| last3=Mann \| first3=Steve \| journal=Personal and Ubiquitous Computing \| volume=7 \| issue=5 \| pages=236–248 \| s2cid=25168728 \| url-access=subscription }}</ref><ref>[http://www.eyetap.org/papers/docs/procicassp2004.pdf "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004)] {{webarchive\|url=https://web.archive.org/web/20110819000326/http://www.eyetap.org/papers/docs/procicassp2004.pdf \|date=19 August 2011 }}: Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96</ref><ref>Chitty, D. M. (2007, July). [https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf A data parallel approach to genetic programming using programmable graphics hardware] {{webarchive\|url=https://web.archive.org/web/20170808190114/https://www.cs.york.ac.uk/rts/docs/GECCO_2007/docs/p1566.pdf \|date=8 August 2017 }}. In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566–1573). ACM.</ref> The use of multiple [[video card]]s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.<ref>[http://eyetap.org/papers/docs/procicpr2004.pdf "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004)] {{webarchive\|url=https://web.archive.org/web/20110718193841/http://eyetap.org/papers/docs/procicpr2004.pdf \|date=18 July 2011 }}, Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.</ref>

Essentially, a GPGPU [[graphics pipeline\|pipeline]] is a kind of [[Parallel computing\|parallel processing]] between one or more GPUs and CPUs, with special accelerated instructions for processing image or other graphic forms of data.
While GPUs operate at lower frequencies than CPUs, they typically have many times the number of [[Single instruction, multiple threads\|processing elements]]. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into parallel form and then using the GPU to process it can (theoretically) create a large [[speedup]].

GPGPU pipelines were developed at the beginning of the 21st century for [[graphics processing]] (e.g. for better [[shader]]s). From the [[history of supercomputing]] it is well known that [[scientific computing]] drives the largest concentrations of computing power, as listed in the [[TOP500]]: the majority today utilize [[GPU]]s. The best-known GPGPUs are the [[Nvidia Tesla]] accelerators used in [[Nvidia DGX]] systems, alongside [[AMD Instinct]] and Intel Gaudi.

==History==

In principle, any arbitrary [[Boolean function]], including addition, multiplication, and other mathematical functions, can be built up from a [[functional completeness\|functionally complete]] set of logic operators. In 1987, [[Conway's Game of Life]] became one of the first examples of general-purpose computing using an early [[stream processing\|stream processor]] called a [[blitter]] to invoke a special sequence of [[bit blit\|logical operations]] on bit vectors.<ref>{{cite journal\|last=Hull\|first=Gerald\|title=LIFE\|journal=Amazing Computing\|volume=2\|issue=12\|pages=81–84\|date=December 1987\|url=https://archive.org/stream/amazing-computing-magazine-1987-12/Amazing_Computing_Vol_02_12_1987_Dec#page/n81/mode/2up}}</ref>

General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable [[shader]]s and [[floating point]] support on graphics processors.
Notably, problems involving [[matrix (mathematics)\|matrices]] and/or [[vector (mathematics and physics)\|vector]]s{{snd}} especially two-, three-, or four-dimensional vectors{{snd}} were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.<ref>{{Cite journal \|last1=Krüger \|first1=Jens \|last2=Westermann \|first2=Rüdiger \|date=July 2003 \|title=Linear algebra operators for GPU implementation of numerical algorithms \|url=https://dl.acm.org/doi/10.1145/882262.882363 \|journal=ACM Transactions on Graphics \|language=en \|volume=22 \|issue=3 \|pages=908–916 \|doi=10.1145/882262.882363 \|issn=0730-0301\|url-access=subscription }}</ref><ref>{{Cite journal \|last1=Bolz \|first1=Jeff \|last2=Farmer \|first2=Ian \|last3=Grinspun \|first3=Eitan \|last4=Schröder \|first4=Peter \|date=July 2003 \|title=Sparse matrix solvers on the GPU: conjugate gradients and multigrid \|url=https://dl.acm.org/doi/10.1145/882262.882364 \|journal=ACM Transactions on Graphics \|language=en \|volume=22 \|issue=3 \|pages=917–924 \|doi=10.1145/882262.882364 \|issn=0730-0301\|url-access=subscription }}</ref> These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, [[OpenGL]] and [[DirectX]]. 
This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as [[Lib Sh\|Sh]]/[[RapidMind]], [[BrookGPU\|Brook]] and Accelerator.<ref>{{cite journal \|last1=Tarditi \|first1=David \|first2=Sidd \|last2=Puri \|first3=Jose \|last3=Oglesby \|title=Accelerator: using data parallelism to program GPUs for general-purpose uses \|journal=ACM SIGARCH Computer Architecture News \|volume=34 \|issue=5 \|date=2006\|url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f07/public/discussion-papers/26-tarditi-asplos06.pdf\|doi=10.1145/1168919.1168898 }}</ref><ref>{{cite journal \|last1=Che \|first1=Shuai \|last2=Boyer \|first2=Michael \|last3=Meng \|first3=Jiayuan \|last4=Tarjan \|first4=D. \|last5=Sheaffer \|first5=Jeremy W. \|last6=Skadron \|first6=Kevin \|title=A performance study of general-purpose applications on graphics processors using CUDA \|journal=J. Parallel and Distributed Computing \|volume=68 \|issue=10 \|date=2008 \|pages=1370–1380 \|doi=10.1016/j.jpdc.2008.05.014 \|df=dmy-all \|citeseerx=10.1.1.143.4849 }}</ref><ref>{{cite journal \|last1=Glaser \|first1=J. \|last2=Nguyen \|first2=T. D. \|last3=Anderson \|first3=J. A. \|last4=Lui \|first4=P. \|last5=Spiga \|first5=F. \|last6=Millan \|first6=J. A. \|last7=Morse \|first7=D. C. \|last8=Glotzer \|first8=S. C. 
\|date=2015 \|title=Strong scaling of general-purpose molecular dynamics simulations on GPUs \|journal=Computer Physics Communications \|volume=192 \|pages=97–107 \| doi=10.1016/j.cpc.2015.02.028\|arxiv=1412.3387 \|bibcode=2015CoPhC.192...97G \| doi-access=free}}</ref> These were followed by Nvidia's [[CUDA]], which allowed programmers to ignore the underlying graphical concepts in favor of more common [[high-performance computing]] concepts.<ref name="du">{{Cite journal \|doi= 10.1016/j.parco.2011.10.002 \|title= From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming \|journal= Parallel Computing \|volume= 38 \|issue= 8 \|pages= 391–407 \|year= 2012 \|last1= Du \|first1= Peng \|last2= Weber \|first2= Rick \|last3= Luszczek \|first3= Piotr \|last4= Tomov \|first4= Stanimire \|last5= Peterson \|first5= Gregory \|last6= Dongarra \|first6= Jack \|author-link6= Jack Dongarra \|df= dmy-all \|citeseerx= 10.1.1.193.7712 }}</ref> Newer, hardware-vendor-independent offerings include Microsoft's [[DirectCompute]] and Apple/Khronos Group's [[OpenCL]].<ref name="du"/> This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.

Mark Harris, the founder of GPGPU.org, claims he coined the term ''GPGPU''.<ref>{{cite web\|url=https://developer.nvidia.com/blog/even-easier-introduction-cuda\|title=An Even Easier Introduction to CUDA\|last=Harris\|first=Mark\|date=2017-01-25\|work=Nvidia\|access-date=16 February 2025}}</ref>

==Implementations==

===Software libraries and APIs===

Any language that allows the code running on the CPU to poll a GPU [[shader]] for return values can create a GPGPU framework.
Programming standards for parallel computing include [[OpenCL]] (vendor-independent), [[OpenACC]], [[OpenMP]] and [[OpenHMPP]]. {{As of\|2016}}, OpenCL is the dominant open general-purpose GPU computing language, and is an open standard defined by the [[Khronos Group]].{{Citation needed\|date=September 2020\|reason=No source for claim of dominance at the time and possibly very outdated now}} OpenCL provides a [[cross-platform]] GPGPU platform that additionally supports data parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also standardised and implemented [[SYCL]], a higher-level programming model for [[OpenCL]] as a single-source domain-specific embedded language based on pure C++11.

The dominant proprietary framework is [[Nvidia]] [[CUDA]].<ref>{{cite web \|url=http://www.hpcwire.com/hpcwire/2012-02-28/opencl_gains_ground_on_cuda.html \|title=OpenCL Gains Ground on CUDA \|access-date=2012-04-10 \|url-status=live \|archive-url=https://web.archive.org/web/20120423060057/http://www.hpcwire.com/hpcwire/2012-02-28/opencl_gains_ground_on_cuda.html \|archive-date=23 April 2012 \|df=dmy-all \|date=2012-02-28 }} "As the two major programming frameworks for GPU computing, OpenCL and CUDA have been competing for mindshare in the developer community for the past few years."</ref> In 2006 Nvidia launched CUDA, a [[software development kit]] (SDK) and [[application programming interface]] (API) that allows using the programming language [[C (programming language)\|C]] to code algorithms for execution on [[GeForce 8 series]] and later GPUs.

[[ROCm]], launched in 2016, is AMD's open-source response to CUDA.
It is, as of 2022, on par with CUDA with regard to features,{{source?\|date=December 2024}} and still lacking in consumer support.{{source?\|date=December 2024}}

The ''{{visible anchor\|Xcelerit SDK}}'',<ref>{{cite web\|title=Xcelerit SDK\|url=https://www.xcelerit.com/products/xcelerit-sdk/\|website=XceleritSDK\|url-status=live\|archive-url=https://web.archive.org/web/20180308232059/https://www.xcelerit.com/products/xcelerit-sdk/\|archive-date=8 March 2018\|df=dmy-all\|date=2015-10-26}}</ref> created by [[Xcelerit]],<ref>{{cite web\|title=Home page\|url=https://www.xcelerit.com/\|website=Xcelerit\|url-status=live\|archive-url=https://web.archive.org/web/20180308232636/https://www.xcelerit.com/\|archive-date=8 March 2018\|df=dmy-all}}</ref> is designed to accelerate large existing [[C++]] or [[C Sharp (programming language)\|C#]] code-bases on [[GPUs]] with minimal effort. It provides a simplified programming model, automates parallelisation, manages devices and memory, and compiles to CUDA binaries. Additionally, multi-core [[CPUs]] and other accelerators can be targeted from the same source code.
OpenVIDIA was developed at [[University of Toronto]] between 2003 and 2005,<ref name="Fung">{{cite book \| last1 = Fung \| first1 = James \| last2 = Mann \| first2 = Steve \| author-link2 = Steve Mann (inventor) \| last3 = Aimone \| first3 = Chris \| chapter = OpenVIDIA: Parallel GPU computer vision \| title = Proceedings of the 13th annual ACM international conference on Multimedia \| publication-date = 6 November 2005 \| date = 6–11 November 2005 \| isbn = 1595930442 \| publisher = [[Association for Computing Machinery]] \| location = Singapore \| doi = 10.1145/1101149.1101334 \| pages = 849–852 \| accessdate = 18 March 2025 \| chapter-url = http://www.eyetap.org/papers/docs/oss1-fung.pdf \| archive-url = https://web.archive.org/web/20191223164955/http://www.eyetap.org/papers/docs/oss1-fung.pdf \| archive-date = 23 December 2019}}</ref> in collaboration with Nvidia.
Altimesh Hybridizer, created by [[Altimesh]], compiles [[Common Intermediate Language]] to CUDA binaries.<ref>{{cite web\|title=Hybridizer\|url=http://www.altimesh.com/hybridizer-essentials/\|website=Hybridizer\|url-status=live\|archive-url=https://web.archive.org/web/20171017150337/http://www.altimesh.com/hybridizer-essentials/\|archive-date=17 October 2017\|df=dmy-all}}</ref><ref>{{cite web\|title=Home page\|url=http://www.altimesh.com/\|website=Altimesh\|url-status=live\|archive-url=https://web.archive.org/web/20171017145518/http://www.altimesh.com/\|archive-date=17 October 2017\|df=dmy-all}}</ref> It supports generics and virtual functions.<ref>{{cite web\|title=Hybridizer generics and inheritance\|url=http://www.altimesh.com/generics-and-inheritance/\|url-status=live\|archive-url=https://web.archive.org/web/20171017145927/http://www.altimesh.com/generics-and-inheritance/\|archive-date=17 October 2017\|df=dmy-all\|date=2017-07-27}}</ref> Debugging and profiling are integrated with [[Visual Studio]] and Nsight.<ref>{{cite web\|title=Debugging and Profiling with Hybridizer\|url=http://www.altimesh.com/debugging-and-profiling/\|url-status=live\|archive-url=https://web.archive.org/web/20171017201449/http://www.altimesh.com/debugging-and-profiling/\|archive-date=17 October 2017\|df=dmy-all\|date=2017-06-05}}</ref> It is available as a Visual Studio extension on Visual Studio Marketplace.

[[Microsoft]] introduced the [[DirectCompute]] GPU computing API, released with the [[DirectX]] 11 API.
''{{visible anchor\|Alea GPU}}'',<ref>{{cite web\|title=Introduction\|url=http://www.aleagpu.com/release/3_0_2/doc/\|website=Alea GPU\|access-date=15 December 2016\|url-status=live\|archive-url=https://web.archive.org/web/20161225051728/http://www.aleagpu.com/release/3_0_2/doc/\|archive-date=25 December 2016\|df=dmy-all}}</ref> created by QuantAlea,<ref>{{cite web\|title=Home page\|url=http://www.quantalea.com/\|website=Quant Alea\|access-date=15 December 2016\|url-status=live\|archive-url=https://web.archive.org/web/20161212112729/http://www.quantalea.com/\|archive-date=12 December 2016\|df=dmy-all}}</ref> introduces native GPU computing capabilities for the Microsoft .NET languages [[F Sharp (programming language)\|F#]]<ref>{{cite web\|title=Use F# for GPU Programming\|url=http://fsharp.org/use/gpu/\|publisher=F# Software Foundation\|access-date=15 December 2016\|url-status=dead\|archive-url=https://web.archive.org/web/20161218090254/http://fsharp.org/use/gpu/\|archive-date=18 December 2016\|df=dmy-all}}</ref> and [[C Sharp (programming language)\|C#]].
Alea GPU also provides a simplified GPU programming model based on GPU parallel-for and parallel aggregate using delegates and automatic memory management.<ref>{{cite web \| url=http://www.quantalea.com/features \| website=Quant Alea \| title=Alea GPU Features \| access-date=15 December 2016 \| url-status=live \| archive-url=https://web.archive.org/web/20161221090440/http://quantalea.com/features/ \| archive-date=21 December 2016 \| df=dmy-all }}</ref>

[[MATLAB]] supports GPGPU acceleration using the ''Parallel Computing Toolbox'' and ''MATLAB Distributed Computing Server'',<ref>{{cite web\|title=MATLAB Adds GPGPU Support\|url=http://www.hpcwire.com/features/MATLAB-Adds-GPGPU-Support-103307084.html\|date=20 September 2010\|url-status=dead\|archive-url=https://web.archive.org/web/20100927155948/http://www.hpcwire.com/features/MATLAB-Adds-GPGPU-Support-103307084.html\|archive-date=27 September 2010\|df=dmy-all}}</ref> and third-party packages like [[Jacket (software)\|Jacket]].

Due to a trend of increasing power of mobile GPUs, general-purpose programming also became available on mobile devices running major [[mobile operating system]]s.
[[Google]] [[Android (operating system)\|Android]] 4.2 enabled running [[RenderScript]] code on the mobile device GPU.<ref>{{cite web\|url=http://developer.android.com/about/versions/android-4.2.html\|title=Android 4.2 APIs - Android Developers\|website=developer.android.com\|url-status=live\|archive-url=https://web.archive.org/web/20130826191621/http://developer.android.com/about/versions/android-4.2.html\|archive-date=26 August 2013\|df=dmy-all}}</ref> RenderScript has since been deprecated in favour of first OpenGL compute shaders<ref>{{cite web \| url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-gles \| title=Migrate scripts to OpenGL ES 3.1 }}</ref> and later Vulkan Compute.<ref>{{cite web \| url=https://developer.android.com/guide/topics/renderscript/migrate/migrate-vulkan \| title=Migrate scripts to Vulkan }}</ref> OpenCL is available on many Android devices, but is not officially supported by Android.<ref>{{cite web\|url=https://khronos.org/blog/catching-up-with-khronos-experts-qa-on-opencl-3.0-and-sycl-2020\|title=Catching Up with Khronos: Experts' Q&A on OpenCL 3.0 and SYCL 2020\|last=McIntosh-Smith\|first=Simon\|date=2020-07-15\|publisher=The Khronos Group\|access-date=16 February 2025}}</ref> [[Apple Inc.\|Apple]] introduced the proprietary [[Metal (API)\|Metal]] API for [[iOS]] applications, able to execute arbitrary code through Apple's GPU compute shaders.{{fact\|date=June 2024}}

==Hardware support==

===Integer numbers===

Pre-DirectX 9 video cards only supported [[Palette (computing)\|paletted]] or integer color types. Sometimes another alpha value is added, to be used for transparency.<!-- What about alpha? what about RG or R formats? Are we documenting texture formats or computing (always 4D)? "Transparency is graphics specific! Must it be cited? --> Common formats are:
* 8 bits per pixel – Sometimes palette mode, where each value is an index in a table with the real color value specified in one of the other formats. Sometimes three bits for red, three bits for green, and two bits for blue.

===Floating-point numbers===

For early [[fixed-function]] or limited programmability graphics (i.e., up to and including DirectX 8.1-compliant GPUs) this was sufficient because this is also the representation used in displays. This representation does have certain limitations. Given sufficient graphics processing power even graphics programmers would like to use better formats, such as [[floating point]] data formats, to obtain effects such as [[high-dynamic-range imaging]]. Many GPGPU applications require floating point accuracy, which came with video cards conforming to the DirectX 9 specification.

DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. [[ATI Technologies\|ATI's]] [[Radeon R300]] series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while [[Nvidia]]'s [[GeForce FX\|NV30]] series supported both FP16 and FP32; other vendors such as [[S3 Graphics]] and [[XGI Technology\|XGI]] supported a mixture of formats up to FP24.

The implementations of floating point on Nvidia GPUs are mostly [[IEEE floating-point standard\|IEEE]] compliant; however, this is not true across all vendors.<ref name="nVidiaIsIEEE">{{cite book \| chapter-url=https://dl.acm.org/doi/10.1145/1198555.1198768 \| doi=10.1145/1198555.1198768 \| chapter=Mapping computational concepts to GPUs \| title=ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05 \| year=2005 \| last1=Harris \| first1=Mark \| pages=50–es \| isbn=9781450378338 \| s2cid=8212423 }}</ref><!-- It doesn't match even with Intel and AMD. It's just OK for FP. --> This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs. Some GPU architectures sacrifice IEEE compliance, while others lack double-precision. Efforts have occurred to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit to offloading the computing onto the GPU in the first place.<ref name="doublePrecisionOnGPU">[http://www.mathematik.tu-dortmund.de/papers/GoeddekeStrzodkaTurek2005.pdf Double precision on GPUs (Proceedings of ASIM 2005)] {{webarchive\|url=https://web.archive.org/web/20140821160055/http://www.mathematik.tu-dortmund.de/papers/GoeddekeStrzodkaTurek2005.pdf \|date=21 August 2014 }}: Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating Double Precision (FEM) Simulations with (GPUs). Proceedings of ASIM 2005{{snd}} 18th Symposium on Simulation Technique, 2005.</ref>

===Vectorization===

{{See also\|Vector_processor#GPU_vector_processing_features\|SIMD\|SWAR\|Single instruction, multiple threads{{!}}SIMT}}
{{Unreferenced section\|date=July 2017}}

Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.{{Disputed inline\|date=July 2025}} For example, if one color {{angbr\|R1, G1, B1}} is to be modulated by another color {{angbr\|R2, G2, B2}}, the GPU can produce the resulting color {{angbr\|R1R2, G1G2, B1B2}} in one operation.
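The modulation just described is a component-wise product. The following is a minimal CPU-side sketch in plain Python, not GPU code: it only models what the single vector operation computes, and the name `modulate` is a hypothetical helper chosen for illustration.

```python
from typing import Tuple

Color = Tuple[float, float, float]


def modulate(c1: Color, c2: Color) -> Color:
    """Return the component-wise product (R1*R2, G1*G2, B1*B2).

    A GPU with vector instructions would issue this as one operation;
    here the three multiplications are simply written as one expression.
    """
    return tuple(a * b for a, b in zip(c1, c2))


if __name__ == "__main__":
    # Modulating a light grey by a pure red keeps only the red channel.
    print(modulate((0.5, 0.5, 0.5), (1.0, 0.0, 0.0)))
```

Each output component depends only on the matching input components, which is what allows the hardware to compute all of them at once.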
This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).{{citation needed\|date=July 2017}} Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions, termed single instruction, multiple data ([[Single instruction, multiple data\|SIMD]]), have long been available on CPUs.{{citation needed\|date=July 2017}}

==GPU vs. CPU==

{{Original research section\|date=February 2015}}
{{Unreferenced section\|date=July 2017}}

Originally, data was simply passed one-way from a [[central processing unit]] (CPU) to a [[graphics processing unit]] (GPU), then to a [[display device]]. As time progressed, however, it became valuable for GPUs to store at first simple, then complex structures of data to be passed back to the CPU that analyzed an image, or a set of scientific data represented in a 2D or 3D format that a video card can understand. Because the GPU has access to every draw operation, it can analyze data in these forms quickly, whereas a CPU must poll every pixel or data element much more slowly, as the speed of access between a CPU and its larger pool of [[random-access memory]] (or, in an even worse case, a [[hard drive]]) is slower than GPUs and video cards, which typically contain smaller amounts of more expensive memory that is much faster to access. Transferring the portion of the data set to be actively analyzed to that GPU memory in the form of textures or other easily readable GPU forms results in a speed increase. The distinguishing feature of a GPGPU design is the ability to transfer information [[Duplex (telecommunications)\|bidirectionally]] back from the GPU to the CPU; ideally, the data throughput in both directions is high, resulting in a [[multiplier (coefficient)\|multiplier]] effect on the speed of a specific high-use [[algorithm]].
GPGPU pipelines may improve efficiency on especially large data sets and/or data containing 2D or 3D imagery. They are used in complex graphics pipelines as well as [[scientific computing]]; more so in fields with large data sets like [[genome mapping]], or where two- or three-dimensional analysis is useful{{snd}} especially at present [[biomolecule]] analysis, [[protein]] study, and other complex [[organic chemistry]]. One such application is the [[NVIDIA Parabricks\|NVIDIA software suite for genome analysis]].

Such pipelines can also vastly improve efficiency in [[image processing]] and [[computer vision]], among other fields, as well as [[Parallel computing\|parallel processing]] generally. Some very heavily optimized pipelines have yielded speed increases of several hundred times the original CPU-based pipeline on one high-use task.

A simple example would be a GPU program that collects data about average [[lighting]] values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use [[edge detection]] to return both numerical information and a processed image representing outlines to a [[computer vision]] program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every [[pixel]] or other picture element in an image, it can analyze and average it (for the first example) or apply a [[Sobel operator\|Sobel edge filter]] or other [[convolution]] filter (for the second) with much greater speed than a CPU, which typically must access slower [[random-access memory]] copies of the graphic in question.

GPGPU as a software concept is a type of [[algorithm]], not a piece of equipment.
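The two image examples above (averaging lighting values and applying a Sobel edge filter) can be sketched on the CPU as follows. This is a plain-Python illustration under stated assumptions, not GPU code: the image is assumed to be a list of rows of grey-level numbers, and `mean_luminance` and `sobel_magnitude` are hypothetical names. Each output pixel depends only on a 3×3 neighbourhood of the input, which is the property that lets a GPU compute every pixel independently.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def mean_luminance(image):
    """Average of all pixel values (the 'average lighting' example)."""
    total = sum(sum(row) for row in image)
    return total / (len(image) * len(image[0]))


def sobel_magnitude(image):
    """Gradient magnitude per pixel; border pixels are left at 0.0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            # Accumulate the 3x3 neighbourhood weighted by each kernel.
            for j in range(3):
                for i in range(3):
                    pixel = image[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * pixel
                    gy += SOBEL_Y[j][i] * pixel
            out[y][x] = math.sqrt(gx * gx + gy * gy)
    return out


if __name__ == "__main__":
    # A vertical edge: dark left half, bright right half.
    edge = [[0, 0, 1, 1] for _ in range(4)]
    print(mean_luminance(edge))
    print(sobel_magnitude(edge)[1])
```

On a GPU, the two outer loops would typically disappear: one thread or fragment would be launched per output pixel, and only the small inner accumulation would run in each.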
Specialized equipment designs may, however, even further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a ''rack''), which adds a third layer{{snd}} many computing units each using many CPUs to correspond to many GPUs. Some [[Bitcoin]] "miners" used such setups for high-quantity processing. Information on the largest such systems in the world is maintained in the [[TOP500]] supercomputer list.

===Caches===

===Register file===

GPUs have very large [[Register file\|register files]], which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs is 6 MiB, 14 MiB and 20 MiB, respectively.<ref>"[https://devblogs.nvidia.com/parallelforall/inside-pascal/ Inside Pascal: Nvidia’s Newest Computing Platform] {{webarchive\|url=https://web.archive.org/web/20170507110037/https://devblogs.nvidia.com/parallelforall/inside-pascal/ \|date=7 May 2017 }}"</ref><ref>"[https://devblogs.nvidia.com/inside-volta/ Inside Volta: The World’s Most Advanced Data Center GPU] {{webarchive\|url=https://web.archive.org/web/20200101171030/https://devblogs.nvidia.com/inside-volta/ \|date=1 January 2020 }}"</ref> By comparison, the size of a [[Processor register\|register file on CPUs]] is small, typically tens or hundreds of kilobytes.

In essence: almost all GPU workloads are inherently massively-parallel LOAD-COMPUTE-STORE in nature, such as [[Tiled rendering]].
Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive due to the [[Random-access_memory#Memory_wall\|Memory wall]] problem that it is to be avoided at all costs.<ref>{{cite book \| last1=Li \| first1=Jie \| last2=Michelogiannakis \| first2=George \| last3=Cook \| first3=Brandon \| last4=Cooray \| first4=Dulanya \| last5=Chen \| first5=Yong \| title=High Performance Computing \| chapter=Analyzing Resource Utilization in an HPC System: A Case Study of NERSC's Perlmutter \| series=Lecture Notes in Computer Science \| date=2023 \| volume=13948 \| pages=297–316 \| doi=10.1007/978-3-031-32041-5_16 \| isbn=978-3-031-32040-8 \| chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-32041-5_16 }}</ref> The result is that register file size ''has'' to increase. In standard CPUs it is possible to introduce [[Cache (computing)\|caches]] (a [[D-cache]]) to solve this problem; however, these are relatively large, which makes them impractical in GPUs, where one would be needed per Processing Element. [[ILLIAC IV]] innovatively solved the problem around 1967 by introducing a local memory per Processing Element (a PEM): a strategy copied by the [[Flynn%27s_taxonomy#Associative_processor\|Aspex ASP]].
===Energy efficiency===
The high performance of GPUs comes at the cost of high power consumption: under full load, a GPU can draw as much power as the rest of the PC system combined.<ref>"https://www.tomshardware.com/reviews/geforce-radeon-power,2122.html How Much Power Does Your Graphics Card Need?"</ref> The maximum power consumption of the Pascal series GPU (Tesla P100) was specified to be 250 W.<ref>"https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf Nvidia Tesla P100 GPU Accelerator {{webarchive\|url=https://web.archive.org/web/20180724140610/https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf \|date=24 July 2018 }}"</ref>

== Classical GPGPU ==
Before CUDA was published in 2007, GPGPU was "classical" and involved repurposing graphics primitives. A typical computation was structured as follows:
# Load arrays into textures
# Draw a quadrangle
# Apply pixel shaders and textures to the quadrangle
# Read out the pixel values in the quadrangle as an array
More examples are available in part 4 of ''GPU Gems 2''.<ref>{{Cite book \|url=https://developer.nvidia.com/gpugems/gpugems2/ \|title=GPU gems 2: programming techniques for high-performance graphics and general-purpose computation \|date=2006 \|publisher=Addison-Wesley \|isbn=978-0-321-33559-3 \|editor-last=Pharr \|editor-first=Matt \|edition=3. print \|location=Upper Saddle River, NJ Munich \|chapter=Part IV: General-Purpose Computation on GPUS: A Primer \|chapter-url=https://developer.nvidia.com/gpugems/gpugems2/part-iv-general-purpose-computation-gpus-primer}}</ref>

=== Linear algebra ===
GPUs have been used for numerical linear algebra since at least 2001.<ref>{{Cite book \|last1=Larsen \|first1=E. Scott \|last2=McAllister \|first2=David \|chapter=Fast matrix multiplies using graphics hardware \|date=2001-11-10 \|title=Proceedings of the 2001 ACM/IEEE conference on Supercomputing \|chapter-url=https://dl.acm.org/doi/10.1145/582034.582089 \|language=en \|publisher=ACM \|pages=55 \|doi=10.1145/582034.582089 \|isbn=978-1-58113-293-9}}</ref> They have been used for Gauss-Seidel solvers, conjugate gradients, and other methods.<ref>{{Cite book \|last1=Krüger \|first1=Jens \|last2=Westermann \|first2=Rüdiger \|title=ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05 \|date=2005 \|chapter=Linear algebra operators for GPU implementation of numerical algorithms \|chapter-url=http://portal.acm.org/citation.cfm?doid=1198555.1198795 \|language=en \|publisher=ACM Press \|pages=234 \|doi=10.1145/1198555.1198795}}</ref>

==Stream processing==

====Flow control====
For accurate technical information on this topic see [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication]] and ILLIAC IV [[ILLIAC IV#Branches\|"branching"]] (the term "predicate mask" did not exist in 1967).

In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.<ref name="book">{{cite web\|url=https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter34.html\|title=GPU Gems – Chapter 34, GPU Flow-Control Idioms}}</ref><!--not really, branching could be zeroed out even on NV20, which gives roughly the same result--> Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible. Recent{{When\|date=July 2024}} GPUs allow branching, but usually with a performance penalty.
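
The conditional writes mentioned above can be sketched as follows. This is an illustrative Python model, not shader code: the <code>for</code> loop stands in for independent parallel lanes, the comparison result is assumed to be available as a 0/1 mask, and the helper names <code>select_arith</code>, <code>select_bits</code> and <code>branchless_abs</code> are hypothetical.

```python
def select_arith(mask, a, b):
    """Return a where mask is 1 and b where mask is 0, using only arithmetic."""
    return mask * a + (1 - mask) * b


def select_bits(mask, a, b):
    """Same selection for integers, using only bit operations."""
    full = -mask                      # 0 stays 0, 1 becomes -1 (all bits set)
    return (a & full) | (b & ~full)


def branchless_abs(values):
    """Absolute value per element; both candidates are computed, one is kept."""
    out = []
    for v in values:                  # each iteration models one lane
        m = int(v < 0)                # predicate as a 0/1 mask
        out.append(select_arith(m, -v, v))
    return out


print(branchless_abs([-3, 0, 4]))
print(select_bits(1, 5, 9), select_bits(0, 5, 9))
```

Every lane executes the same instruction sequence regardless of the data, which is why this style worked on hardware without branch support; the price is that both alternatives are always evaluated.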
Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such as static branch resolution, pre-computation, predication, loop splitting,<ref name="Tutorial on eliminating branches">[https://web.archive.org/web/20110603193749/http://www.futurechips.org/tips-for-power-coders/basic-technique-to-help-branch-prediction.html Future Chips]. "Tutorial on removing branches", 2011</ref> and Z-cull<ref name="survey">[http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907 GPGPU survey paper] {{webarchive\|url=https://web.archive.org/web/20070104090919/http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907 \|date=4 January 2007 }}: John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Tim Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, volume 26, number 1, 2007, pp. 80–113.</ref> can be used to achieve branching when hardware support does not exist.

===GPU methods===

====Scan====
The scan operation, also termed ''[[prefix sum#Parallel algorithm\|parallel prefix sum]]'', takes in a vector (stream) of data elements and an [[monoid\|(arbitrary) associative binary function '+' with an identity element 'i']]. If the input is [a0, a1, a2, a3, ...], an ''exclusive scan'' produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an ''inclusive scan'' produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and [[semigroup\|does not require an identity]] to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation is used in, for example, quicksort and sparse matrix-vector multiplication.<ref name=goddeke2010 /><ref>{{cite web\|url=http://www.idav.ucdavis.edu/func/return_pdf?pub_id=915\|title=S. Sengupta, M. Harris, Y. Zhang, J. D. Owens, 2007. Scan primitives for GPU computing. In T. Aila and M. Segal (eds.): Graphics Hardware (2007).\|url-status=dead\|archive-url=https://web.archive.org/web/20150605081020/http://www.idav.ucdavis.edu/func/return_pdf?pub_id=915\|archive-date=5 June 2015\|df=dmy-all\|access-date=16 December 2014}}</ref><ref>{{cite journal \| last1 = Blelloch \| first1 = G. E. \| year = 1989 \| title = Scans as primitive parallel operations \| url = http://www.cs.berkeley.edu/~knight/cs267/papers/scan_primitive.pdf \| journal = IEEE Transactions on Computers \| volume = 38 \| issue = 11 \| pages = 1526–1538 \| doi = 10.1109/12.42122 \| url-status = dead \| archive-url = https://web.archive.org/web/20150923211604/http://www.cs.berkeley.edu/~knight/cs267/papers/scan_primitive.pdf \| archive-date = 23 September 2015 \| df = dmy-all \| access-date = 16 December 2014 }}</ref><ref>{{cite web\|url=https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda\|title=M. Harris, S. Sengupta, J. D. Owens. Parallel Prefix Sum (Scan) with CUDA. In Nvidia: GPU Gems 3, Chapter 39. }}</ref>

====Scatter====

====Search====
The search operation allows the programmer to find a given element within the stream, or possibly find neighbors of a specified element. Mostly the search method used is [[binary search]] on sorted elements.
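
Binary search on sorted elements, and the lookup of the neighbors of a specified element, can be sketched as below. This is a sequential, CPU-side illustration in Python using the standard <code>bisect</code> module, not a GPU implementation; the function names <code>find</code> and <code>neighbors</code> and the sample data are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def find(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1


def neighbors(sorted_items, target):
    """Return the nearest stored values strictly below and above target.

    Either side is None when no such element exists.
    """
    lo = bisect_left(sorted_items, target)    # first index with value >= target
    hi = bisect_right(sorted_items, target)   # first index with value > target
    below = sorted_items[lo - 1] if lo > 0 else None
    above = sorted_items[hi] if hi < len(sorted_items) else None
    return below, above


data = [2, 3, 5, 7, 11, 13]
print(find(data, 7), find(data, 8))
print(neighbors(data, 7), neighbors(data, 8))
```

Each lookup touches only about log2(n) elements, which is why the stream is kept sorted before searching.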
====Data structures====

The following are some of the areas where GPUs have been used for general purpose computing:
* [[Automatic parallelization]]<ref>Leung, Alan, Ondřej Lhoták, and Ghulam Lashari. "[https://cormack.uwaterloo.ca/~olhotak/pubs/pppj09.pdf Automatic parallelization for graphics processing units]." Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. ACM, 2009.</ref><ref>Henriksen, Troels, Martin Elsman, and Cosmin E. Oancea. "[https://futhark-lang.org/publications/fhpc14.pdf Size slicing: a hybrid approach to size inference in futhark]." Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing. ACM, 2014.</ref><ref>{{Cite book \|chapter-url=https://www.researchgate.net/publication/221235428 \|doi=10.1145/1375527.1375562 \|chapter=A compiler framework for optimization of affine loop nests for gpgpus \|title=Proceedings of the 22nd annual international conference on Supercomputing - ICS '08 \|year=2008 \|last1=Baskaran \|first1=Muthu Manikandan \|last2=Bondhugula \|first2=Uday \|last3=Krishnamoorthy \|first3=Sriram \|last4=Ramanujam \|first4=J. \|last5=Rountev \|first5=Atanas \|last6=Sadayappan \|first6=P. \|page=225 \|isbn=9781605581583 \|s2cid=6137960 }}</ref>
* [[Computational physics\|Physically based simulation]] and [[physics engine]]s<ref name="Joselli">{{cite book \| chapter-url=https://dl.acm.org/doi/10.1145/1401843.1401871 \| doi=10.1145/1401843.1401871 \| chapter=A new physics engine with automatic process distribution between CPU-GPU \| title=Proceedings of the 2008 ACM SIGGRAPH symposium on Video games \| date=2008 \| last1=Joselli \| first1=Mark \| last2=Clua \| first2=Esteban \| last3=Montenegro \| first3=Anselmo \| last4=Conci \| first4=Aura \| last5=Pagliosa \| first5=Paulo \| pages=149–156 \| isbn=978-1-60558-173-6 }}</ref> (usually based on [[Newtonian physics]] models)
** [[Conway's Game of Life]], [[cloth simulation]], fluid [[incompressible flow]] by solution of [[Euler equations (fluid dynamics)]]<ref>{{cite web\|url=https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-30-real-time-simulation-and-rendering-3d-fluids\|title=K. Crane, I. Llamas, S. Tariq, 2008. Real-Time Simulation and Rendering of 3D Fluids. In Nvidia: GPU Gems 3, Chapter 30. }}</ref> or [[Navier–Stokes equations]]<ref>{{cite web\|url=http://developer.nvidia.com/GPUGems/gpugems_ch38.html\|title=M. Harris, 2004. Fast Fluid Dynamics Simulation on the GPU. In Nvidia: GPU Gems, Chapter 38.\|work=NVIDIA Developer \|url-status=live\|archive-url=https://web.archive.org/web/20171007170306/https://developer.nvidia.com/GPUGems/gpugems_ch38.html\|archive-date=7 October 2017\|df=dmy-all}}</ref>
* [[Statistical physics]]
** [[Ising model]]<ref>{{cite journal \| arxiv=1007.3726 \| doi=10.1016/j.cpc.2010.05.005 \| title=Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model \| year=2010 \| last1=Block \| first1=Benjamin \| last2=Virnau \| first2=Peter \| last3=Preis \| first3=Tobias \| journal=Computer Physics Communications \| volume=181 \| issue=9 \| pages=1549–1556 \| bibcode=2010CoPhC.181.1549B \| s2cid=14828005 }}</ref>
* [[Lattice gauge theory]]<ref>{{cite web\|url=https://indico.fnal.gov/event/22303/contributions/245806/attachments/157699/206544/SnowmassTalk.pdf\|title=New Computational Trends in Lattice Gauge Theory\|last=Boyle\|first=Peter\|publisher=Lawrence Berkeley National Laboratory\|access-date=16 February 2025}}</ref>
* [[Segmentation (image processing)\|Segmentation]]{{snd}} 2D and 3D<ref>{{cite journal \| pmc=3657761 \| year=2011 \| last1=Sun \| first1=S. \| last2=Bauer \| first2=C. \| last3=Beichel \| first3=R. \| title=Automated 3-D Segmentation of Lungs with Lung Cancer in CT Data Using a Novel Robust Active Shape Model Approach \| journal=IEEE Transactions on Medical Imaging \| volume=31 \| issue=2 \| pages=449–460 \| doi=10.1109/TMI.2011.2171357 \| pmid=21997248 }}</ref>
* [[Level set methods]]
* [[Computed tomography\|CT]] reconstruction<ref>Jimenez, Edward S., and Laurel J. Orr. "[https://www.osti.gov/servlets/purl/1106909 Rethinking the union of computed tomography reconstruction and GPGPU computing]." Penetrating Radiation Systems and Applications XIV. Vol. 8854.
International Society for Optics and Photonics, 2013.</ref>
* [[Fast Fourier transform]]<ref>{{Cite journal \|url=https://www.researchgate.net/publication/5462925 \|doi=10.1109/TMI.2007.909834 \|title=Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware \|year=2008 \|last1=Sorensen \|first1=T.S. \|last2=Schaeffter \|first2=T. \|last3=Noe \|first3=K.O. \|last4=Hansen \|first4=M.S. \|journal=IEEE Transactions on Medical Imaging \|volume=27 \|issue=4 \|pages=538–547 \|pmid=18390350 \|bibcode=2008ITMI...27..538S \|s2cid=206747049 }}</ref>
* GPU learning{{snd}} [[machine learning]] and [[data mining]] computations, e.g., with software BIDMach
* [[k-nearest neighbor algorithm]]<ref>{{cite arXiv \| eprint=0804.1448 \| last1=Garcia \| first1=Vincent \| last2=Debreuve \| first2=Eric \| last3=Barlaud \| first3=Michel \| title=Fast k Nearest Neighbor Search using GPU \| year=2008 \| class=cs.CV }}</ref>
* [[Fuzzy logic]]<ref>{{Cite book \|chapter-url=https://www.researchgate.net/publication/224245725 \|doi=10.1109/CISDA.2011.5945947 \|chapter=Rapid prototyping of high performance fuzzy computing applications using high level GPU programming for maritime operations support \|title=2011 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA) \|year=2011 \|last1=Cococcioni \|first1=Marco \|last2=Grasso \|first2=Raffaele \|last3=Rixen \|first3=Michel \|pages=17–23 \|isbn=978-1-4244-9939-7 \|s2cid=2089441 }}</ref>
* [[Tone mapping]]
* [[Audio signal processing]]<ref>{{Cite book \|last=Whalen \|first=Sean \|title=Audio and the Graphics Processing Unit \|date=March 10, 2005 \|citeseerx=10.1.1.114.365}}</ref>
** Audio and sound effects processing, to use a [[GPU]] for [[digital signal processing]] (DSP)
** [[Analog signal processing]]
* Inverse [[discrete cosine transform]] (iDCT)
* Variable-length decoding (VLD), [[Huffman coding]]
* Inverse quantization ([[IQ]], not to be confused with [[Intelligence Quotient]])
* In-loop deblocking
* Bitstream processing ([[CAVLC]]/[[CABAC]]), which uses special-purpose hardware because it is a serial task not suitable for regular GPGPU computation
** [[Quantum mechanical]] physics
** [[Astrophysics]]<ref>{{cite web\|url=http://www.astro.lu.se/compugpu2010/\|title=Computational Physics with GPUs: Lund Observatory\|website=www.astro.lu.se\|url-status=live\|archive-url=https://web.archive.org/web/20100712062316/http://www.astro.lu.se/compugpu2010/\|archive-date=12 July 2010\|df=dmy-all}}</ref>
* [[Number theory]]
** [[Primality test]]ing and [[integer factorization]]<ref>{{cite web\|url=https://mersenne.org/various/works.php\|title=How GIMPS Works\|work=Great Internet Mersenne Prime Search\|access-date=6 March 2025}}</ref>
* [[Computational finance]]
* [[Bioinformatics]]<ref>{{cite journal\|doi=10.1186/1471-2105-8-474\|pmid=18070356\|pmc=2222658\|title=High-throughput sequence alignment using Graphics Processing Units\|journal=BMC Bioinformatics\|volume=8\|article-number=474\|year=2007\|last1=Schatz\|first1=Michael C\|last2=Trapnell\|first2=Cole\|last3=Delcher\|first3=Arthur L\|last4=Varshney\|first4=Amitabh \|doi-access=free }}</ref><ref name=Manavski2008>{{cite journal \|author=Svetlin A. Manavski \|author2=Giorgio Valle \|title=CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment \|journal=BMC Bioinformatics \|volume=9 \|issue=Suppl. 2 \|page=S10 \|date=2008 \|doi=10.1186/1471-2105-9-s2-s10 \|pmid=18387198 \|pmc=2323659 \|df=dmy-all \|doi-access=free }}</ref>
* [[Medical imaging]]
* [[Clinical decision support system]] (CDSS)<ref>{{cite journal\|last1=Olejnik\|first1=M\|last2=Steuwer\|first2=M\|last3=Gorlatch\|first3=S\|last4=Heider\|first4=D\|title=gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.\|journal=Bioinformatics\|date=15 November 2014\|volume=30\|issue=22\|pages=3272–3\|pmid=25123901\|doi=10.1093/bioinformatics/btu535\|doi-access=free}}</ref>
* [[Digital signal processing]] / [[signal processing]]
* [[Control engineering]]{{citation needed\|date=May 2019}}
* [[Operations research]]<ref>{{cite book \| chapter-url=https://ieeexplore.ieee.org/document/6651078 \| doi=10.1109/IPDPSW.2013.45 \| chapter=Recent Advances on GPU Computing in Operations Research
\| title=2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum \| year=2013 \| last1=Boyer \| first1=Vincent \| last2=El Baz \| first2=Didier \| pages=1778–1787 \| isbn=978-0-7695-4979-8 \| s2cid=2774188 \| url=https://hal.archives-ouvertes.fr/hal-01151607/file/4979b778.pdf }}</ref><ref>{{cite journal \|last1= Bukata \|first1= Libor \|last2= Sucha \|first2= Premysl \|last3= Hanzalek \|first3= Zdenek \|year= 2014 \|title= Solving the Resource Constrained Project Scheduling Problem using the parallel Tabu Search designed for the CUDA platform \|doi= 10.1016/j.jpdc.2014.11.005 \|journal= Journal of Parallel and Distributed Computing \|volume= 77\|pages= 58–68 \| arxiv= 1711.04556\|s2cid= 206391585 }}</ref><ref name=BaumeltZdenek>{{cite journal \|last1= Bäumelt \|first1= Zdeněk \|last2= Dvořák \|first2= Jan Line 270 ⟶ 289: \|pages=624–639}} </ref>
** Implementations: the GPU Tabu Search algorithm solving the Resource Constrained Project Scheduling problem is freely available on GitHub;<ref>[https://github.com/CTU-IIG CTU-IIG] {{webarchive\|url=https://web.archive.org/web/20160109193106/https://github.com/CTU-IIG \|date=9 January 2016 }} Czech Technical University in Prague, Industrial Informatics Group (2015).</ref> the GPU algorithm solving the [[Nurse scheduling problem]] is also freely available on GitHub.<ref>[https://github.com/CTU-IIG/NRRPGpu NRRPGpu] {{webarchive\|url=https://web.archive.org/web/20160109193106/https://github.com/CTU-IIG/NRRPGpu \|date=9 January 2016 }} Czech Technical University in Prague, Industrial Informatics Group (2015).</ref>
* [[Neural network]]s
* [[Database]] operations<ref>{{cite web \|url=http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf \|title=GPU-based Sorting in PostgreSQL \|author=Naju Mancheril \|work=School of Computer Science – Carnegie Mellon University
\|url-status=live \|archive-url=https://www.webcitation.org/60dQHCPfS?url=http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf \|archive-date=2 August 2011 \|df=dmy-all }}</ref>
* [[Computational Fluid Dynamics]] especially using [[Lattice Boltzmann methods]]
* [[Cryptography]]<ref>Manavski, Svetlin A. "[https://koala.cs.pub.ro/redmine/attachments/download/1745/cuda_aes.pdf CUDA compatible GPU as an efficient hardware accelerator for AES cryptography] {{Webarchive\|url=https://web.archive.org/web/20190507205236/https://koala.cs.pub.ro/redmine/attachments/download/1745/cuda_aes.pdf \|date=7 May 2019 }}." 2007 IEEE International Conference on Signal Processing and Communications. IEEE, 2007.</ref> and [[cryptanalysis]]
* Performance modeling: computationally intensive tasks on GPU<ref name="Hasan Khondker S. 2014 pp. 612–17"/>
** Implementations of: [[MD6]], [[Advanced Encryption Standard]] (AES),<ref>{{Cite book\|doi=10.1007/978-3-540-74735-2_15\|chapter=AES Encryption Implementation and Analysis on Commodity Graphics Processing Units\|title=Cryptographic Hardware and Embedded Systems - CHES 2007\|volume=4727\|pages=209\|series=Lecture Notes in Computer Science\|year=2007\|last1=Harrison\|first1=Owen\|last2=Waldron\|first2=John\|isbn=978-3-540-74734-5\|df=dmy-all\|citeseerx=10.1.1.149.7643}}</ref><ref>[http://www.usenix.org/events/sec08/tech/harrison.html AES and modes of operations on SM4.0 compliant GPUs.] {{webarchive\|url=https://web.archive.org/web/20100821131630/http://www.usenix.org/events/sec08/tech/harrison.html \|date=21 August 2010 }} Owen Harrison, John Waldron, Practical Symmetric Key Cryptography on Modern Graphics Hardware. In proceedings of USENIX Security 2008.</ref> [[Data Encryption Standard]] (DES), [[RSA (algorithm)\|RSA]],<ref>{{Cite book\|doi=10.1007/978-3-642-02384-2_22\|chapter=Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware\|title=Progress in Cryptology – AFRICACRYPT 2009\|volume=5580\|pages=350\|series=Lecture Notes in Computer Science\|year=2009\|last1=Harrison\|first1=Owen\|last2=Waldron\|first2=John\|isbn=978-3-642-02383-5\|df=dmy-all\|citeseerx=10.1.1.155.5448}}</ref> [[elliptic curve cryptography]] (ECC)
** [[Password cracking]]<ref name="gtri">{{cite web\|url=http://www.gtri.gatech.edu/casestudy/Teraflop-Troubles-Power-Graphics-Processing-Units-GPUs-Password-Security-System\|title=Teraflop Troubles: The Power of Graphics Processing Units May Threaten the World's Password Security System\|publisher=[[Georgia Tech Research Institute]]\|access-date=7 November 2010\|url-status=dead\|archive-url=https://web.archive.org/web/20101230063449/http://www.gtri.gatech.edu/casestudy/Teraflop-Troubles-Power-Graphics-Processing-Units-GPUs-Password-Security-System\|archive-date=30 December 2010\|df=dmy-all}}</ref><ref name="msnbc">{{cite news\|url=http://www.nbcnews.com/id/38771772\|archive-url=https://web.archive.org/web/20130711022009/http://www.nbcnews.com/id/38771772/\|url-status=dead\|archive-date=11 July 2013\|title=Want to deter hackers?
Make your password longer\|work=[[NBC News]]\|date=19 August 2010\|access-date=7 November 2010\|df=dmy-all}}</ref>
** [[Cryptocurrency]] transactions processing ("mining") ([[Bitcoin network#Mining\|Bitcoin mining]])
* [[Electronic design automation]]<ref>{{Cite news \|url=https://www.eetimes.com/viewpoint-mass-gpus-not-cpus-for-eda-simulations/ \|title=Viewpoint: Mass GPUs, not CPUs for EDA simulations \|first=Larry \|last=Lerner \|date=9 April 2009 \|access-date=14 September 2023 \|newspaper=EE Times }}</ref><ref> Line 307 ⟶ 326:

===Bioinformatics===
GPGPU usage in Bioinformatics:<ref name="Hasan Khondker S. 2014 pp. 612–17"/><ref name="nvidia.com">{{Cite web \|title=GPU-Accelerated Applications \|url=http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf \|url-status=live \|archive-url=https://web.archive.org/web/20130325031816/http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf \|archive-date=25 March 2013 \|access-date=2013-09-12 \|df=dmy-all }}</ref>

{\| class="wikitable"
! Application Line 374 ⟶ 392:

==See also==
* [[Fastra II]]
* Physics engine [[Advanced Simulation Library]] [[Physics processing unit]] (PPU)
* {{Annotated link\|Vector processor}}
* [[Close to Metal]]
* {{Annotated link\|Single instruction, multiple threads}}
* [[Audio processing unit]]
* [[Larrabee (microarchitecture)]]
* [[AI accelerator]]
* [[Deep learning processor]] (DLP)

==References==
{{Reflist}}

== Further reading ==
* {{Cite journal \|last1=Owens \|first1=J.D. \|last2=Houston \|first2=M. \|last3=Luebke \|first3=D. \|last4=Green \|first4=S. \|last5=Stone \|first5=J.E. \|last6=Phillips \|first6=J.C.
\|date=May 2008 \|title=GPU Computing \|journal=Proceedings of the IEEE \|volume=96 \|issue=5 \|pages=879–899 \|doi=10.1109/JPROC.2008.917757 \|s2cid=17091128 \|issn=0018-9219}}
* {{Cite journal \|last1=Brodtkorb \|first1=André R. \|last2=Hagen \|first2=Trond R. \|last3=Sætra \|first3=Martin L. \|date=2013-01-01 \|title=Graphics processing unit (GPU) programming strategies and trends in GPU computing \|url=https://www.sciencedirect.com/science/article/pii/S0743731512000998 \|journal=Journal of Parallel and Distributed Computing \|series=Metaheuristics on GPUs \|volume=73 \|issue=1 \|pages=4–13 \|doi=10.1016/j.jpdc.2012.04.003 \|issn=0743-7315\|hdl=10852/40283 \|hdl-access=free }}

{{Graphics Processing Unit}}

{{DEFAULTSORT:Gpgpu}}
[[Category:GPGPU\| ]]
[[Category:Graphics hardware]]
[[Category:Graphics cards]]