{{Short description|Execution model used in parallel computing}}
'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] in which a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" (PUs), each of which ''optionally'' executes that one instruction simultaneously and in synchrony with the others. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971.]]
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]] the role of the CPU was filled by a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
The key difference between SIMT and [[SIMD lanes]] is that each Processing Unit in the SIMT array has its own local memory and may hold a completely different stack pointer (and may thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]], in which each SIMT core was termed a Processing Element (PE), with its own local memory (a Processing Element Memory, or PEM).
In the [[ILLIAC IV]] the Burroughs B6500 primarily handled I/O, but it also sent instructions to the Control Unit (CU), which would then broadcast them to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.
Additionally, each PE may be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication|Predicated]].
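These properties carry over directly to modern SIMT programming languages. The following CUDA sketch is purely illustrative (the kernel name <code>gather_scale</code> and its parameters are hypothetical, not from any source cited here): each thread forms its own, fully independent memory address, and threads failing the guard condition are masked off for the broadcast instruction, much as an inactive PE sits idle.

<syntaxhighlight lang="cuda">
// Illustrative kernel: each thread computes its own, fully independent
// memory address (src[index[tid]]), which a pure SIMD ALU lane with no
// address registers cannot do on its own.
__global__ void gather_scale(const float *src, const int *index,
                             float *dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads failing the guard are masked off and sit idle for the
    // broadcast instruction, much like an inactive ILLIAC IV PE.
    if (tid < n)
        dst[tid] = 2.0f * src[index[tid]];  // independent address per thread
}
</syntaxhighlight>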
== Description ==
SIMT processors execute multiple "threads" (or "work-items", or "sequences of SIMD lane operations") in lock-step, under the control of a single central unit. The term SIMT itself was introduced by [[Nvidia]] with its Tesla GPU architecture.
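As a concrete illustration (a conventional CUDA example; the names are assumptions, not taken from the sources cited here), every thread below runs the same instruction stream on its own data element, and the hardware issues the threads in lock-step groups:

<syntaxhighlight lang="cuda">
#include <cstdio>

// Illustrative kernel: every thread executes this same instruction
// stream, broadcast by shared control logic, on its own data element.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block; the hardware further groups each block
    // into warps of 32 threads that are issued in lock-step.
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
</syntaxhighlight>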
The [[ILLIAC IV]], the world's first known SIMT processor, had its [[ILLIAC_IV#Branches|"branching"]] mechanism extensively documented; in modern terminology, however, that mechanism turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]].
As the access time of all the widespread [[random-access memory|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]], etc.) is still relatively high, creating an effect called the [[memory wall]], engineers came up with the idea to hide the latency that inevitably comes with each memory access. Strictly, the latency-hiding is a feature of the zero-overhead scheduling implemented by modern GPUs; this might or might not be considered a property of 'SIMT' itself.
SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference |first1=Sean |last1=Rul |first2=Hans |last2=Vandierendonck |first3=Joris |last3=D’Haene |first4=Koen |last4=De Bosschere |title=An experimental study on performance portability of OpenCL kernels |year=2010 |conference=Symp. Application Accelerators in High Performance Computing (SAAHPC) |hdl=1854/LU-1016024 |hdl-access=free}}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of [[Nvidia|NVIDIA]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. With latency hiding, the processor is oversubscribed with computation tasks and is able to switch quickly between tasks when it would otherwise have to wait on memory. This strategy is comparable to [[Hyperthreading|hyperthreading in CPUs]].<ref>{{cite web |url=http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/12-advanced_topics_in_cuda.pdf |title=Advanced Topics in CUDA |date=2011 |website=cc.gatech.edu |access-date=2014-08-28}}</ref> As with SIMD, another major benefit is the sharing of control logic by many data lanes, leading to an increase in computational density: one block of control logic can manage N data lanes instead of replicating the control logic N times.
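The degree of oversubscription can be estimated from the device itself. The following host-side CUDA sketch is illustrative only (the launch size of 65536 × 256 threads is an assumed example): it compares the number of threads a launch creates with the number the device can keep resident at once.

<syntaxhighlight lang="cuda">
#include <cstdio>

// Illustrative host-side sketch: compare the number of threads an
// (assumed) launch of 65536 blocks x 256 threads creates with the
// number of threads the device can actually keep resident at once.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const long long launched = 65536LL * 256;
    const long long resident = (long long)prop.multiProcessorCount
                             * prop.maxThreadsPerMultiProcessor;

    // A factor above 1.0 means more work than can run at once; the
    // zero-overhead scheduler draws on the surplus warps to hide the
    // latency of in-flight memory accesses.
    printf("oversubscription factor: %.1fx\n",
           (double)launched / (double)resident);
    return 0;
}
</syntaxhighlight>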
A downside of SIMT execution is that, as there is only one program counter, [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]] is the only strategy available to control per-Processing-Element execution, which leads to poor utilization in complex algorithms.
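A minimal CUDA sketch of this downside (hypothetical kernel and data): with one program counter per warp, an <code>if</code>/<code>else</code> whose condition varies within the warp forces both paths to be issued, each with part of the warp masked off.

<syntaxhighlight lang="cuda">
// Illustrative kernel: with a single program counter per warp, both
// sides of a divergent branch are issued one after the other, each
// with part of the warp masked off ("predicate masking").
__global__ void divergent(const int *flag, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (flag[i]) {          // pass 1: only threads with flag != 0 active
        data[i] *= 2.0f;
    } else {                // pass 2: only the complementary threads active
        data[i] += 1.0f;
    }
    // If flags alternate per thread, each pass runs at ~50% utilization
    // and the warp pays the cost of both paths combined.
}
</syntaxhighlight>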
== Terminology ==
{| class="wikitable" style="text-align: center"
|+ SIMT Terminology
! NVIDIA [[CUDA]] || [[OpenCL]] || Hennessy & Patterson<ref>{{cite book |author1=John L. Hennessy |author2=David A. Patterson |title=Computer Architecture: A Quantitative Approach |edition=6th |publisher=Morgan Kaufmann |year=2019 |isbn=978-0128119051}}</ref>
|-
| Thread || Work-item || Sequence of SIMD Lane operations
|-
| Warp || Sub-group || Thread of SIMD Instructions
|}
NVIDIA GPUs have a concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the [[Graphics Core Next#Compute units|"wavefront"]], although it is composed of 64 hardware threads. In OpenCL, the abstract term "sub-group" covers both warps and wavefronts. CUDA also has warp shuffle instructions, which make parallel data exchange within the thread group faster,<ref>{{Cite web|url=https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/|title=Faster Parallel Reductions on Kepler|date=February 14, 2014|website=NVIDIA Technical Blog}}</ref> and OpenCL supports a similar feature through the extension cl_khr_subgroups.<ref>{{Cite web|url=https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_khr_subgroups.html|title=cl_khr_subgroups(3)|website=registry.khronos.org}}</ref>
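A minimal sketch of such a shuffle in CUDA (the helper name <code>warp_reduce_sum</code> is hypothetical; <code>__shfl_down_sync</code> is the documented intrinsic): each step moves a value directly between the registers of two threads in the same warp, without a round trip through shared or global memory.

<syntaxhighlight lang="cuda">
// Illustrative warp-level reduction: each __shfl_down_sync moves a
// value directly between the registers of two threads in the same
// warp, with no round trip through shared or global memory.
__device__ float warp_reduce_sum(float v)
{
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 now holds the sum across the warp
}
</syntaxhighlight>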
== See also ==