{{Short description|Execution model used in parallel computing}}
'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] in which a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" (PUs), each of which ''optionally'' executes that one instruction simultaneously and in synchrony with the others. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971.]]
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]] the role of the CPU was filled by a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
The key difference between SIMT and [[SIMD lanes]] is that each Processing Unit in the SIMT array has its own local memory and may hold a completely different stack pointer (and may thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]], in which each SIMT core was termed a Processing Element (PE), with its own local memory (a Processing Element Memory, or PEM).
In the [[ILLIAC IV]] the Burroughs B6500 primarily handled I/O, but it also sent instructions to the Control Unit (CU), which would then broadcast them to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.
Additionally, each PE may be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication|Predicated]].
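These properties carry over directly to modern SIMT programming languages. The following CUDA sketch is purely illustrative (the kernel name <code>gather_scale</code> and its parameters are hypothetical, not from any source cited here): each thread forms its own, fully independent memory address, and threads failing the guard condition are masked off for the broadcast instruction, much as an inactive PE sits idle.

<syntaxhighlight lang="cuda">
// Illustrative kernel: each thread computes its own, fully independent
// memory address (src[index[tid]]), which a pure SIMD ALU lane with no
// address registers cannot do on its own.
__global__ void gather_scale(const float *src, const int *index,
                             float *dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads failing the guard are masked off and sit idle for the
    // broadcast instruction, much like an inactive ILLIAC IV PE.
    if (tid < n)
        dst[tid] = 2.0f * src[index[tid]];  // independent address per thread
}
</syntaxhighlight>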
== Description ==
SIMT processors execute multiple "threads" (or "work-items", or "sequences of SIMD lane operations") in lock-step, under the control of a single central unit. The term SIMT itself was introduced by [[Nvidia]] with its Tesla GPU architecture.
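As a concrete illustration (a conventional CUDA example; the names are assumptions, not taken from the sources cited here), every thread below runs the same instruction stream on its own data element, and the hardware issues the threads in lock-step groups:

<syntaxhighlight lang="cuda">
#include <cstdio>

// Illustrative kernel: every thread executes this same instruction
// stream, broadcast by shared control logic, on its own data element.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block; the hardware further groups each block
    // into warps of 32 threads that are issued in lock-step.
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
</syntaxhighlight>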
The [[ILLIAC IV]], the world's first known SIMT processor, had its [[ILLIAC_IV#Branches|"branching"]] mechanism extensively documented; in modern terminology, however, that mechanism turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]].
As the access time of all the widespread [[random-access memory|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]], etc.) is still relatively high, creating an effect called the [[memory wall]], engineers came up with the idea to hide the latency that inevitably comes with each memory access. Strictly, the latency-hiding is a feature of the zero-overhead scheduling implemented by modern GPUs; this might or might not be considered a property of 'SIMT' itself.
SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference |first1=Sean |last1=Rul |first2=Hans |last2=Vandierendonck |first3=Joris |last3=D’Haene |first4=Koen |last4=De Bosschere |title=An experimental study on performance portability of OpenCL kernels |year=2010 |conference=Symp. Application Accelerators in High Performance Computing (SAAHPC) |hdl=1854/LU-1016024 |hdl-access=free}}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of [[Nvidia|NVIDIA]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. With latency hiding, the processor is oversubscribed with computation tasks and is able to switch quickly between tasks when it would otherwise have to wait on memory. This strategy is comparable to [[Hyperthreading|hyperthreading in CPUs]].<ref>{{cite web |url=http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/12-advanced_topics_in_cuda.pdf |title=Advanced Topics in CUDA |date=2011 |website=cc.gatech.edu |access-date=2014-08-28}}</ref> As with SIMD, another major benefit is the sharing of control logic by many data lanes, leading to an increase in computational density: one block of control logic can manage N data lanes instead of replicating the control logic N times.
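The degree of oversubscription can be estimated from the device itself. The following host-side CUDA sketch is illustrative only (the launch size of 65536 × 256 threads is an assumed example): it compares the number of threads a launch creates with the number the device can keep resident at once.

<syntaxhighlight lang="cuda">
#include <cstdio>

// Illustrative host-side sketch: compare the number of threads an
// (assumed) launch of 65536 blocks x 256 threads creates with the
// number of threads the device can actually keep resident at once.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const long long launched = 65536LL * 256;
    const long long resident = (long long)prop.multiProcessorCount
                             * prop.maxThreadsPerMultiProcessor;

    // A factor above 1.0 means more work than can run at once; the
    // zero-overhead scheduler draws on the surplus warps to hide the
    // latency of in-flight memory accesses.
    printf("oversubscription factor: %.1fx\n",
           (double)launched / (double)resident);
    return 0;
}
</syntaxhighlight>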
A downside of SIMT execution is that, as there is only one program counter, [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]] is the only strategy available to control per-Processing-Element execution, which leads to poor utilization in complex algorithms.
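A minimal CUDA sketch of this downside (hypothetical kernel and data): with one program counter per warp, an <code>if</code>/<code>else</code> whose condition varies within the warp forces both paths to be issued, each with part of the warp masked off.

<syntaxhighlight lang="cuda">
// Illustrative kernel: with a single program counter per warp, both
// sides of a divergent branch are issued one after the other, each
// with part of the warp masked off ("predicate masking").
__global__ void divergent(const int *flag, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (flag[i]) {          // pass 1: only threads with flag != 0 active
        data[i] *= 2.0f;
    } else {                // pass 2: only the complementary threads active
        data[i] += 1.0f;
    }
    // If flags alternate per thread, each pass runs at ~50% utilization
    // and the warp pays the cost of both paths combined.
}
</syntaxhighlight>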
== Terminology ==
{| class="wikitable" style="text-align: center"
|+ SIMT Terminology
! NVIDIA [[CUDA]] || [[OpenCL]] || Hennessy & Patterson<ref>{{cite book |author1=John L. Hennessy |author2=David A. Patterson |title=Computer Architecture: A Quantitative Approach |edition=6th |publisher=Morgan Kaufmann |year=2019 |isbn=978-0128119051}}</ref>
|-
| Thread || Work-item || Sequence of SIMD Lane operations
|-
| Warp || Sub-group || Thread of SIMD Instructions
|}
NVIDIA GPUs have a concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the [[Graphics Core Next#Compute units|"wavefront"]], although it is composed of 64 hardware threads. In OpenCL, the abstract term "sub-group" covers both warps and wavefronts. CUDA also has warp shuffle instructions, which make parallel data exchange within the thread group faster,<ref>{{Cite web|url=https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/|title=Faster Parallel Reductions on Kepler|date=February 14, 2014|website=NVIDIA Technical Blog}}</ref> and OpenCL supports a similar feature through the extension cl_khr_subgroups.<ref>{{Cite web|url=https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_khr_subgroups.html|title=cl_khr_subgroups(3)|website=registry.khronos.org}}</ref>
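A minimal sketch of such a shuffle in CUDA (the helper name <code>warp_reduce_sum</code> is hypothetical; <code>__shfl_down_sync</code> is the documented intrinsic): each step moves a value directly between the registers of two threads in the same warp, without a round trip through shared or global memory.

<syntaxhighlight lang="cuda">
// Illustrative warp-level reduction: each __shfl_down_sync moves a
// value directly between the registers of two threads in the same
// warp, with no round trip through shared or global memory.
__device__ float warp_reduce_sum(float v)
{
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 now holds the sum across the warp
}
</syntaxhighlight>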
== See also ==