{{Short description|Execution model used in parallel computing}}
'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] in which a single central control unit broadcasts an instruction to multiple processing units (PUs), each of which ''optionally'' executes that instruction simultaneously and in lock-step on its own data. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971]]
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]] the CPU was a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
The processors, say a number {{mvar|p}} of them, seem to execute many more than {{mvar|p}} tasks. This is achieved by each processor having multiple "threads" (or "work-items" or "sequences of SIMD lane operations"), which execute in lock-step, and are analogous to [[SIMD lanes]].<ref>{{cite book |author1=Michael McCool |author2=James Reinders |author3=Arch Robison |title=Structured Parallel Programming: Patterns for Efficient Computation |publisher=Elsevier |year=2013 |page=52}}</ref>
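As an illustrative sketch (not drawn from the cited sources, with hypothetical array names and sizes), the following [[CUDA]] kernel requests far more threads than there are physical lanes; the hardware runs them as groups of 32 in lock-step, every thread executing the same instruction stream on its own element.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Every thread executes the same instruction stream; only the data
// (selected by the per-thread index) differs.
__global__ void scale(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // threads past the end are masked off
        out[i] = in[i] * factor;
}

int main(void)
{
    const int n = 1 << 20;        // hypothetical problem size
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    // About a million threads are requested, far more than the number
    // of physical lanes; the hardware schedules them in lock-step groups.
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
</syntaxhighlight>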
The SIMT execution model is only a way of presenting to the programmer what is fundamentally a predicated SIMD concept, and programs must be designed with predicated SIMD in mind. With instruction issue (a synchronous broadcast) handled by the single control unit, SIMT cannot ''by design'' allow threads (PEs, lanes) to diverge by branching, because only the control unit has a program counter. If possible, therefore, branching is to be avoided.<ref>{{Cite web | title=SIMT Model - Open Source General-Purpose Computing Chip Platform - Blue Porcelain(GPGPU) | url=https://gpgpuarch.org/en/basic/simt/ | access-date=2025-07-30 | website=gpgpuarch.org}}</ref>
<ref>{{Cite web | title=General-Purpose Graphics Processor Architecture - Chapter 3 - The SIMT Core: Instruction and Register Data Flow (Part 1) {{!}} FANnotes | url=https://www.fannotes.me/article/gpgpu_architecture/chapter_3_the_simt_core_instruction_and_register_data_flow_part_1 | access-date=2025-07-30 | website=www.fannotes.me}}</ref>
The simplest way to understand SIMT is to imagine a multi-core ([[Multiple_instruction,_multiple_data|MIMD]]) system, where each core has its own register file, its own [[Arithmetic logic unit|ALUs]] (both SIMD and scalar) and its own data cache. Unlike a standard multi-core system, which has multiple independent instruction caches, decoders and [[Program counter|program counter]] registers, in SIMT the instructions are synchronously '''broadcast''' to all cores from a '''single''' unit with a single instruction cache and a single instruction decoder, which reads instructions using a single program counter.
The key difference between SIMT and [[SIMD lanes]] is that each of the processing units in the SIMT array has its own local memory, and may have a completely different stack pointer (and thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a Processing Element (PE), and each PE had its own local memory, a Processing Element Memory (PEM).
In the [[ILLIAC IV]] the Burroughs B6500 primarily handled I/O, but also sent instructions to the Control Unit (CU), which would then handle the broadcasting to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.
Each PE may also be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication|predicated]].
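For illustration only (a sketch under the classic SIMT model, not taken from the cited sources), the following [[CUDA]] kernel shows how a data-dependent branch is handled by predication: both sides of the branch are issued, and lanes for which the condition is false simply sit idle, much as an inactive ILLIAC IV PE ignored the broadcast instruction.

<syntaxhighlight lang="cuda">
// Classic SIMT handling of a branch: there is no per-thread program
// counter, so the control unit issues BOTH paths and each lane is
// enabled (predicated) only on the path whose condition it satisfies.
__global__ void absval(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] < 0.0f)   // lanes failing the test idle on this path...
            out[i] = -in[i];
        else                // ...and the other lanes idle on this one
            out[i] = in[i];
    }
}
</syntaxhighlight>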
==History==
In [[Flynn's taxonomy]], Flynn's original papers cite two historic examples of SIMT processors termed "array processors": the [[ILLIAC IV#SOLOMON|SOLOMON]] and [[ILLIAC IV]].<ref>{{cite journal |last=Flynn |first=Michael J. |title=Some Computer Organizations and Their Effectiveness |journal=IEEE Transactions on Computers |volume=C-21 |issue=9 |date=September 1972 |pages=948–960 |doi=10.1109/TC.1972.5009071}}</ref>
SIMT was introduced by [[Nvidia|NVIDIA]] in the [[Tesla (microarchitecture)|Tesla GPU microarchitecture]] with the G80 chip.<ref>{{cite web |url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf |title=NVIDIA Fermi Compute Architecture Whitepaper |date=2009 |website=www.nvidia.com |publisher=NVIDIA Corporation |access-date=2014-07-17}}</ref><ref name=teslaPaper>{{cite journal |title=NVIDIA Tesla: A Unified Graphics and Computing Architecture |date=2008 |page=6 {{subscription required|s}} |doi=10.1109/MM.2008.31 |volume=28 |issue=2 |journal=IEEE Micro|last1=Lindholm |first1=Erik |last2=Nickolls |first2=John |last3=Oberman |first3=Stuart |last4=Montrym |first4=John |bibcode=2008IMicr..28b..39L |s2cid=2793450 }}</ref> [[ATI Technologies]], now [[Advanced Micro Devices|AMD]], released a competing product slightly later on May 14, 2007, the [[TeraScale (microarchitecture)#TeraScale 1|TeraScale 1]]-based ''"R600"'' GPU chip.
== Description ==
As the access time of all the widespread [[random-access memory|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]], etc.) is still relatively high, creating an effect called the [[memory wall]], engineers came up with the idea to hide the latency that inevitably comes with each memory access. Strictly, the latency hiding is a feature of the zero-overhead scheduling implemented by modern GPUs; this might or might not be considered a property of 'SIMT' itself.
The [[ILLIAC IV]], the world's first known SIMT processor, had its [[ILLIAC_IV#Branches|"branching"]] mechanism extensively documented; in modern terminology, however, it turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]].
SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference |first1=Sean |last1=Rul |first2=Hans |last2=Vandierendonck |first3=Joris |last3=D’Haene |first4=Koen |last4=De Bosschere |title=An experimental study on performance portability of OpenCL kernels |year=2010 |conference=Symp. Application Accelerators in High Performance Computing (SAAHPC)|hdl=1854/LU-1016024 |hdl-access=free }}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of [[Nvidia|NVIDIA]] and [[AMD]]) in combination with latency hiding to enable high-performance execution despite considerable latency in memory-access operations. In latency hiding, the processor is oversubscribed with computation tasks and is able to quickly switch between tasks when it would otherwise have to wait on memory. This strategy is comparable to [[Hyperthreading|hyperthreading in CPUs]].<ref>{{cite web |url=http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/12-advanced_topics_in_cuda.pdf |title=Advanced Topics in CUDA |date=2011 |website=cc.gatech.edu |access-date=2014-08-28}}</ref> As with SIMD, another major benefit is the sharing of the control logic by many data lanes, leading to an increase in computational density: one block of control logic can manage N data lanes, instead of replicating the control logic N times.
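As a sketch of latency hiding (illustrative, not from the cited sources): a memory-bound [[CUDA]] kernel is launched with many more warps than the hardware can execute at once, so the scheduler always has runnable warps to issue while others wait on loads.

<syntaxhighlight lang="cuda">
// Memory-bound kernel: each global load can take hundreds of cycles.
__global__ void copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];     // a warp stalls here waiting on memory...
}

// ...so the launch oversubscribes the machine with warps; whenever one
// warp stalls on a load, the zero-overhead scheduler issues another:
//     copy<<<(n + 255) / 256, 256>>>(in, out, n);
</syntaxhighlight>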
A downside of SIMT execution is the fact that, as there is only one program counter, [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]] is the only strategy to control per-Processing-Element execution, leading to poor utilization in complex algorithms.
== Terminology ==
{| class="wikitable" style="text-align: center"
|+ SIMT Terminology
! NVIDIA [[CUDA]] || [[OpenCL]] || Hennessy & Patterson<ref>{{cite book |author1=John L. Hennessy |author2=David A. Patterson |title=Computer Architecture: A Quantitative Approach |edition=6th |publisher=Morgan Kaufmann |year=2019 |isbn=978-0128119051}}</ref>
|-
| Thread || Work-item || Sequence of SIMD Lane operations
|-
| Warp || Sub-group || Thread of SIMD instructions
|-
| Block || Work-group || Body of vectorized loop
|-
| Grid || NDRange || Vectorized loop
|}
NVIDIA GPUs have a concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the [[Graphics Core Next#Compute units|"wavefront"]], although it is composed of 64 hardware threads. In OpenCL, the abstract term "sub-group" covers both warps and wavefronts. CUDA also has warp shuffle instructions, which make parallel data exchange in the thread group faster.
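For illustration (a sketch, not taken from vendor documentation), a warp-level sum using the <code>__shfl_down_sync</code> shuffle intrinsic: each step moves values directly between lanes' registers, with no round-trip through shared memory.

<syntaxhighlight lang="cuda">
// Warp-level reduction: the 32 lanes of one warp sum their values by
// exchanging registers directly via shuffle, with no shared memory.
// Launch with a single warp, e.g. warp_sum<<<1, 32>>>(in, out);
__global__ void warp_sum(const float *in, float *out)
{
    float v = in[threadIdx.x];                   // one value per lane
    // Each step halves the distance: lanes add the value 16, 8, 4, 2
    // and finally 1 position away, accumulating the total in lane 0.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0)
        *out = v;                                // lane 0 holds the sum
}
</syntaxhighlight>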
== See also ==