Single instruction, multiple threads: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 09:09, 10 March 2016 edit ScotXW (talk \| contribs) Extended confirmed users 14,442 edits m ref fix ← Previous edit		Latest revision as of 19:43, 24 August 2025 edit undo JonH (talk \| contribs) Extended confirmed users 1,505 edits →Terminology: 6 ed no longer available at Internet Archive, correct the copyright date and ISBN
(116 intermediate revisions by 47 users not shown)
Line 1: {{Short description\|Parallel computing execution model}} '''Single instruction, multiple thread''' (SIMT) is a [[parallel computing\|parallel]] execution model, used in some [[GPGPU]] platforms, where [[Thread (computing)#Multithreading\|multithreading]] is simulated by [[SIMD]] processors. The processors, say a number {{mvar\|p}} of them, seem to execute many more than {{mvar\|p}} tasks. This is achieved by each processor having multiple "threads" (or "work-items" or "Sequence of SIMD Lane operations"), which execute in lock-step, and are analogous to SIMD "lanes".<ref>{{cite book \|author1=Michael McCool \|author2=James Reinders \|author3=Arch Robison \|title=Structured Parallel Programming: Patterns for Efficient Computation \|publisher=Elsevier \|year=2013 \|page=52}}</ref>▼ '''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] where a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" for them to all ''optionally'' perform simultaneous synchronous and fully-independent parallel execution of that one instruction. Each PU has its own independent data and address registers, its own independent Memory, but no PU in the array has a [[Program counter]]. In [[Flynn's taxonomy\|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''. SIMT was introduced by [[Nvidia]]:<ref>{{cite web \|url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf \|title=Nvidia Fermi Compute Arcitecture Whitepaper \|date=2009 \|website=http://www.nvidia.com/ \|publisher=NVIDIA Corporation \|accessdate=2014-07-17}}</ref><ref name=teslaPaper>{{cite web \|url=http://dx.doi.org/10.1109/MM.2008.31 \|title=NVIDIA Tesla: A Unified Graphics and Computing Architecture \|date=2008 \|website=http://www.ieee.org/ \|publisher=IEEE \|accessdate=2014-08-07 \|page=6 {{subscription required\|s}} }}</ref> [[Image:ILLIAC_IV.jpg\|thumb\|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971<ref name="auto">{{Cite web\| title=An introductory description of the Illiac IV system \| url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf \| archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf \| archive-date=2024-04-27}}</ref>]] {{Quote\| [The G80 Nvidia GPU architecture, [[Tesla (microarchitecture)\|Tesla]]] introduced the single-instruction multiple-thread (SIMT) execution model where multiple independent threads execute concurrently using a single instruction.}} The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU), e.g. some [[supercomputer]]s combine CPUs with GPUs: in the [[ILLIAC IV]] that CPU was a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors\|Burroughs B6500]]. SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference \|first1=Sean \|last1=Rul \|first2=Hans \|last2=Vandierendonck \|first3=Joris \|last3=D’Haene \|first4=Koen \|last4=De Bosschere \|title=An experimental study on performance portability of OpenCL kernels \|year=2010 \|conference=Symp. Application Accelerators in High Performance Computing (SAAHPC)}}</ref> and is used in modern GPUs (including, but not limited to those of [[Nvidia]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. This is where the processor is oversubscribed with computation tasks, and is able to quickly switch between tasks when it would otherwise have to wait on memory. This strategy is comparable to [[Multithreading (computer architecture)\|multithreading in CPUs]] (not to be confused with [[Multi-core processor\|multi-core]]).<ref>{{cite web \|url=http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/12-advanced_topics_in_cuda.pdf \|title=Advanced Topics in CUDA \|date=2011 \|website=cc.gatech.edu \|accessdate=2014-08-28}}</ref>▼ The SIMT execution model is still only a way to present to the programmer what is fundamentally still a Predicated SIMD concept. Programs must be designed with Predicated SIMD in mind. With Instruction Issue (as a synchronous broadcast) being handled by the single Control Unit, SIMT cannot ''by design'' allow threads (PEs, Lanes) to diverge by branching, because only the Control Unit has a Program Counter. If possible, therefore, branching is to be avoided.<ref>{{Cite web \| title=SIMT Model - Open Source General-Purpose Computing Chip Platform - Blue Porcelain(GPGPU) \| url=https://gpgpuarch.org/en/basic/simt/ \| access-date=2025-07-30 \| website=gpgpuarch.org}}</ref> ~~<!-- Strictly, the latency-hiding is a feature of the zero-overhead scheduling implemented by modern GPUs... this might or might not be considered to be a property of 'SIMT' itself -->~~ <ref>{{Cite web \| title=General-Purpose Graphics Processor Architecture - Chapter 3 - The SIMT Core: Instruction and Register Data Flow (Part 1) {{!}} FANnotes \| url=https://www.fannotes.me/article/gpgpu_architecture/chapter_3_the_simt_core_instruction_and_register_data_flow_part_1 \| access-date=2025-07-30 \| website=www.fannotes.me}}</ref> == Differences from other models == A downside of SIMT execution is the fact that thread-specific control-flow is performed using "masking", leading to poor utilisation where a processor's threads follow different control-flow paths. For instance, to handle an ''if''-''else'' block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.<ref>{{cite book \|author1=Michael McCool \|author2=James Reinders \|author3=Arch Robison \|title=Structured Parallel Programming: Patterns for Efficient Computation \|publisher=Elsevier \|year=2013 \|pages=209 ff.}}</ref> The simplest way to understand SIMT is to imagine a multi-core ([[Multiple_instruction,_multiple_data\|MIMD]]) system, where each core has its own register file, its own [[Arithmetic logic unit\|ALUs]] (both SIMD and Scalar) and its own data cache, but that unlike a standard multi-core system which has multiple independent instruction caches and decoders, as well as multiple independent Program Counter registers, the instructions are synchronously '''broadcast''' to all SIMT cores from a '''single''' unit with a single instruction cache and a single instruction decoder which reads instructions using a single Program Counter. {\| class="wikitable" style="style="font-size:80%; text-align: center"▼ ~~! Nvidia [[CUDA]] \|\| AMD [[OpenCL]] \|\| Henn&Pratt~~ The key difference between SIMT and [[SIMD lanes]] is that each of the Processing Units in the SIMT Array have their own local memory, and may have a completely different Stack Pointer (and thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se, and have no [[register file]]. This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a processing element (PE), and each PE had its own separate Memory (PEM). Each PE had an "Index register" which was an address into its PEM.<ref>{{Cite web\|url=https://www.researchgate.net/publication/2992993\|title=The Illiac IV system}}</ref><ref name="auto"/> In the [[ILLIAC IV]] the Burroughs B6500 primarily handled I/O, but also sent instructions to the Control Unit (CU), which would then handle the broadcasting to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs. Additionally, each PE may be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication\|Predicated]]. Also important to note is the difference between SIMT and [[SPMD]] - Single Program Multiple Data. SPMD, like standard multi-core systems, has multiple Program Counters, where SIMT only has one: in the (one) Control Unit. ==History== In [[Flynn's taxonomy]], Flynn's original papers cite two historic examples of SIMT processors termed "Array Processors": the [[ILLIAC IV#SOLOMON\|SOLOMON]] and [[ILLIAC IV]].<ref name="auto"/> SIMT was introduced by [[Nvidia\|NVIDIA]] in the [[Tesla (microarchitecture)\|Tesla GPU microarchitecture]] with the G80 chip.<ref>{{cite web \|url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf \|title=NVIDIA Fermi Compute Architecture Whitepaper \|date=2009 \|website=www.nvidia.com \|publisher=NVIDIA Corporation \|access-date=2014-07-17}}</ref><ref name=teslaPaper>{{cite journal \|title=NVIDIA Tesla: A Unified Graphics and Computing Architecture \|date=2008 \|page=6 {{subscription required\|s}} \|doi=10.1109/MM.2008.31 \|volume=28 \|issue=2 \|journal=IEEE Micro\|last1=Lindholm \|first1=Erik \|last2=Nickolls \|first2=John \|last3=Oberman \|first3=Stuart \|last4=Montrym \|first4=John \|bibcode=2008IMicr..28b..39L \|s2cid=2793450 }}</ref> [[ATI Technologies]], now [[Advanced Micro Devices\|AMD]], released a competing product slightly later on May 14, 2007, the [[TeraScale (microarchitecture)#TeraScale 1\|TeraScale 1]]-based ''"R600"'' GPU chip. == Description == ▲~~'''Single instruction, multiple thread''' (~~SIMT~~) is a [[parallel computing\|parallel]] execution model, used in some [[GPGPU]] platforms, where [[Thread (computing)#Multithreading\|multithreading]] is simulated by [[SIMD]]~~ processors~~. The processors, say a number {{mvar\|p}} of them, seem to~~ execute ~~many more than {{mvar\|p}} tasks. This is achieved by each processor having~~ multiple "threads" (or "work-items" or "Sequence of SIMD Lane operations"), ~~which execute~~ in lock-step, ~~and~~under ~~are~~the ~~analogous~~control toof a single central unit. The model shares common features with [[SIMD "lanes"]].<ref>{{cite book \|author1=Michael McCool \|author2=James Reinders \|author3=Arch Robison \|title=Structured Parallel Programming: Patterns for Efficient Computation \|publisher=Elsevier \|year=2013 \|page=52}}</ref> The [[ILLIAC IV]] as the world's first known SIMT processor had its [[ILLIAC_IV#Branches\|"branching"]] mechanism extensively documented, however fascinatingly it turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication\|"predicate masking"]] in modern terminology. As access time of all the widespread [[random-access memory\|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]], etc.) is still relatively high, creating an effect called the [[memory wall]], engineers came up with the idea to hide the latency that inevitably comes with each memory access. As shown in the design of the ILLIAC IV, the individual PEs run at a slower clock rate than a standard CPU, but make up for the lack of clock rate by running massively more such PEs in parallel. The upshot is that each PE's slower speed is better matched to the speed of RAM. The strategy works due to GPU workloads being inherently parallel, and an example is [[tiled rendering]]. ▲SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference \|first1=Sean \|last1=Rul \|first2=Hans \|last2=Vandierendonck \|first3=Joris \|last3=D’Haene \|first4=Koen \|last4=De Bosschere \|title=An experimental study on performance portability of OpenCL kernels \|year=2010 \|conference=Symp. Application Accelerators in High Performance Computing (SAAHPC)\|hdl=1854/LU-1016024 \|hdl-access=free }}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (~~including,~~such ~~but not limited to~~as those of [[Nvidia\|NVIDIA]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. ~~This~~As iswith ~~where~~SIMD, ~~the~~another ~~processor~~major ~~is oversubscribed with computation tasks, and~~benefit is ~~able~~the tosharing ~~quickly~~of ~~switch~~the ~~between~~control ~~tasks~~logic ~~when~~by itmany ~~would~~data ~~otherwise~~lanes, ~~have~~leading to ~~wait~~an onincrease ~~memory.~~in ~~This~~computational ~~strategy~~density. isOne ~~comparable~~block toof ~~[[Multithreading~~control ~~(computer~~logic ~~architecture)\|multithreading~~can inmanage ~~CPUs]]~~N ~~(not~~data tolanes, beinstead ~~confused~~of ~~with~~replicating ~~[[Multi-core~~the ~~processor\|multi-core]]).<ref>{{cite~~control ~~web~~logic ~~\|url=http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/12-advanced_topics_in_cuda.pdf~~N ~~\|title=Advanced Topics in CUDA \|date=2011 \|website=cc~~times.~~gatech.edu \|accessdate=2014-08-28}}</ref>~~ A downside of SIMT execution is the fact that, as there is only one Program Counter, [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication\|"predicate masking"]] is the only strategy to control per-PE execution, leading to poor utilization in complex algorithms. == Terminology == ▲{\| class="wikitable" style="~~style="font-size:80%;~~ text-align: center" \|+ SIMT Terminology ! NVIDIA [[CUDA]] \|\| [[OpenCL]] \|\| Hennessy & Patterson<ref>{{cite book \|author1=John L. Hennessy \|author2=David A. Patterson\|title=Computer Architecture: A Quantitative Approach \|year=2019 \|publisher=Morgan Kaufmann \|edition=6 \|pages=314 ff\|isbn=9780128119051 }}</ref> \|- \| Thread \|\| Work-item \|\| Sequence of SIMD Lane operations \|- \| Warp \|\| ~~[[Graphics_Core_Next#Compute_Units\|Wavefront]]~~Sub-group \|\| Thread of SIMD Instructions \|- \| Block \|\| ~~Workgroup~~ Work-group \|\| Body of vectorized loop \|- \| Grid \|\| NDRange \|\| Vectorized loop \|} NVIDIA GPUs have a concept of the thread group called as "warp" composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is [[Graphics Core Next#Compute units\|"wavefront"]], although it is composed of 64 hardware threads. In OpenCL, it is called as "sub-group" for the abstract term of warp and wavefront. CUDA also has the warp shuffle instructions which make parallel data exchange in the thread group faster,<ref>{{Cite web\|url=https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/\|title=Faster Parallel Reductions on Kepler\|date=February 14, 2014\|website=NVIDIA Technical Blog}}</ref> and OpenCL allows a similar feature support by an extension cl_khr_subgroups.<ref>{{Cite web\|url=https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_khr_subgroups.html\|title=cl_khr_subgroups(3)\|website=registry.khronos.org}}</ref> == See also == * [[General-purpose computing on graphics processing units]] (GPGPU) * [[Thread block (CUDA programming)]] == References == {{Reflist}} {{Graphics Processing Unit}} [[Category:Computer architecture]]▼ [[Category:Classes of computers]] ▲[[Category:Computer architecture]] [[Category:GPGPU]] [[Category:Parallel computing]] [[Category:SIMD computing]] [[Category:Threads (computing)]] ~~{{comp-sci-stub}}~~