Revision as of 16:40, 17 August 2025: Maxeto0910 (talk \| contribs), Extended confirmed users, 116,764 edits. Summary: no sentence. Tag: Visual edit
Revision as of 15:47, 24 August 2025: Kvng (talk \| contribs), Extended confirmed users, New page reviewers, 115,948 edits. Summary: shorten SD. caps. acro def and use. simplify link. incorporate paren.
Line 1:
{{Short description\|Parallel computing execution model}}
~~{{Short description\|Parallel Execution model which works simultaneously on arrays of several numbers}}~~'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] where a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" for them to all ''optionally'' perform simultaneous synchronous and fully-independent parallel execution of that one instruction. Each PU has its own independent data and address registers, its own independent Memory, but no PU in the array has a [[Program counter]]. In [[Flynn's taxonomy\|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
[[Image:ILLIAC_IV.jpg\|thumb\|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971<ref name="auto">{{Cite web\| title=An introductory description of the Illiac IV system \| url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf \| archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf \| archive-date=2024-04-27}}</ref>]]

Line 13 ⟶ 15:

The key difference between SIMT and [[SIMD lanes]] is that each of the Processing Units in the SIMT Array have their own local memory, and may have a completely different Stack Pointer (and thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se, and have no [[register file]]. This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a ~~Processing~~processing ~~Element~~element (PE), and each PE had its own separate Memory (PEM). Each PE had an "Index register" which was an address into its PEM.<ref>{{Cite web\|url=https://www.researchgate.net/publication/2992993\|title=The Illiac IV system}}</ref><ref name="auto"/>

In the [[ILLIAC IV]] the Burroughs B6500 primarily handled I/O, but also sent instructions to the Control Unit (CU), which would then handle the broadcasting to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.

Additionally, each PE may be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication\|Predicated]].
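The broadcast-and-predicate behaviour described in this hunk can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not a description of the ILLIAC IV instruction set: the names `ProcessingElement`, `broadcast`, `set_mask`, `load`, `double`, `store` and `run_demo` are all hypothetical, and each PE is reduced to a private memory, an index register, an accumulator and an active flag. The point it tries to show is that only the control unit sequences instructions, while each PE works on its own memory at its own index and simply skips the instruction when it is inactive.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessingElement:
    """Hypothetical PE: private memory, index register, accumulator, mode bit."""
    memory: List[int]      # this PE's own memory (its "PEM")
    index: int = 0         # index register: an address into this PE's memory
    acc: int = 0           # accumulator
    active: bool = True    # predicate bit: an inactive PE ignores broadcasts


def broadcast(pes: List[ProcessingElement],
              instruction: Callable[[ProcessingElement], None]) -> None:
    """Control unit: send one instruction to every PE; only active PEs execute it."""
    for pe in pes:
        if pe.active:
            instruction(pe)


def set_mask(pes: List[ProcessingElement],
             predicate: Callable[[ProcessingElement], bool]) -> None:
    """Set each PE's active flag from a per-PE test (stands in for a 'branch')."""
    for pe in pes:
        pe.active = predicate(pe)


# Three illustrative instructions. None of them touches a program counter,
# because the PEs in this model do not have one.
def load(pe: ProcessingElement) -> None:
    pe.acc = pe.memory[pe.index]


def double(pe: ProcessingElement) -> None:
    pe.acc *= 2


def store(pe: ProcessingElement) -> None:
    pe.memory[pe.index] = pe.acc


def run_demo() -> List[List[int]]:
    # Four PEs, each with different data and a different index register value.
    pes = [ProcessingElement(memory=[i, 10 * i], index=i % 2) for i in range(4)]
    broadcast(pes, load)                      # every PE loads from its own address
    set_mask(pes, lambda pe: pe.acc >= 10)    # only PEs holding a value >= 10 stay active
    broadcast(pes, double)                    # inactive PEs sit idle for this instruction
    set_mask(pes, lambda pe: True)            # reactivate all PEs
    broadcast(pes, store)
    return [pe.memory for pe in pes]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch a data-dependent "branch" never changes which instruction comes next. It only changes which PEs take part in it, which is the sense in which each PE can be called predicated.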
Line 31 ⟶ 33:

The [[ILLIAC IV]] as the world's first known SIMT processor had its [[ILLIAC_IV#Branches\|"branching"]] mechanism extensively documented, however fascinatingly it turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication\|"predicate masking"]] in modern terminology.

As access time of all the widespread [[random-access memory\|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]], etc.) is still relatively high, creating an effect called the [[~~Random-access_memory#Memory_wall\|Memory~~memory wall]], engineers came up with the idea to hide the latency that inevitably comes with each memory access. As shown in the design of the ILLIAC IV, the individual ~~Processing Elements~~PEs run at a slower clock rate than a standard CPU, but make up for the "lack" of clock rate by running massively more such PEs in parallel. The upshot is that each PE's (slower) speed is better matched to the speed of RAM. The strategy works due to GPU workloads being inherently parallel, and an example is [[~~Tiled~~tiled rendering]].

SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference \|first1=Sean \|last1=Rul \|first2=Hans \|last2=Vandierendonck \|first3=Joris \|last3=D’Haene \|first4=Koen \|last4=De Bosschere \|title=An experimental study on performance portability of OpenCL kernels \|year=2010 \|conference=Symp. Application Accelerators in High Performance Computing (SAAHPC)\|hdl=1854/LU-1016024 \|hdl-access=free }}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of [[Nvidia\|NVIDIA]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. As with SIMD, another major benefit is the sharing of the control logic by many data lanes, leading to an increase in computational density. One block of control logic can manage N data lanes, instead of replicating the control logic N times.

A downside of SIMT execution is the fact that, as there is only one Program Counter, [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication\|"predicate masking"]] is the only strategy to control per-~~Processing Element~~PE execution, leading to poor utilization in complex algorithms.

== Terminology ==

Single instruction, multiple threads: Difference between revisions