Single instruction, multiple threads

'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] in which a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" (PUs), each of which ''optionally'' executes that one instruction simultaneously and synchronously with, yet independently of, all the others. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[Program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
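
For illustration, the following is a minimal sketch in CUDA, a contemporary SIMT implementation (the kernel name <code>flip_negatives</code> and its launch parameters are hypothetical), of how each thread executes the broadcast instruction stream only "optionally": threads whose predicate is false are masked off rather than following a separate control path.

<syntaxhighlight lang="cuda">
// Every thread receives the same instruction stream, but a per-thread
// predicate decides whether a given thread actually performs the work.
// Threads for which the condition is false are masked off; they do not
// branch to different code, matching the "optional" execution above.
__global__ void flip_negatives(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // each thread selects its own element
    if (tid < n && data[tid] < 0.0f)                  // per-thread predicate
        data[tid] = -data[tid];                       // executed only where the predicate holds
}
// A possible launch: flip_negatives<<<(n + 255) / 256, 256>>>(data, n);
</syntaxhighlight>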
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971.<ref name="auto">{{Cite web| title=An introductory description of the Illiac IV system | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]], that CPU was a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
 
The key difference between SIMT and [[SIMD lanes]] is that each processing unit in the SIMT array has its own local memory and may hold a completely different stack pointer (and thus perform computations on a completely different data set), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a Processing Element (PE), and each PE had its own separate memory (PEM). Each PE had an "index register" which was an address into its PEM.<ref>{{Cite web|url=https://www.researchgate.net/publication/2992993_The_Illiac_IV_system|title=The Illiac IV system}}</ref><ref name="auto"/>
In the [[ILLIAC IV]], the Burroughs B6500 primarily handled I/O, but it also sent instructions to the Control Unit (CU), which then broadcast them to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.
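
The distinction can be made concrete with a sketch in CUDA, the best-known modern SIMT model (the kernel name <code>gather</code> is hypothetical): each thread computes its own address, so a single broadcast load instruction fetches from many unrelated locations, something a pure SIMD lane, which has no per-lane address register, cannot do by itself.

<syntaxhighlight lang="cuda">
// One broadcast instruction, many independent addresses: the modern
// analogue of each ILLIAC IV PE indexing into its own PEM.
__global__ void gather(const float* table, const int* idx, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // plays the role of the PE's index register
    if (tid < n)
        out[tid] = table[idx[tid]];  // each thread loads from its own data-dependent address
}
// A possible launch: gather<<<(n + 255) / 256, 256>>>(table, idx, out, n);
</syntaxhighlight>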
 
==History==
 
In [[Flynn's taxonomy]], Flynn's original papers cite two historic examples of SIMT processors termed "Array Processors": the [[ILLIAC IV#SOLOMON|SOLOMON]] and [[ILLIAC&nbsp;IV]].<ref name="auto"/>
SIMT was introduced by [[Nvidia|NVIDIA]] in the [[Tesla (microarchitecture)|Tesla GPU microarchitecture]] with the G80 chip.<ref>{{cite web |url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf |title=NVIDIA Fermi Compute Architecture Whitepaper |date=2009 |website=www.nvidia.com |publisher=NVIDIA Corporation |access-date=2014-07-17}}</ref><ref name=teslaPaper>{{cite journal |title=NVIDIA Tesla: A Unified Graphics and Computing Architecture |date=2008 |page=6 {{subscription required}} |doi=10.1109/MM.2008.31 |volume=28 |issue=2 |journal=IEEE Micro |last1=Lindholm |first1=Erik |last2=Nickolls |first2=John |last3=Oberman |first3=Stuart |last4=Montrym |first4=John |bibcode=2008IMicr..28b..39L |s2cid=2793450}}</ref> [[ATI Technologies]], now [[Advanced Micro Devices|AMD]], released a competing product slightly later, on May 14, 2007: the [[TeraScale (microarchitecture)#TeraScale 1|TeraScale 1]]-based ''"R600"'' GPU chip.
 
 
NVIDIA GPUs have a concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the [[Graphics Core Next#Compute units|"wavefront"]], although it is composed of 64 hardware threads. OpenCL uses the term "sub-group" as the abstraction covering both warps and wavefronts. CUDA also has warp shuffle instructions, which make parallel data exchange within the thread group faster,<ref>{{Cite web|url=https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/|title=Faster Parallel Reductions on Kepler|date=February 14, 2014|website=NVIDIA Technical Blog}}</ref> and OpenCL supports a similar feature through the extension cl_khr_subgroups.<ref>{{Cite web|url=https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_khr_subgroups.html|title=cl_khr_subgroups(3) Manual Page|website=registry.khronos.org}}</ref>
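
As a minimal sketch of what a warp shuffle enables (assuming CUDA 9 or later, where the <code>__shfl_down_sync</code> intrinsic is available; the kernel name is hypothetical), the following kernel sums one value per thread across each 32-thread warp entirely through registers, with no shared-memory staging or block-level barrier:

<syntaxhighlight lang="cuda">
#include <cstdio>

// Sum-reduce one value per thread across a 32-thread warp.
// Each shuffle step halves the number of live partial sums;
// after five steps, lane 0 holds the total for its warp.
__global__ void warpReduceSum() {
    int lane = threadIdx.x % 32;
    int val  = lane;                       // per-thread input (0..31 here)
    unsigned mask = 0xffffffffu;           // all 32 lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);  // read "val" from the lane "offset" positions away
    if (lane == 0)
        printf("warp %d sum = %d\n", (int)(threadIdx.x / 32), val);  // prints 496 = 0+1+...+31
}

int main() {
    warpReduceSum<<<1, 64>>>();            // one block, two warps
    cudaDeviceSynchronize();
    return 0;
}
</syntaxhighlight>

Because the exchange happens lane-to-lane in registers, the data never leaves the warp, which is what makes shuffle-based reductions faster than equivalents staged through shared memory.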
 
== Open hardware SIMT processors ==
=== GPU Simulator ===
 
A simulator of a SIMT architecture, GPGPU-Sim, was developed at the [[University of British Columbia]] by Tor Aamodt and his graduate students.<ref>{{Cite web|url=http://gpgpu-sim.org/|title=GPGPU-Sim|website=gpgpu-sim.org}}</ref>
 
=== Vortex GPU ===