Single instruction, multiple threads

'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] in which a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" (PUs), each of which ''optionally'' executes that one instruction simultaneously and synchronously with, yet independently of, all the others. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[Program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
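
For illustration, the following is a minimal sketch in CUDA, a contemporary SIMT implementation (the kernel name <code>flip_negatives</code> and its launch parameters are hypothetical), of how each thread executes the broadcast instruction stream only "optionally": threads whose predicate is false are masked off rather than following a separate control path.

<syntaxhighlight lang="cuda">
// Every thread receives the same instruction stream, but a per-thread
// predicate decides whether a given thread actually performs the work.
// Threads for which the condition is false are masked off; they do not
// branch to different code, matching the "optional" execution above.
__global__ void flip_negatives(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // each thread selects its own element
    if (tid < n && data[tid] < 0.0f)                  // per-thread predicate
        data[tid] = -data[tid];                       // executed only where the predicate holds
}
// A possible launch: flip_negatives<<<(n + 255) / 256, 256>>>(data, n);
</syntaxhighlight>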
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971.<ref name="auto">{{Cite web| title=An introductory description of the Illiac IV system | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]], that CPU was a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
 
The key difference between SIMT and [[SIMD lanes]] is that each processing unit in the SIMT array has its own local memory and may hold a completely different stack pointer (and thus perform computations on a completely different data set), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a Processing Element (PE), and each PE had its own separate memory (PEM). Each PE had an "index register" which was an address into its PEM.<ref>{{Cite web|url=https://www.researchgate.net/publication/2992993_The_Illiac_IV_system|title=The Illiac IV system}}</ref><ref name="auto"/>
In the [[ILLIAC IV]], the Burroughs B6500 primarily handled I/O, but it also sent instructions to the Control Unit (CU), which then broadcast them to the PEs. Additionally, the B6500, in its role as an I/O processor, had access to ''all'' PEMs.
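
The distinction can be made concrete with a sketch in CUDA, the best-known modern SIMT model (the kernel name <code>gather</code> is hypothetical): each thread computes its own address, so a single broadcast load instruction fetches from many unrelated locations, something a pure SIMD lane, which has no per-lane address register, cannot do by itself.

<syntaxhighlight lang="cuda">
// One broadcast instruction, many independent addresses: the modern
// analogue of each ILLIAC IV PE indexing into its own PEM.
__global__ void gather(const float* table, const int* idx, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // plays the role of the PE's index register
    if (tid < n)
        out[tid] = table[idx[tid]];  // each thread loads from its own data-dependent address
}
// A possible launch: gather<<<(n + 255) / 256, 256>>>(table, idx, out, n);
</syntaxhighlight>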
 
==History==
 
In [[Flynn's taxonomy]], Flynn's original papers cite two historic examples of SIMT processors termed "Array Processors": the [[ILLIAC IV#SOLOMON|SOLOMON]] and [[ILLIAC&nbsp;IV]].<ref name="auto"/>
SIMT was introduced by [[Nvidia|NVIDIA]] in the [[Tesla (microarchitecture)|Tesla GPU microarchitecture]] with the G80 chip.<ref>{{cite web |url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf |title=NVIDIA Fermi Compute Architecture Whitepaper |date=2009 |website=www.nvidia.com |publisher=NVIDIA Corporation |access-date=2014-07-17}}</ref><ref name=teslaPaper>{{cite journal |title=NVIDIA Tesla: A Unified Graphics and Computing Architecture |date=2008 |page=6 {{subscription required}} |doi=10.1109/MM.2008.31 |volume=28 |issue=2 |journal=IEEE Micro |last1=Lindholm |first1=Erik |last2=Nickolls |first2=John |last3=Oberman |first3=Stuart |last4=Montrym |first4=John |bibcode=2008IMicr..28b..39L |s2cid=2793450}}</ref> [[ATI Technologies]], now [[Advanced Micro Devices|AMD]], released a competing product slightly later, on May 14, 2007: the [[TeraScale (microarchitecture)#TeraScale 1|TeraScale 1]]-based ''"R600"'' GPU chip.
 
 
NVIDIA GPUs have a concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the [[Graphics Core Next#Compute units|"wavefront"]], although it is composed of 64 hardware threads. OpenCL uses the term "sub-group" as the abstraction covering both warps and wavefronts. CUDA also has warp shuffle instructions, which make parallel data exchange within the thread group faster,<ref>{{Cite web|url=https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/|title=Faster Parallel Reductions on Kepler|date=February 14, 2014|website=NVIDIA Technical Blog}}</ref> and OpenCL supports a similar feature through the extension cl_khr_subgroups.<ref>{{Cite web|url=https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/cl_khr_subgroups.html|title=cl_khr_subgroups(3) Manual Page|website=registry.khronos.org}}</ref>
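
As a minimal sketch of what a warp shuffle enables (assuming CUDA 9 or later, where the <code>__shfl_down_sync</code> intrinsic is available; the kernel name is hypothetical), the following kernel sums one value per thread across each 32-thread warp entirely through registers, with no shared-memory staging or block-level barrier:

<syntaxhighlight lang="cuda">
#include <cstdio>

// Sum-reduce one value per thread across a 32-thread warp.
// Each shuffle step halves the number of live partial sums;
// after five steps, lane 0 holds the total for its warp.
__global__ void warpReduceSum() {
    int lane = threadIdx.x % 32;
    int val  = lane;                       // per-thread input (0..31 here)
    unsigned mask = 0xffffffffu;           // all 32 lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);  // read "val" from the lane "offset" positions away
    if (lane == 0)
        printf("warp %d sum = %d\n", (int)(threadIdx.x / 32), val);  // prints 496 = 0+1+...+31
}

int main() {
    warpReduceSum<<<1, 64>>>();            // one block, two warps
    cudaDeviceSynchronize();
    return 0;
}
</syntaxhighlight>

Because the exchange happens lane-to-lane in registers, the data never leaves the warp, which is what makes shuffle-based reductions faster than equivalents staged through shared memory.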
 
== Open hardware SIMT processors ==
=== GPU Simulator ===
 
A simulator of a SIMT architecture, GPGPU-Sim, was developed at the [[University of British Columbia]] by Tor Aamodt and his graduate students.<ref>{{Cite web|url=http://gpgpu-sim.org/|title=GPGPU-Sim|website=gpgpu-sim.org}}</ref>
 
=== Vortex GPU ===