'''Single instruction, multiple threads''' ('''SIMT''') is an execution model used in [[parallel computing]] where a single central "Control Unit" broadcasts an instruction to multiple "Processing Units" for them to all ''optionally'' perform simultaneous, synchronous, parallel execution of that one instruction on their own independent data. Each PU has its own independent data and address registers and its own independent memory, but no PU in the array has a [[Program counter]]. In [[Flynn's taxonomy|Flynn's 1972 taxonomy]] this arrangement is a variation of [[SIMD]] termed an '''array processor'''.
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from the ARPA-funded introductory description by Stewart Denenberg, July 15, 1971.<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf}}</ref>]]
The SIMT execution model has been implemented on several [[GPU]]s and is relevant for [[general-purpose computing on graphics processing units]] (GPGPU); for example, some [[supercomputer]]s combine CPUs with GPUs. In the [[ILLIAC IV]] the CPU was a [[Burroughs_Large_Systems#B6500,_B6700/B7700,_and_successors|Burroughs B6500]].
The key difference between SIMT and [[SIMD lanes]] is that each Processing Unit in the SIMT array has its own local memory, and may have a completely different stack pointer (and thus perform computations on completely different data sets), whereas the ALUs in SIMD lanes know nothing about memory per se and have no [[register file]].
This is illustrated by the [[ILLIAC IV]]. Each SIMT core was termed a Processing Element, and each PE had its own separate memory. Each PE had an "Index register", which was an address into its PEM.<ref>https://www.researchgate.net/publication/2992993_The_Illiac_IV_system {{Bare URL inline|date=July 2025}}</ref><ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf}}</ref>
In the [[ILLIAC IV]], the Burroughs B6500 primarily handled I/O, but also sent instructions to the Control Unit (CU), which then broadcast them to the PEs. The B6500, in its role as an I/O processor, also had access to ''all'' PEMs.
Additionally, each PE may be made active or inactive. If a given PE is inactive it will not execute the instruction broadcast to it by the Control Unit: instead it will sit idle until activated. Each PE can be said to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_Vector_Predication|Predicated]].
The SIMT execution model is still only a way to present to the programmer what is fundamentally a predicated SIMD concept, and programs must be designed with predicated SIMD in mind. Because instruction issue (as a synchronous broadcast) is handled by the single Control Unit, SIMT cannot ''by design'' allow threads (PEs, lanes) to diverge by branching: only the Control Unit has a Program Counter. Branching is therefore to be avoided where possible.<ref>{{Cite web | title=SIMT Model - Open Source General-Purpose Computing Chip Platform - Blue Porcelain(GPGPU) | url=https://gpgpuarch.org/en/basic/simt/}}</ref>
<ref>{{Cite web | title=General-Purpose Graphics Processor Architecture - Chapter 3 - The SIMT Core: Instruction and Register Data Flow (Part 1) {{!}} FANnotes | url=https://www.fannotes.me/article/gpgpu_architecture/chapter_3_the_simt_core_instruction_and_register_data_flow_part_1}}</ref>
Also important is the difference between SIMT and [[SPMD]] (Single Program, Multiple Data). SPMD, like standard multi-core systems, has multiple Program Counters, whereas SIMT has only one: in the (single) Control Unit.
==History==
In [[Flynn's taxonomy]], Flynn's original papers cite two historic examples of SIMT processors termed "Array Processors": the [[ILLIAC IV#SOLOMON|SOLOMON]] and [[ILLIAC IV]].<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf}}</ref>
SIMT was introduced by [[Nvidia|NVIDIA]] in the [[Tesla (microarchitecture)|Tesla GPU microarchitecture]] with the G80 chip.<ref>{{cite web |url=http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf |title=NVIDIA Fermi Compute Architecture Whitepaper |date=2009 |website=www.nvidia.com |publisher=NVIDIA Corporation |access-date=2014-07-17}}</ref><ref name=teslaPaper>{{cite journal |title=NVIDIA Tesla: A Unified Graphics and Computing Architecture |date=2008 |page=6 {{subscription required|s}} |doi=10.1109/MM.2008.31 |volume=28 |issue=2 |journal=IEEE Micro|last1=Lindholm |first1=Erik |last2=Nickolls |first2=John |last3=Oberman |first3=Stuart |last4=Montrym |first4=John |s2cid=2793450 }}</ref> [[ATI Technologies]], now [[Advanced Micro Devices|AMD]], released a competing product slightly later on May 14, 2007, the [[TeraScale (microarchitecture)#TeraScale 1|TeraScale 1]]-based ''"R600"'' GPU chip.
=== MIAOW GPU ===
[[File:MIAOW_GPU_diagram.png|thumb|MIAOW GPU and associated Computation Unit block diagram.<ref>{{Cite web | title=Architecture · VerticalResearchGroup/miaow Wiki · GitHub | url=https://github.com/VerticalResearchGroup/miaow/wiki/Architecture}}</ref>]]
The MIAOW Project by the Vertical Research Group is an implementation of the AMDGPU "Southern Islands" architecture.<ref>{{Cite web | title=Vertical Research Group {{!}} Main / Projects | url=https://research.cs.wisc.edu/vertical/wiki/index.php/Main/Projects#miaow}}</ref>
=== GPU Simulator ===
[[File:Vortex microarchitecture.png|thumb|Vortex SIMT GPU Microarchitecture diagram]]
The Vortex GPU is an open-source [[GPGPU]] project by the [[Georgia Institute of Technology]] that runs [[OpenCL]]. Technical details:<ref>{{Cite web | title=vortex/docs/microarchitecture.md at master · vortexgpgpu/vortex · GitHub | url=https://github.com/vortexgpgpu/vortex/blob/master/docs/microarchitecture.md}}</ref>
Note a key defining characteristic of SIMT: the ''PC is shared''. However, note also that time-multiplexing is used, giving the impression that there are more Array Processing Elements than there actually are.