Single instruction, multiple threads

This is an old revision of this page, as edited by Wootery (talk | contribs) at 22:58, 15 March 2015 (Improved discussion of masking). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Single instruction, multiple threads (SIMT) is a parallel execution model, used in some GPGPU platforms, where multithreading is simulated by SIMD processors. The processors, say a number p of them, appear to execute many more than p tasks. The threads (or tasks) are in fact partitioned into blocks that map onto the processors, and each block executes its tasks in lock-step.[1]
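The partitioning described above can be sketched in Python as an illustrative model (not from the article; the names `simt_map` and `P` are invented for this sketch): p lanes process a much longer task list by consuming it in lock-step blocks of p.

```python
# Hypothetical sketch: p SIMD lanes ("processors") execute many more
# than p tasks by partitioning the tasks into blocks of size p.

P = 4  # number of SIMD lanes

def simt_map(f, data, p=P):
    """Apply f to every element, one block of up to p tasks at a time."""
    results = []
    for start in range(0, len(data), p):
        block = data[start:start + p]   # one block maps onto the p lanes
        # All lanes in the block execute the same instruction (f) in lock-step.
        results.extend(f(x) for x in block)
    return results

print(simt_map(lambda x: x * x, list(range(10))))
# 10 tasks run on 4 lanes as blocks [0..3], [4..7], [8..9]
```

From the caller's point of view the 10 tasks appear to run as 10 threads, even though only 4 lanes exist, which is the sense in which the processors "appear to execute many more than p tasks".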

SIMT was introduced by Nvidia:[2][3]

[The G80 Nvidia GPU architecture] introduced the single-instruction multiple-thread (SIMT) execution model where multiple independent threads execute concurrently using a single instruction.

SIMT is intended to limit instruction fetching overhead,[4] and is used in modern GPUs (including, but not limited to, those of Nvidia and AMD) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations.[5]

A downside of SIMT execution is that thread-specific control flow must be performed using "masking", leading to poor utilisation where control flow is not coherent across the threads of a processor. For instance, to handle an if-else block where different threads of a processor take different paths, all threads must actually process both paths (since all threads of a processor always execute in lock-step), but masking is used to enable and disable the threads as appropriate. This masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.[1]
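The masking of an if-else block can be sketched in Python as an illustrative model (not from the article; `simt_if_else` and the chosen condition are invented for this sketch): every lane steps through both paths in lock-step, and a per-lane mask decides which lanes commit results on each path.

```python
# Hypothetical sketch: lock-step execution of an if-else under masking.
# All lanes traverse BOTH paths; the mask enables/disables each lane's writes.

def simt_if_else(values):
    n = len(values)
    out = [None] * n
    mask = [v % 2 == 0 for v in values]   # per-lane branch condition

    # "if" path: every lane executes it, but only masked-on lanes commit.
    for i in range(n):
        if mask[i]:
            out[i] = values[i] // 2

    # "else" path: the mask is inverted; the remaining lanes commit.
    for i in range(n):
        if not mask[i]:
            out[i] = 3 * values[i] + 1

    return out

print(simt_if_else([1, 2, 3, 4]))  # [4, 1, 10, 2]
```

Note that both loops run to completion regardless of the mask, which models the utilisation loss: when the branch condition diverges across lanes, the processor pays for both paths.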

References

  1. ^ a b Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. p. 52.
  2. ^ "Nvidia Fermi Compute Architecture Whitepaper" (PDF). nvidia.com. NVIDIA Corporation. 2009. Retrieved 2014-07-17.
  3. ^ "NVIDIA Tesla: A Unified Graphics and Computing Architecture". ieee.org. IEEE. 2008. p. 6 (subscription required). Retrieved 2014-08-07.
  4. ^ Rul, Sean; Vandierendonck, Hans; D'Haene, Joris; De Bosschere, Koen (2010). An experimental study on performance portability of OpenCL kernels. Symp. Application Accelerators in High Performance Computing (SAAHPC).
  5. ^ "Advanced Topics in CUDA" (PDF). cc.gatech.edu. 2011. Retrieved 2014-08-28.