{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{See also|SIMD within a register|Single instruction, multiple threads}}
{{Update|inaccurate=yes|date=March 2017}}
{{Flynn's Taxonomy}}
 
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]
 
'''Single instruction, multiple data''' ('''SIMD''') is a type of [[parallel computing]] (processing) in [[Flynn's taxonomy]]. SIMD describes computers with [[multiple processing elements]] that perform the same operation on multiple data points simultaneously. SIMD can be internal (part of the hardware design) and it can be directly accessible through an [[instruction set architecture]] (ISA), but it should not be confused with an ISA.
 
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is adding many pairs of numbers together: all of the SIMD units perform an addition, but each one operates on a different pair of values. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which minimize latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply fused multiply-add operations (FMA) in a single SIMD cycle.
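For illustration, the pairwise-addition example above can be written as a minimal sketch in C using x86 [[Streaming SIMD Extensions|SSE]] intrinsics; the function name is illustrative and {{code|n}} is assumed to be a multiple of four:
<syntaxhighlight lang="c">
#include <immintrin.h>  /* x86 SSE intrinsics */

/* Add two float arrays element-wise, four lanes per instruction. */
void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats from b */
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction adds all 4 pairs */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }
}
</syntaxhighlight>
Each {{code|_mm_add_ps}} is a single instruction that operates on four data elements at once, which is exactly the pattern described above.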
 
== Confusion between SIMT and SIMD ==
SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT).{{clarify|reason=The term SIMT does not appear in Flynn's 1972 paper. Its three subcategories are the Array Processor, Pipelined Processor, and Associative Processor. Which of these is meant here? Note that Wikipedia is not in itself a "reliable source".|date=June 2025}} SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution. A key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). These allow divergence and convergence of threads, even under shared instruction streams, thereby offering slightly more flexibility than classical SIMD.{{clarify|reason=Is classical SIMD one of the subcategories in Flynn's 1972 paper? If so, which subcategory?|date=July 2025}}
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
Additionally, SIMD can exist in both fixed and scalable vector forms. Fixed-width SIMD units operate on a constant number of data points per instruction, while scalable designs, like RISC-V Vector or ARM's SVE, allow the number of data elements to vary depending on the hardware implementation. This improves forward compatibility across generations of processors.
 
SIMD should not be confused with [[Vector processing]], characterized by the [[Cray 1]] and clarified in [[Duncan's taxonomy]]. The
[[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction.
 
One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These signals ensure that each Processing Element in the entire parallel array is synchronized in its simultaneous execution of the (one, current) broadcast instruction.
 
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on an individual data item is sometimes also referred to as a [[SIMD lane]] or channel. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are invariably wide [[SIMD within a register]] (SWAR) designs and typically have more than 16 channels of such Processing Elements.{{cn|date=July 2025}} SWAR performs concurrent sub-word [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations in different lanes. This is critical for applications like AI inference, where mixed precision boosts throughput.
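The SWAR idea can be sketched in portable C without any vendor intrinsics (the function name is illustrative): four 8-bit lanes packed into one 32-bit word are added in parallel by masking so that carries never cross a lane boundary.
<syntaxhighlight lang="c">
#include <stdint.h>

/* SIMD within a register: add four packed 8-bit lanes of x and y.
   The masks keep carries from propagating across lane borders. */
uint32_t add_bytes_swar(uint32_t x, uint32_t y) {
    uint32_t low  = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu); /* bits 0..6 of each lane, with in-lane carry */
    uint32_t high = (x ^ y) & 0x80808080u;                 /* bit 7 of each lane, added without carry-out */
    return low ^ high;
}
</syntaxhighlight>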
 
==History==
The first known operational use of [[SIMD within a register]] was in the [[TX-2]], in 1958. It was capable of 36-bit operations and of two 18-bit or four 9-bit sub-word operations.

The first commercial use of SIMD instructions was in the [[ILLIAC IV]], which was completed in 1972. This included 64 (of an original design of 256) processors that had local memory to hold different values while performing the same instruction. Separate hardware quickly sent out the values to be processed and gathered up the results.
 
SIMD was the basis for [[vector processor|vector supercomputers]] of the early 1970s such as the [[CDC STAR-100|CDC Star-100]] and the [[TI Advanced Scientific Computer|Texas Instruments ASC]], which could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by [[Cray]] in the 1970s and 1980s. Vector processing architectures are now considered separate from SIMD computers: [[Duncan's Taxonomy]] includes them whereas [[Flynn's Taxonomy]] does not, because Flynn's work (1966, 1972) pre-dates the [[Cray-1]] (1977). The complexity of vector processors, however, inspired a simpler arrangement known as [[SIMD within a register]].
 
The first era of modern SIMD computers was characterized by [[massively parallel processing]]-style [[supercomputer]]s such as the [[Thinking Machines Corporation|Thinking Machines]] [[Connection Machine]] CM-1 and CM-2. These computers had many limited-functionality processors that would work in parallel. For example, each of the 65,536 single-bit processors in a Thinking Machines CM-2 would execute the same instruction at the same time, allowing it, for instance, to logically combine 65,536 pairs of bits at a time, using a hypercube-connected network or processor-dedicated RAM to find its operands. Supercomputing moved away from the SIMD approach when inexpensive scalar [[multiple instruction, multiple data]] (MIMD) approaches based on commodity processors such as the [[Intel i860|Intel i860 XP]] became more powerful, and interest in SIMD waned.<ref>{{cite web|url=http://www.cs.kent.edu/~walker/classes/pdc.f01/lectures/MIMD-1.pdf|title=MIMD1 - XP/S, CM-5}}</ref>
* Programming with given SIMD instruction sets can involve many low-level challenges.
*# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another (see the sketch after this list).
*# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring permute operations) and can be inefficient.
*# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
*# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
*# The early [[MMX (instruction set)|MMX]] instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, [[SSE2]] corrects this.
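A minimal C sketch of challenge 1 (data alignment) using SSE intrinsics, assuming a 16-byte register width; the function and helper names are illustrative:
<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stdlib.h>

/* _mm_load_ps requires a 16-byte-aligned address and may fault otherwise;
   _mm_loadu_ps accepts any address but can be slower on some processors. */
void scale4(float *x, float k) {
    __m128 v = _mm_loadu_ps(x);                       /* safe for any address */
    /* __m128 v = _mm_load_ps(x);                        only if x is 16-byte aligned */
    _mm_storeu_ps(x, _mm_mul_ps(v, _mm_set1_ps(k)));
}

float *make_aligned_buffer(size_t n_floats) {
    /* C11 aligned_alloc: the size must be a multiple of the alignment */
    size_t bytes = ((n_floats * sizeof(float) + 15) / 16) * 16;
    return aligned_alloc(16, bytes);
}
</syntaxhighlight>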
 
To remedy problems 1 and 5, Cray-style [[vector processor]]s use an alternative approach: instead of exposing the sub-register-level details directly to the programmer, the instruction set abstracts at least the length (the number of elements) into a runtime control register, usually named "VL" (Vector Length). The hardware then handles all alignment issues and "strip-mining" of loops. Machines with different vector sizes would be able to run the same code. LLVM calls this vector type "{{not a typo|vscale}}".{{citation needed|date=June 2021}}
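The following scalar C sketch models what "strip-mining" means; the hypothetical constant {{code|VLMAX}} stands in for the hardware's maximum vector length, and the inner loop models one vector instruction whose length is set at run time:
<syntaxhighlight lang="c">
#include <stddef.h>

enum { VLMAX = 8 };   /* stand-in for the hardware's maximum vector length */

void vec_add(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    while (i < n) {
        /* "set vector length": take the remaining elements, up to VLMAX */
        size_t vl = (n - i < VLMAX) ? (n - i) : VLMAX;
        for (size_t j = 0; j < vl; ++j)    /* models one vector instruction of length vl */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
</syntaxhighlight>
On a Cray-style machine the inner loop is a single instruction and {{code|vl}} is held in the VL register, so the same code handles any array length without a separate scalar tail loop.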
 
With SIMD, an order of magnitude increase in code size is not uncommon when compared to equivalent scalar or equivalent vector code, and an order of magnitude ''or greater'' effectiveness (work done per instruction) is achievable with vector ISAs.<ref>{{cite web |last1=Patterson |first1=David |last2=Waterman |first2=Andrew |title=SIMD Instructions Considered Harmful |url=https://www.sigarch.org/simd-instructions-considered-harmful/ |website=SIGARCH |date=18 September 2017}}</ref>
 
ARM's [[Scalable Vector Extension]] takes another approach, known in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's Taxonomy]] as "Associative Processing", more commonly known today as [[Predication (computer architecture)#SIMD, SIMT and vector predication|"Predicated" (masked)]] SIMD. This approach is not as compact as [[vector processing]] but is still far better than non-predicated SIMD. Detailed comparative examples are given at {{section link|Vector processor|Vector instruction example}}. In addition, all versions of the ARM architecture have offered load and store multiple instructions, which load or store a block of data from a contiguous block of memory into a range or non-contiguous set of registers.<ref>{{Cite web |title=ARM LDR/STR, LDM/STM instructions - Programmer All |url=https://programmerall.com/article/2483661565/ |access-date=2025-04-19 |website=programmerall.com}}</ref>
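A minimal sketch of a predicated, vector-length-agnostic loop using the Arm C Language Extensions (ACLE) for SVE; the function name is illustrative:
<syntaxhighlight lang="c">
#include <arm_sve.h>   /* ACLE intrinsics for the Scalable Vector Extension */
#include <stdint.h>

/* The same binary runs unchanged on hardware with any SVE vector length;
   the predicate pg masks off lanes past the end of the array. */
void add_arrays_sve(const float *a, const float *b, float *c, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {   /* svcntw(): 32-bit lanes per vector */
        svbool_t    pg = svwhilelt_b32_s64(i, n);          /* true for lanes where i + lane < n */
        svfloat32_t va = svld1_f32(pg, &a[i]);             /* predicated (masked) loads */
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &c[i], svadd_f32_m(pg, va, vb));     /* predicated add and store */
    }
}
</syntaxhighlight>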
 
==Chronology==
! Year !! Example
|-
| 1974 || [[ILLIAC IV]] - an Array Processor comprising scalar 64-bit PEs
|-
| 1974 || [[ICL Distributed Array Processor]] (DAP)
* Library multi-versioning (LMV): the entire [[Library (computing)|programming library]] is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.
 
FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. [[Intel C++ Compiler]], [[GNU Compiler Collection]] since GCC 6, and [[Clang]] since clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and Clang require explicit {{code|target_clones}} labels in the code to "clone" functions,<ref>{{cite web |title=Function multi-versioning in GCC 6 |url=https://lwn.net/Articles/691932/ |website=lwn.net |date=22 June 2016 }}</ref> while ICC does so automatically (under the command-line option {{code|/Qax}}). The [[Rust programming language]] also supports FMV. The setup is similar to GCC and Clang in that the code defines which instruction sets to compile for, but cloning is done manually via inlining.<ref>{{cite web |title=2045-target-feature |url= https://rust-lang.github.io/rfcs/2045-target-feature.html |website=The Rust RFC Book}}</ref>
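For illustration, a minimal FMV sketch in C with the GCC/Clang {{code|target_clones}} attribute (the function body is illustrative, and the listed ISA names must be supported by the compiler in use):
<syntaxhighlight lang="c">
/* The compiler emits one clone of the function per listed target plus a
   resolver that picks the best available clone at program start-up. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *x, int n, float k) {
    for (int i = 0; i < n; ++i)   /* each clone is auto-vectorized for its ISA */
        x[i] *= k;
}
</syntaxhighlight>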
 
As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed. [[Glibc]] supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.<ref name=clear>{{cite web |title=Transparent use of library packages optimized for Intel® architecture |url=https://clearlinux.org/news-blogs/transparent-use-library-packages-optimized-intel-architecture |website=Clear Linux* Project |access-date=8 September 2019 |language=en}}</ref>
Instances of these types are immutable and in optimized code are mapped directly to SIMD registers. Operations expressed in Dart typically are compiled into a single instruction without any overhead. This is similar to C and C++ intrinsics. Benchmarks for [[4×4 matrix|4×4]] [[matrix multiplication]], [[3D vertex transformation]], and [[Mandelbrot set]] visualization show near 400% speedup compared to scalar code written in Dart.
 
Intel announced at IDF 2013 that they were implementing McCutchan's specification for both [[V8 (JavaScript engine)|V8]] and [[SpiderMonkey]].<ref>{{cite web |title=SIMD in JavaScript |url=https://01.org/node/1495 |website=01.org |date=8 May 2014}}</ref> However, by 2017, SIMD.js was taken out of the [[ECMAScript]] standard queue in favor of pursuing a similar interface in [[WebAssembly]].<ref>{{cite web |title=tc39/ecmascript_simd: SIMD numeric type for EcmaScript. |url=https://github.com/tc39/ecmascript_simd/ |website=GitHub |publisher=Ecma TC39 |access-date=8 September 2019 |date=22 August 2019}}</ref> Support for SIMD was added to the WebAssembly 2.0 specification, which was finished in 2022 and became official in December 2024.<ref>{{cite web |url=https://webassembly.org/news/2025-03-20-wasm-2.0/ |title=Wasm 2.0 Completed - WebAssembly}}</ref> LLVM's autovectorizer, when compiling C or C++ to WebAssembly, can target WebAssembly SIMD to make use of SIMD automatically, while SIMD intrinsics are also available.<ref>{{cite web |title=Using SIMD with WebAssembly |url=https://emscripten.org/docs/porting/simd.html |work=Emscripten 4.0.11-git (dev) documentation}}</ref>
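A minimal sketch of the intrinsics route when compiling C to WebAssembly with Emscripten or Clang (built with {{code|-msimd128}}; the function name and the multiple-of-four assumption are illustrative):
<syntaxhighlight lang="c">
#include <wasm_simd128.h>   /* WebAssembly 128-bit SIMD intrinsics */

/* Add two float arrays element-wise, four lanes per v128 operation. */
void add_arrays_wasm(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        v128_t va = wasm_v128_load(&a[i]);
        v128_t vb = wasm_v128_load(&b[i]);
        wasm_v128_store(&c[i], wasm_f32x4_add(va, vb));
    }
}
</syntaxhighlight>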
 
==Commercial applications==