{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{See also|SIMD within a register|Single instruction, multiple threads}}
{{Update|inaccurate=yes|date=March 2017}}
{{Flynn's Taxonomy}}
 
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]
 
'''Single instruction, multiple data''' ('''SIMD''') is a type of [[parallel computing]] (processing) in [[Flynn's taxonomy]]. SIMD describes computers with [[multiple processing elements]] that perform the same operation on multiple data points simultaneously. SIMD can be internal (part of the hardware design) and it can be directly accessible through an [[instruction set architecture]] (ISA), but it should not be confused with an ISA.
 
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is adding many pairs of numbers together: all of the SIMD units perform an addition, but each one operates on a different pair of values. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which minimize latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply fused multiply-add (FMA) operations in a single SIMD cycle.
 
== Confusion between SIMT and SIMD ==
SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT). SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution, such as in the [[ILLIAC IV]].
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15, 1971<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
Additionally, SIMD can exist in both fixed and scalable vector forms. Fixed-width SIMD units operate on a constant number of data points per instruction, while scalable designs, such as the [[RISC-V]] Vector extension or ARM's [[Scalable Vector Extension]] (SVE), allow the number of data elements to vary depending on the hardware implementation. This improves forward compatibility across generations of processors.
 
SIMD should not be confused with [[vector processing]], characterized by the [[Cray 1]] and clarified in [[Duncan's taxonomy]]. The [[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction.
 
One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These signals ensure that each Processing Element in the entire parallel array is synchronized in its simultaneous execution of the (one, current) broadcast instruction.
 
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on an individual data item is sometimes also referred to as a [[SIMD lane]] or channel. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are invariably wide [[SIMD within a register]] (SWAR) designs, typically with more than 16 channels of such Processing Elements.{{cn|date=July 2025}} SWAR performs concurrent sub-word [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations in different lanes, which is critical for applications such as AI inference, where mixed precision boosts throughput.
 
==History==
The earliest known operational use of [[SIMD within a register]] was the [[TX-2]], in 1958. It was capable of 36-bit operations and two 18-bit or four 9-bit sub-word operations.

The first commercial use of SIMD instructions was in the [[ILLIAC IV]], which was completed in 1972. This included 64 (of an original design of 256) processors that had local memory to hold different values while performing the same instruction. Separate hardware quickly sent out the values to be processed and gathered up the results.
 
SIMD was the basis for [[vector processor|vector supercomputers]] of the early 1970s such as the [[CDC STAR-100|CDC Star-100]] and the [[TI Advanced Scientific Computer|Texas Instruments ASC]], which could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by [[Cray]] in the 1970s and 1980s. Vector processing architectures are now considered separate from SIMD computers: [[Duncan's Taxonomy]] includes them whereas [[Flynn's Taxonomy]] does not, because Flynn's work (1966, 1972) pre-dates the [[Cray-1]] (1977). The complexity of vector processors, however, inspired a simpler arrangement known as [[SIMD within a register]].
 
The first era of modern SIMD computers was characterized by [[massively parallel processing]]-style [[supercomputer]]s such as the [[Thinking Machines Corporation|Thinking Machines]] [[Connection Machine]] CM-1 and CM-2. These computers had many limited-functionality processors that would work in parallel. For example, each of 65,536 single-bit processors in a Thinking Machines CM-2 would execute the same instruction at the same time, allowing, for instance, 65,536 pairs of bits to be logically combined at a time, using a hypercube-connected network or processor-dedicated RAM to find operands. Supercomputing moved away from the SIMD approach when inexpensive scalar [[multiple instruction, multiple data]] (MIMD) approaches based on commodity processors such as the [[Intel i860|Intel i860 XP]] became more powerful, and interest in SIMD waned.<ref>{{cite web|url=http://www.cs.kent.edu/~walker/classes/pdc.f01/lectures/MIMD-1.pdf|title=MIMD1 - XP/S, CM-5}}</ref>
 
The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during the 1990s, demand grew for this type of computing power, and microprocessor vendors turned to SIMD to meet the demand.<ref name="conte">{{cite conference |title=The long and winding road to high-performance image processing with MMX/SSE |first1=G. |last1=Conte |first2=S. |last2=Tommesani |first3=F. |last3=Zanichelli |book-title=Proc. Fifth IEEE Int'l Workshop on Computer Architectures for Machine Perception |year=2000 |doi=10.1109/CAMP.2000.875989 |s2cid=13180531 |hdl=11381/2297671}}</ref> This resurgence also coincided with the rise of [[DirectX]] and OpenGL shader models, which heavily leveraged SIMD under the hood. The graphics APIs encouraged programmers to adopt data-parallel programming styles, indirectly accelerating SIMD adoption in desktop software. Hewlett-Packard introduced [[Multimedia Acceleration eXtensions]] (MAX) instructions into [[PA-RISC]] 1.1 desktops in 1994 to accelerate MPEG decoding.<ref>{{cite book |first=R.B. |last=Lee |chapter=Realtime MPEG video via software decompression on a PA-RISC processor |title=digest of papers Compcon '95. Technologies for the Information Superhighway |year=1995 |pages=186–192 |doi=10.1109/CMPCON.1995.512384 |isbn=0-8186-7029-0|s2cid=2262046}}</ref> Sun Microsystems introduced SIMD integer instructions in its "[[Visual Instruction Set|VIS]]" instruction set extensions in 1995, in its [[UltraSPARC|UltraSPARC I]] microprocessor. MIPS followed suit with their similar [[MDMX]] system.
 
==Advantages==
An application that may take advantage of SIMD is one where the same value is being added to (or subtracted from) a large number of data points, a common operation in many [[multimedia]] applications. One example would be changing the brightness of an image. Each [[pixel]] of an image consists of three values for the brightness of the red (R), green (G) and blue (B) portions of the color. To change the brightness, the R, G and B values are read from memory, a value is added to (or subtracted from) them, and the resulting values are written back out to memory. Audio [[digital signal processor]]s (DSPs) would likewise, for volume control, multiply both the Left and Right channels simultaneously.
 
With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD processor will have a single instruction that effectively says "retrieve n pixels" (where n is a number that varies from design to design). For a variety of reasons, this can take much less time than retrieving each pixel individually, as with a traditional CPU design. Moreover, SIMD instructions can exploit data reuse, where the same operand is used across multiple calculations, via broadcasting features. For example, multiplying several pixels by a constant scalar value can be done more efficiently by loading the scalar once and broadcasting it across a SIMD register.
* Programming with given SIMD instruction sets can involve many low-level challenges.
*# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
*# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring permute instructions) and can be inefficient.
*# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
*# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
*# The early [[MMX (instruction set)|MMX]] instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, [[SSE2]] corrects this.
 
To remedy problems 1 and 5, Cray-style [[vector processor]]s use an alternative approach: instead of exposing the sub-register-level details directly to the programmer, the instruction set abstracts out at least the vector length (the number of elements) into a runtime control register, usually named "VL" (Vector Length). The hardware then handles all alignment issues and the "strip-mining" of loops. Machines with different vector sizes are able to run the same code. LLVM calls this vector type "{{not a typo|vscale}}".{{citation needed|date=June 2021}}
 
With SIMD, an order of magnitude increase in code size is not uncommon when compared to equivalent scalar or equivalent vector code, and an order of magnitude ''or greater'' effectiveness (work done per instruction) is achievable with vector ISAs.<ref>{{cite web |last1=Patterson |first1=David |last2=Waterman |first2=Andrew |title=SIMD Instructions Considered Harmful |url=https://www.sigarch.org/simd-instructions-considered-harmful/ |website=SIGARCH |date=18 September 2017}}</ref>
 
ARM's [[Scalable Vector Extension]] takes another approach, known in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's Taxonomy]] as "Associative Processing", more commonly known today as [[Predication (computer architecture)#SIMD, SIMT and vector predication|"predicated" (masked)]] SIMD. This approach is not as compact as [[vector processing]] but is still far better than non-predicated SIMD. Detailed comparative examples are given at {{section link|Vector processor|Vector instruction example}}. In addition, all versions of the ARM architecture have offered Load and Store Multiple instructions, which load or store a block of data from a contiguous block of memory into a range, or non-contiguous set, of registers.<ref>{{Cite web |title=ARM LDR/STR, LDM/STM instructions - Programmer All |url=https://programmerall.com/article/2483661565/ |access-date=2025-04-19 |website=programmerall.com}}</ref>
 
==Chronology==
{| class="wikitable"
|+ SIMD supercomputer examples excluding [[vector processor]]s
|-
! Year !! Example
|-
| 1974 || [[ILLIAC IV]] – an array processor comprising scalar 64-bit PEs
|-
| 1974 || [[ICL Distributed Array Processor]] (DAP)
| 1981 || [[Geometric-Arithmetic Parallel Processor]] from [[Martin Marietta]] (continued at [[Lockheed Martin]], then at [http://www.teranex.com Teranex] and [[Silicon Optix]])
|-
| 1983–1991 || [[Goodyear MPP|Massively Parallel Processor]] (MPP), from [[NASA]]/[[Goddard Space Flight Center]]
|-
| 1985 || [[Connection Machine]], models 1 and 2 (CM-1 and CM-2), from [[Thinking Machines Corporation]]
|-
| 1987–1996 || [[MasPar]] MP-1 and MP-2
|-
| 1991 || [[Zephyr DC]] from [[Wavetracer]]
|-
| 2001 || [[Xplor (Pyxsys)|Xplor]] from [[Pyxsys, Inc.]]
|}
 
==Hardware==
Small-scale (64 or 128 bits) SIMD became popular on general-purpose CPUs in the early 1990s and continued through 1997 and later with Motion Video Instructions (MVI) for [[DEC Alpha|Alpha]]. SIMD instructions can be found, to one degree or another, on most CPUs, including [[IBM]]'s [[AltiVec]] and [[Signal Processing Engine]] (SPE) for [[PowerPC]], [[Hewlett-Packard]]'s (HP) [[PA-RISC]] [[Multimedia Acceleration eXtensions]] (MAX), [[Intel]]'s [[MMX (instruction set)|MMX and iwMMXt]], [[Streaming SIMD Extensions]] (SSE), [[SSE2]], [[SSE3]], [[SSSE3]] and [[SSE4]].x, [[Advanced Micro Devices]]' (AMD) [[3DNow!]], [[ARC (processor)|ARC]]'s ARC Video subsystem, [[SPARC]]'s [[Visual Instruction Set|VIS]] and VIS2, [[Sun Microsystems|Sun]]'s [[MAJC]], [[ARM Holdings|ARM]]'s [[ARM architecture#Advanced SIMD (Neon)|Neon]] technology, [[MIPS architecture|MIPS]]' [[MDMX]] (MaDMaX) and [[MIPS-3D]]. The [[Cell (processor)|Cell]] processor, co-developed by IBM, Sony and Toshiba, has a [[Cell (processor)#Synergistic Processing Element (SPE)|Synergistic Processing Element]] (SPE) whose instruction set is heavily SIMD-based. [[Philips]], now [[NXP Semiconductors|NXP]], developed several SIMD processors named [[Xetal]]. The Xetal has 320 16-bit processor elements especially designed for vision tasks. Apple's M1 and M2 chips also incorporate SIMD units deeply integrated with their GPU and Neural Engine, using Apple-designed SIMD pipelines optimized for image filtering, convolution, and matrix multiplication. This unified memory architecture helps SIMD instructions operate on shared memory pools more efficiently.
 
Intel's [[AVX-512]] SIMD instructions process 512 bits of data at once.
* Library multi-versioning (LMV): the entire [[Library (computing)|programming library]] is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.
 
FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. [[Intel C++ Compiler]], [[GNU Compiler Collection]] since GCC 6, and [[Clang]] since clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and clang require explicit {{code|target_clones}} labels in the code to "clone" functions,<ref>{{cite web |title=Function multi-versioning in GCC 6 |url=https://lwn.net/Articles/691932/ |website=lwn.net |date=22 June 2016 }}</ref> while ICC does so automatically (under the command-line option {{code|/Qax}}). [[Rust (programming language)|Rust]] also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is done manually via inlining.<ref>{{cite web |title=2045-target-feature |url= https://rust-lang.github.io/rfcs/2045-target-feature.html |website=The Rust RFC Book}}</ref>
 
As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed. [[Glibc]] supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.<ref name=clear>{{cite web |title=Transparent use of library packages optimized for Intel® architecture |url=https://clearlinux.org/news-blogs/transparent-use-library-packages-optimized-intel-architecture |website=Clear Linux* Project |access-date=8 September 2019 |language=en}}</ref>
Instances of these types are immutable and in optimized code are mapped directly to SIMD registers. Operations expressed in Dart typically are compiled into a single instruction without any overhead. This is similar to C and C++ intrinsics. Benchmarks for [[4×4 matrix|4×4]] [[matrix multiplication]], [[3D vertex transformation]], and [[Mandelbrot set]] visualization show near 400% speedup compared to scalar code written in Dart.
 
Intel announced at IDF 2013 that they were implementing McCutchan's specification for both [[V8 (JavaScript engine)|V8]] and [[SpiderMonkey]].<ref>{{cite web |title=SIMD in JavaScript |url=https://01.org/node/1495 |website=01.org |date=8 May 2014}}</ref> However, by 2017, SIMD.js was taken out of the [[ECMAScript]] standard queue in favor of pursuing a similar interface in [[WebAssembly]].<ref>{{cite web |title=tc39/ecmascript_simd: SIMD numeric type for EcmaScript. |url=https://github.com/tc39/ecmascript_simd/ |website=GitHub |publisher=Ecma TC39 |access-date=8 September 2019 |date=22 August 2019}}</ref> Support for SIMD was added to the WebAssembly 2.0 specification, which was finished in 2022 and became official in December 2024.<ref>{{cite web |url=https://webassembly.org/news/2025-03-20-wasm-2.0/ |title=Wasm 2.0 Completed - WebAssembly}}</ref> LLVM's auto-vectorization, when compiling C or C++ to WebAssembly, can target WebAssembly SIMD to automatically make use of SIMD, while SIMD intrinsics are also available.<ref>{{cite web |title=Using SIMD with WebAssembly |url=https://emscripten.org/docs/porting/simd.html |work=Emscripten 4.0.11-git (dev) documentation}}</ref>
 
==Commercial applications==