Single instruction, multiple data: Difference between revisions

Content deleted Content added
mention SIMD lanes and clarify a little by mentioning SWAR
Tags: Mobile edit Mobile web edit Advanced mobile edit
m Removing link(s) Wikipedia:Articles for deletion/Permute instruction closed as soft delete (XFDcloser)
 
(20 intermediate revisions by 6 users not shown)
Line 1:
{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{Update|inaccurate=yes|date=March 2017}}
{{Flynn's Taxonomy}}
{{See also|SIMD within a register|Single instruction, multiple threads}}
{{Flynn's Taxonomy}}
{{Update|inaccurate=yes|date=March 2017}}
 
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]
Line 11:
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is to add many pairs of numbers together, all of the SIMD units are performing an addition, but each one has different pairs of values to add. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which minimize latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply fused multiply-add operations (FMA) in a single SIMD cycle.
 
== Confusion between SIMT and SIMD ==
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971.<ref>https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf</ref>]]
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971.<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT). SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution, such as in the [[ILLIAC IV]].
 
SIMD should not be confused with [[Vector processing]], characterized by the [[Cray 1]] and clarified in [[Duncan's taxonomy]]. The
One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory (a SIMT system could ''use'' a SIMD unit: usually termed [[SIMD lanes]]).
[[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These allow divergence and convergence of threads, even under shared instruction streams, thereby offering slightly more flexibility than classical [[SIMD within a register]].{{clarify|reason=Is classical SIMD one of the subcategories in Flynn's 1972 paper? If so, which subcategory?|date=July 2025}}
 
One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory (a SIMT system could ''use'' a SIMD unit: usually termed [[SIMD lanes]]).
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on individual data item sometimes also referred as SIMD lane or channel. Modern [[graphics processing unit]]s (GPUs) are often wide SIMD (typically >16 data lanes or channel) implementations.{{cn|date=July 2024}} Some newer GPUs go beyond simple SIMD and integrate mixed-precision SIMD pipelines, which allow concurrent execution of [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations in different lanes. This is critical for applications like AI inference, where mixed precision boosts throughput.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These allowsignals divergenceensure andthat convergenceeach ofProcessing threads,Element evenin underthe sharedentire instructionparallel streams,array therebyis offeringsynchronized slightlyin moreits flexibilitysimultaneous than classical [[SIMD within a register]].{{clarify|reason=Is classical SIMD oneexecution of the subcategories in Flynn's 1972 paper? If so(one, whichcurrent) subcategory?|date=Julybroadcast 2025}}instruction.
 
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on individual data item sometimes also referred to as a [[SIMD lane]] or channel. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are ofteninvariably wide [[SIMD within a register]] (SWAR) and typically >have more that 16 data lanes or channel)channels of such Processing implementationsElements.{{cn|date=July 2024}} Some newer GPUs go beyond simple SIMD and integrate mixed-precision SIMD{{cn|date=July 2025}} SWAR pipelines, which allowperforms concurrent execution ofsub-word [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations in different lanes. This is critical for applications like AI inference, where mixed precision boosts throughput.
Additionally, SIMD can exist in both fixed and scalable vector forms. Fixed-width SIMD units operate on a constant number of data points per instruction, while scalable designs, like RISC-V Vector or ARM's SVE, allow the number of data elements to vary depending on the hardware implementation. This improves forward compatibility across generations of processors.
 
==History==
The first known operational use to date of [[SIMD within a register]] was the [[TX-2]], in 1958. It was capable of 36-bit operations and two 18-bit or four 9-bit sub-word operations.

The first commercial use of SIMD instructions was in the [[ILLIAC IV]], which was completed in 1972. This included 64 (of an original design of 256) processors that had local memory to hold different values while performing the same instruction. Separate hardware quickly sent out the values to be processed and gathered up the results.
 
[[vector processor|Vector supercomputers]] of the early 1970s such as the [[CDC STAR-100|CDC Star-100]] and the [[TI Advanced Scientific Computer|Texas Instruments ASC]] could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by [[Cray]] in the 1970s and 1980s. Vector processing architectures are now considered separate from SIMD computers: [[Duncan's Taxonomy]] includes them whereas [[Flynn's Taxonomy]] does not, due to Flynn's work (1966, 1972) pre-dating the [[Cray-1]] (1977). The complexity of Vector processors however inspired a simpler arrangement known as [[SIMD within a register]].
Line 48 ⟶ 54:
* Programming with given SIMD instruction sets can involve many low-level challenges.
*# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
*# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring [[permute instruction]]sinstructions (operations) and can be inefficient.
*# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
*# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
Line 54 ⟶ 60:
*# The early [[MMX (instruction set)|MMX]] instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, [[SSE2]] corrects this.
 
To remedy problems 1 and 5, Cray-style [[RISC-VVector processors]]'s vector extension usesuse an alternative approach: instead of exposing the sub-register-level details directly to the programmer, the instruction set abstracts them out asat aleast fewthe "vectorlength registers"(number thatof useelements) theinto samea interfacesruntime acrosscontrol allregister, CPUsusually withnamed this"VL" instruction(Vector setLength). The hardware then handles all alignment issues and "strip-mining" of loops. Machines with different vector sizes would be able to run the same code. LLVM calls this vector type "{{not a typo|vscale}}".{{citation needed|date=June 2021}}
 
AnWith SIMD, an order of magnitude increase in code size is not uncommon, when compared to equivalent scalar or equivalent vector code, and an order of magnitude ''or greater'' effectiveness (work done per instruction) is achievable with Vector ISAs.<ref>{{cite web |last1=Patterson |first1=David |last2=Waterman |first2=Andrew |title=SIMD Instructions Considered Harmful |url=https://www.sigarch.org/simd-instructions-considered-harmful/ |website=SIGARCH |date=18 September 2017}}</ref>
 
ARM's [[Scalable Vector Extension]] takes another approach, known in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's Taxonomy]] as "Associative Processing", more commonly known today as [[Predication (computer architecture)#SIMD, SIMT and vector predication|"Predicated" (masked)]] SIMD. This approach is not as compact as [[vector processing]] but is still far better than non-predicated SIMD. Detailed comparative examples are given at {{section link|Vector processor|Vector instruction example}}. In addition, all versions of the ARM architecture have offered Load and Store multiple instructions, to Load or Store a block of data from a continuous block of memory, into a range or non-continuous set of registers.<ref>{{Cite web |title=ARM LDR/STR, LDM/STM instructions - Programmer All |url=https://programmerall.com/article/2483661565/ |access-date=2025-04-19 |website=programmerall.com}}</ref>
 
==Chronology==
Line 66 ⟶ 72:
! Year !! Example
|-
| 1974 || [[ILLIAC IV]] - an Array Processor comprising scalar 64-bit PEs
|-
| 1974 || [[ICL Distributed Array Processor]] (DAP)
Line 120 ⟶ 126:
* Library multi-versioning (LMV): the entire [[Library (computing)|programming library]] is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.
 
FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. [[Intel C++ Compiler]], [[GNU Compiler Collection]] since GCC 6, and [[Clang]] since clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and clang requires explicit {{code|target_clones}} labels in the code to "clone" functions,<ref>{{cite web |title=Function multi-versioning in GCC 6 |url=https://lwn.net/Articles/691932/ |website=lwn.net |date=22 June 2016 }}</ref> while ICC does so automatically (under the command-line option {{code|/Qax}}). The [[Rust programming language]] also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is manually done via inlining.<ref>{{cite web |title=2045-target-feature |url= https://rust-lang.github.io/rfcs/2045-target-feature.html |website=The Rust RFC Book}}</ref>
 
As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed. [[Glibc]] supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.<ref name=clear>{{cite web |title=Transparent use of library packages optimized for Intel® architecture |url=https://clearlinux.org/news-blogs/transparent-use-library-packages-optimized-intel-architecture |website=Clear Linux* Project |access-date=8 September 2019 |language=en}}</ref>