Jerryobject (talk | contribs) m →Chronology: Small WP:COPYEDIT WP:EoS WP:TERSE. Hyphens > MOS:NDASHes. DeWP:LINK WP:REDs until article or section exists.
m Removing link(s) Wikipedia:Articles for deletion/Permute instruction closed as soft delete (XFDcloser)
(32 intermediate revisions by 8 users not shown)
Line 1:
{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{See also|SIMD within a register|Single instruction, multiple threads}}
{{Update|inaccurate=yes|date=March 2017}}
{{Flynn's Taxonomy}}
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]
'''Single instruction, multiple data''' ('''SIMD''') is a type of [[parallel computing]] (processing) in [[Flynn's taxonomy]]. SIMD describes computers with [[multiple processing elements]] that perform the same operation on multiple data points simultaneously. SIMD can be internal (part of the hardware design) and it can be directly accessible through an [[instruction set architecture]] (ISA), but it should not be confused with an ISA.
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is adding many pairs of numbers together: all of the SIMD units perform an addition, but each one has a different pair of values to add. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which reduce latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply a fused multiply-add (FMA) operation to every lane of a 512-bit register with a single instruction.
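The pairwise addition described above can be sketched in C. This is a minimal illustration, assuming a GCC- or Clang-compatible compiler: the <code>vector_size</code> attribute is a compiler extension rather than ISO C, and the names <code>f32x4</code> and <code>add_pairs</code> are hypothetical. Whether the lane-wise addition becomes a single hardware SIMD instruction depends on the target processor and compiler options.

```c
#include <stddef.h>
#include <string.h>

/* Four 32-bit float lanes in one 16-byte value.
   vector_size is a GCC/Clang extension, not ISO C. */
typedef float f32x4 __attribute__((vector_size(16)));

/* Computes out[i] = a[i] + b[i] for n pairs of values. */
void add_pairs(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: each iteration handles four pairs at once. */
    for (; i + 4 <= n; i += 4) {
        f32x4 va, vb, vsum;
        memcpy(&va, a + i, sizeof va);      /* load without assuming alignment */
        memcpy(&vb, b + i, sizeof vb);
        vsum = va + vb;                     /* one operation, four additions */
        memcpy(out + i, &vsum, sizeof vsum);
    }

    /* Scalar tail for the last n % 4 pairs. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Every lane executes the same addition, and only the data differs between lanes. The scalar tail loop is needed because the lane count is fixed at four regardless of <code>n</code>.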
== Confusion between SIMT and SIMD ==
SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT). SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution. A key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). These allow divergence and convergence of threads, even under shared instruction streams, thereby offering slightly more flexibility than classical SIMD.
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Stewart Denenberg, July 15, 1971<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
Each hardware element (PU) working on an individual data item is sometimes also referred to as a SIMD lane or channel. Modern [[graphics processing unit]]s (GPUs) are often wide SIMD (typically >16 data lanes or channels) implementations.{{cn|date=July 2024}} Some newer GPUs go beyond simple SIMD and integrate mixed-precision SIMD pipelines, which allow concurrent execution of [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations in different lanes. This is critical for applications like AI inference, where mixed precision boosts throughput.
SIMD should not be confused with [[Vector processing]], characterized by the [[Cray 1]] and clarified in [[Duncan's taxonomy]]. The
[[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction.
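The practical effect of a vector-length instruction can be sketched by contrasting two loop shapes in C. This is only a scalar model under stated assumptions: <code>set_vector_length</code>, <code>MAX_VL</code> and <code>SIMD_WIDTH</code> are hypothetical names standing in for hardware behaviour, and the inner loops stand in for single vector or SIMD instructions.

```c
#include <stddef.h>

#define MAX_VL     8   /* hypothetical maximum hardware vector length */
#define SIMD_WIDTH 4   /* hypothetical fixed SIMD register width */

/* Hypothetical stand-in for a Cray-style SET VECTOR LENGTH instruction:
   the hardware grants min(remaining, MAX_VL) elements. */
static size_t set_vector_length(size_t remaining)
{
    return remaining < MAX_VL ? remaining : MAX_VL;
}

/* Vector-processor style: one loop covers every element, including the
   final partial batch, because the length is set per iteration. */
void scale_vector_style(float *x, size_t n, float k)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = set_vector_length(n - i);
        for (size_t j = 0; j < vl; j++)     /* models one vector instruction */
            x[i + j] *= k;
        i += vl;
    }
}

/* Fixed-width SIMD style: full-width batches, then a separate scalar
   cleanup loop for the leftover elements. */
void scale_simd_style(float *x, size_t n, float k)
{
    size_t i = 0;
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        for (size_t j = 0; j < SIMD_WIDTH; j++)  /* models one SIMD instruction */
            x[i + j] *= k;
    for (; i < n; i++)                      /* scalar cleanup */
        x[i] *= k;
}
```

Both functions compute the same result. The difference lies in who handles the remainder: the hardware in the first case, and extra program code in the second.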
One key distinction between SIMT and SIMD is that a SIMD unit does not have its own memory.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These signals ensure that each Processing Element in the entire parallel array is synchronized in its simultaneous execution of the (one, current) broadcast instruction.
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on an individual data item is sometimes also referred to as a [[SIMD lane]] or channel. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are
==History==
The first known operational use to date of [[SIMD within a register]] was the [[TX-2]], in 1958. It was capable of 36-bit operations and two 18-bit or four 9-bit sub-word operations.
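The sub-word idea can be illustrated in portable C. The sketch below is an assumption-laden analogue, not the TX-2's actual mechanism: it packs four 8-bit lanes into a 32-bit word (rather than four 9-bit lanes into 36 bits), and the function name <code>swar_add_u8x4</code> is hypothetical.

```c
#include <stdint.h>

/* Adds four unsigned 8-bit lanes packed into 32-bit words.
   Each lane wraps around modulo 256 and never carries into its neighbour. */
uint32_t swar_add_u8x4(uint32_t a, uint32_t b)
{
    const uint32_t HIGH = 0x80808080u;          /* top bit of every lane */

    /* Add the low 7 bits of each lane; the largest per-lane sum is 0xFE,
       so no carry can cross a lane boundary. */
    uint32_t low = (a & ~HIGH) + (b & ~HIGH);

    /* Fold in the top bits with XOR, which adds them without carry-out. */
    return low ^ ((a ^ b) & HIGH);
}
```

For example, <code>swar_add_u8x4(0x01FF7F10, 0x0101017F)</code> yields <code>0x0200808F</code>: the lane holding <code>0xFF + 0x01</code> wraps to <code>0x00</code> without disturbing the lane above it.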
The first commercial use of SIMD instructions was in the [[ILLIAC IV]], which was completed in 1972. This included 64 (of an original design of 256) processors that had local memory to hold different values while performing the same instruction. Separate hardware quickly sent out the values to be processed and gathered up the results. The first era of modern SIMD computers was characterized by [[massively parallel
The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during the 1990s, demand grew for this type of computing power, and microprocessor vendors turned to SIMD to meet the demand.<ref name="conte">{{cite conference |title=The long and winding road to high-performance image processing with MMX/SSE |first1=G. |last1=Conte |first2=S. |last2=Tommesani |first3=F. |last3=Zanichelli |book-title=Proc. Fifth IEEE Int'l Workshop on Computer Architectures for Machine Perception |year=2000 |doi=10.1109/CAMP.2000.875989 |s2cid=13180531 |hdl=11381/2297671}}</ref> This resurgence also coincided with the rise of [[DirectX]] and OpenGL shader models, which heavily leveraged SIMD under the hood. The graphics APIs encouraged programmers to adopt data-parallel programming styles, indirectly accelerating SIMD adoption in desktop software. Hewlett-Packard introduced [[Multimedia Acceleration eXtensions]] (MAX) instructions into [[PA-RISC]] 1.1 desktops in 1994 to accelerate MPEG decoding.<ref>{{cite book |first=R.B. |last=Lee |chapter=Realtime MPEG video via software decompression on a PA-RISC processor |title=digest of papers Compcon '95. Technologies for the Information Superhighway |year=1995 |pages=186–192 |doi=10.1109/CMPCON.1995.512384 |isbn=0-8186-7029-0|s2cid=2262046}}</ref> Sun Microsystems introduced SIMD integer instructions in its "[[Visual Instruction Set|VIS]]" instruction set extensions in 1995, in its [[UltraSPARC|UltraSPARC I]] microprocessor. MIPS followed suit with their similar [[MDMX]] system.
Line 42 ⟶ 54:
* Programming with given SIMD instruction sets can involve many low-level challenges.
*# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
*# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring
*# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
*# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
Line 48 ⟶ 60:
*# The early [[MMX (instruction set)|MMX]] instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, [[SSE2]] corrects this.
To remedy problems 1 and 5, Cray-style [[
ARM's [[Scalable Vector Extension]] takes another approach, known in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's Taxonomy]]
==Chronology==
Line 60 ⟶ 72:
! Year !! Example
|-
| 1974 || [[ILLIAC IV]] - an Array Processor comprising scalar 64-bit PEs
|-
| 1974 || [[ICL Distributed Array Processor]] (DAP)
Line 114 ⟶ 126:
* Library multi-versioning (LMV): the entire [[Library (computing)|programming library]] is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.
FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. [[Intel C++ Compiler]], [[GNU Compiler Collection]] since GCC 6, and [[Clang]] since Clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and Clang require explicit {{code|target_clones}} labels in the code to "clone" functions,<ref>{{cite web |title=Function multi-versioning in GCC 6 |url=https://lwn.net/Articles/691932/ |website=lwn.net |date=22 June 2016 }}</ref> while ICC does so automatically (under the command-line option {{code|/Qax}}). The [[Rust programming language]] also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is manually done via inlining.<ref>{{cite web |title=2045-target-feature |url= https://rust-lang.github.io/rfcs/2045-target-feature.html |website=The Rust RFC Book}}</ref>
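A {{code|target_clones}} label might look like the following C sketch. It assumes a GCC-compatible compiler on x86-64 with glibc (which supplies the run-time dispatch); the macro <code>MULTIVERSION</code> and the function <code>scale_samples</code> are hypothetical names, and the guard makes the code fall back to a single ordinary version on any other toolchain.

```c
#include <stddef.h>
#include <stdlib.h>   /* on glibc, including a libc header defines __GLIBC__ */

/* Request an AVX2 clone plus a baseline clone only where the attribute
   and its run-time dispatch are expected to be available. */
#if defined(__has_attribute)
#  if __has_attribute(target_clones) && defined(__x86_64__) && defined(__GLIBC__)
#    define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION   /* fallback: compile one ordinary version */
#endif

/* Multiplies every audio sample by gain. With cloning enabled, the
   compiler emits one version per listed target and a resolver picks
   between them when the program is loaded. */
MULTIVERSION
void scale_samples(float *samples, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        samples[i] *= gain;
}
```

Callers invoke <code>scale_samples</code> as usual; the source contains one function body, and the selection between clones is invisible at the call site.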
As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed. [[Glibc]] supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.<ref name=clear>{{cite web |title=Transparent use of library packages optimized for Intel® architecture |url=https://clearlinux.org/news-blogs/transparent-use-library-packages-optimized-intel-architecture |website=Clear Linux* Project |access-date=8 September 2019 |language=en}}</ref>
Line 125 ⟶ 137:
Instances of these types are immutable and in optimized code are mapped directly to SIMD registers. Operations expressed in Dart typically are compiled into a single instruction without any overhead. This is similar to C and C++ intrinsics. Benchmarks for [[4×4 matrix|4×4]] [[matrix multiplication]], [[3D vertex transformation]], and [[Mandelbrot set]] visualization show near 400% speedup compared to scalar code written in Dart.
Intel announced at IDF 2013 that they were implementing McCutchan's specification for both [[V8 (JavaScript engine)|V8]] and [[SpiderMonkey]].<ref>{{cite web |title=SIMD in JavaScript |url=https://01.org/node/1495 |website=01.org |date=8 May 2014}}</ref> However, by 2017, SIMD.js was taken out of the [[ECMAScript]] standard queue in favor of pursuing a similar interface in [[WebAssembly]].<ref>{{cite web |title=tc39/ecmascript_simd: SIMD numeric type for EcmaScript. |url=https://github.com/tc39/ecmascript_simd/ |website=GitHub |publisher=Ecma TC39 |access-date=8 September 2019 |date=22 August 2019}}</ref> Support for SIMD was added to the WebAssembly 2.0 specification, which was finished in 2022 and became official in December 2024.<ref>{{cite web |url=https://webassembly.org/news/2025-03-20-wasm-2.0/ |title=Wasm 2.0 Completed - WebAssembly}}</ref> LLVM's autovectorization, when compiling C or C++ to WebAssembly, can target WebAssembly SIMD to automatically make use of SIMD, while SIMD intrinsics are also available.<ref>{{cite web |title=Using SIMD with WebAssembly |url=https://emscripten.org/docs/porting/simd.html |work=Emscripten 4.0.11-git (dev) documentation}}</ref>
==Commercial applications==