Single instruction, multiple data: Difference between revisions

Content deleted Content added
moving paragraphs on SIMT and vector processing (this is the delete)
Tags: Reverted Mobile edit Mobile web edit Advanced mobile edit
m Removing link(s) Wikipedia:Articles for deletion/Permute instruction closed as soft delete (XFDcloser)
 
(8 intermediate revisions by 3 users not shown)
Line 1:
{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{Update|inaccurate=yes|date=March 2017}}
{{Flynn's Taxonomy}}
{{See also|SIMD within a register|Single instruction, multiple threads}}
{{Flynn's Taxonomy}}
{{Update|inaccurate=yes|date=March 2017}}
 
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]
Line 10:
 
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is to add many pairs of numbers together, all of the SIMD units are performing an addition, but each one has different pairs of values to add. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which minimize latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply fused multiply-add operations (FMA) in a single SIMD cycle.
 
== Confusion between SIMT and SIMD ==
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
 
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]
 
SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT). SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution, such as in the [[ILLIAC IV]].
 
SIMD should not be confused with [[Vector processing]], characterized by the [[Cray 1]] and clarified in [[Duncan's taxonomy]]. The
[[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction.
 
One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory.
Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals". These signals ensure that each Processing Element in the entire parallel array is synchronized in its simultaneous execution of the (one, current) broadcast instruction.
 
Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on individual data item sometimes also referred to as a [[SIMD lane]] or channel. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are invariably wide [[SIMD within a register]] (SWAR) and typically have more that 16 data lanes or channels of such Processing Elements.{{cn|date=July 2024}} Some newer GPUs integrate mixed-precision {{cn|date=July 2025}} SWAR pipelines, which performs concurrent sub-word [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations. This is critical for applications like AI inference, where mixed precision boosts throughput.
 
==History==
Line 39 ⟶ 54:
* Programming with given SIMD instruction sets can involve many low-level challenges.
*# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
*# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring [[permute instruction]]sinstructions (operations) and can be inefficient.
*# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
*# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.