Single instruction, multiple data: Difference between revisions

Content deleted Content added

Revision as of 06:48, 27 July 2025 by Lkcl (talk | contribs), Extended confirmed users, 3,004 edits. →Disadvantages: clarity on difference between VP and SIM. also someone used RVV as a sole-exclusive example: referring to VP and "Cray-style" feels better. Tags: Mobile edit, Mobile web edit, Advanced mobile edit

Latest revision as of 22:20, 25 August 2025 by Liz (talk | contribs), Autopatrolled, Checkusers, Oversighters, Administrators, 843,494 edits. m Removing link(s) Wikipedia:Articles for deletion/Permute instruction closed as soft delete (XFDcloser)
(13 intermediate revisions by 6 users not shown)
Line 1:
{{Short description|Type of parallel processing}}
{{Redirect|SIMD|the cryptographic hash function|SIMD (hash function)|the Scottish statistical tool|Scottish index of multiple deprivation}}
{{Update|inaccurate=yes|date=March 2017}}▼
{{Flynn's Taxonomy}}▼
{{See also|SIMD within a register|Single instruction, multiple threads}}
▲{{Flynn's Taxonomy}}
▲{{Update|inaccurate=yes|date=March 2017}}
[[File:SIMD2.svg|thumb|Single instruction, multiple data]]

Line 11:
Such machines exploit [[Data parallelism|data level parallelism]], but not [[Concurrent computing|concurrency]]: there are simultaneous (parallel) computations, but each unit performs exactly the same instruction at any given moment (just with different data). A simple example is adding many pairs of numbers together: all of the SIMD units perform an addition, but each one has a different pair of values to add. SIMD is especially applicable to common tasks such as adjusting the contrast in a [[digital image]] or adjusting the volume of [[digital audio]]. Most modern [[central processing unit]] (CPU) designs include SIMD instructions to improve the performance of [[multimedia]] use. In recent CPUs, SIMD units are tightly coupled with cache hierarchies and prefetch mechanisms, which reduce latency during large block operations. For instance, AVX-512-enabled processors can prefetch entire cache lines and apply fused multiply-add (FMA) operations with a single SIMD instruction.
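As a minimal sketch of the pair-wise addition described above, the following C function adds two arrays element by element. The name `add_pairs` is hypothetical and does not come from the article; the sketch assumes an x86 compiler that defines `__SSE__` and ships `<xmmintrin.h>`, and it falls back to a plain scalar loop everywhere else.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper (not from the article): dst[i] = a[i] + b[i] for i in [0, n). */
void add_pairs(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* One SIMD add instruction handles four pairs of floats at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail; this loop does all the work when SSE is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling `add_pairs(out, a, b, n)` gives the same result as the scalar loop alone; on SSE-capable builds each `_mm_add_ps` performs four additions with one instruction, which is the "same instruction, different data" behaviour the paragraph describes.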
== Confusion between SIMT and SIMD ==
[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971.<ref>https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf</ref>]] ▼
{{See also|SIMD within a register|Single instruction, multiple threads|Vector processor}}
▲[[Image:ILLIAC_IV.jpg|thumb|[[ILLIAC IV]] Array overview, from ARPA-funded Introductory description by Steward Denenberg, July 15 1971.<ref>{{Cite web | title=Archived copy | url=https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-url=https://web.archive.org/web/20240427173522/https://apps.dtic.mil/sti/tr/pdf/ADA954882.pdf | archive-date=2024-04-27}}</ref>]]

SIMD has three different subcategories in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 Taxonomy]], one of which is [[single instruction, multiple threads]] (SIMT). SIMT should not be confused with [[Thread (computing)|software threads]] or [[Multithreading (computer architecture)|hardware threads]], both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution, such as in the [[ILLIAC IV]].

Line 18 ⟶ 21:
[[Vector processor#Difference between SIMD and vector processors|difference between SIMD and vector processors]] is primarily the presence of a Cray-style {{code|SET VECTOR LENGTH}} instruction. One key distinction between SIMT and SIMD is that the SIMD unit will not have its own memory ~~(a SIMT system could ''use'' a SIMD unit: usually termed [[SIMD lanes]])~~. Another key distinction in SIMT is the presence of control flow mechanisms like warps ([[Nvidia]] terminology) or wavefronts (Advanced Micro Devices ([[AMD]]) terminology). [[ILLIAC IV]] simply called them "Control Signals".
~~These allow divergence and convergence of threads, even under shared instruction streams, thereby offering slightly more flexibility than classical [[SIMD within a register]].{{clarify|reason=Is classical SIMD one of the subcategories in Flynn's 1972 paper? If so, which subcategory?|date=July 2025}}~~ These signals ensure that each Processing Element in the entire parallel array is synchronized in its simultaneous execution of the (one, current) broadcast instruction.

Each hardware element (PU, or PE in [[ILLIAC IV]] terminology) working on an individual data item is sometimes also referred to as a [[SIMD lane]] or channel~~, although the~~. The ILLIAC IV PE was a scalar 64-bit unit that could do 2x32-bit [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|predication]]. Modern [[graphics processing unit]]s (GPUs) are invariably wide [[SIMD within a register]] (SWAR) and typically have more than 16 data lanes or channels of such Processing Elements.{{cn|date=July 2024}} Some newer GPUs integrate mixed-precision {{cn|date=July 2025}} SWAR pipelines, which perform concurrent sub-word [[8-bit computing|8-bit]], [[16-bit computing|16-bit]], and [[32-bit computing|32-bit]] operations. This is critical for applications like AI inference, where mixed precision boosts throughput.

==History==

Line 51 ⟶ 54:
* Programming with given SIMD instruction sets can involve many low-level challenges.
# SIMD may have restrictions on [[Data structure alignment|data alignment]]; programmers familiar with a given architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
# Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring ~~[[permute instruction]]s~~permute instructions (operations)) and can be inefficient.
# Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
# Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.

Line 123 ⟶ 126:
* Library multi-versioning (LMV): the entire [[Library (computing)|programming library]] is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.

FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. [[Intel C++ Compiler]], [[GNU Compiler Collection]] since GCC 6, and [[Clang]] since clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and Clang require explicit {{code|target_clones}} labels in the code to "clone" functions,<ref>{{cite web |title=Function multi-versioning in GCC 6 |url=https://lwn.net/Articles/691932/ |website=lwn.net |date=22 June 2016 }}</ref> while ICC does so automatically (under the command-line option {{code|/Qax}}).

The [[Rust programming language]] also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is manually done via inlining.<ref>{{cite web |title=2045-target-feature |url= https://rust-lang.github.io/rfcs/2045-target-feature.html |website=The Rust RFC Book}}</ref>

As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed.
[[Glibc]] supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.<ref name=clear>{{cite web |title=Transparent use of library packages optimized for Intel® architecture |url=https://clearlinux.org/news-blogs/transparent-use-library-packages-optimized-intel-architecture |website=Clear Linux* Project |access-date=8 September 2019 |language=en}}</ref>
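
The compiler-assisted FMV approach described above might look like the following sketch in C. The function name `scale`, the `MULTIVERSIONED` macro, and the chosen targets (`avx2` plus `default`) are illustrative assumptions, not taken from any of the libraries mentioned. As a further assumption, the attribute is only enabled for GCC on x86-64 with glibc, where ifunc-based dispatch is expected to be available; on other toolchains the macro expands to nothing and a single ordinary version is built.

```c
#include <assert.h> /* on glibc systems this also makes __GLIBC__ visible */
#include <stddef.h>

/* Assumption: only request clones where GCC's ifunc-based dispatch is
   expected to work (GCC, x86-64, glibc). Elsewhere, build one plain version. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

/* Hypothetical example function: multiply every sample by a gain factor,
   similar to the audio-volume adjustment mentioned in the article.
   With target_clones, the compiler emits one copy per listed target and
   selects the best supported copy when the program is loaded. */
MULTIVERSIONED
void scale(float *data, size_t n, float gain)
{
    assert(data != NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] *= gain;
}
```

Callers simply invoke `scale(samples, count, 0.5f)`; no call site changes are needed, which is the main appeal of letting the compiler handle duplication and selection. Library multi-versioning, by contrast, would leave this source untouched and build the whole library several times with different compiler switches.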