{{Short description|Computer processor which works on arrays of several numbers at once}}
In [[computing]], a '''vector processor''' is a [[central processing unit]] (CPU) that implements an [[instruction set]] where its [[Instruction (computer science)|instructions]] are designed to operate efficiently and effectively on large [[Array data structure|one-dimensional array]]s of data called ''vectors''. This is in contrast to [[scalar processor]]s, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional [[single instruction, multiple data]] (SIMD) or [[SIMD within a register]] (SWAR) arithmetic units. Vector processors can greatly improve performance on certain workloads, notably [[numerical simulation]], [[Data compression|compression]] and similar tasks.<ref>{{Cite web |title=asm-lessons/lesson_01/index.md at main · FFmpeg/asm-lessons |url=https://github.com/FFmpeg/asm-lessons/blob/main/lesson_01/index.md |access-date=2025-04-25 |website=GitHub |language=en}}</ref>

Vector processing techniques also operate in [[video game console|video-game console]] hardware and in [[graphics accelerator]]s, but these are invariably [[Single instruction, multiple threads]] (SIMT) and occasionally [[Single instruction, multiple data]] (SIMD).
Vector machines appeared in the early 1970s and dominated [[supercomputer]] design through the 1970s into the 1990s, notably the various [[Cray]] platforms. The rapid fall in the [[price-to-performance ratio]] of conventional [[microprocessor]] designs led to a decline in vector supercomputers during the 1990s.
=== Early research and development ===
Array processing development began in the early 1960s at the [[Westinghouse Electric Corporation]] in their ''Solomon'' project. Solomon's goal was to dramatically increase math performance by using a large number of simple [[coprocessor]]s under the control of a single master [[Central processing unit]] (CPU). The CPU fed a single common instruction to all of the [[arithmetic logic unit]]s (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single [[algorithm]] to a large [[data set]], fed in the form of an array.
In 1962, Westinghouse cancelled the project, but the effort was restarted by the [[University of Illinois at Urbana–Champaign]] as the [[ILLIAC IV]]. Their version of the design originally called for a 1 [[GFLOPS]] machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as [[computational fluid dynamics]], the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common in later designs,{{Fact or opinion|date=August 2025}} and is often referred to under a separate category of [[massively parallel]] computing: around 1972 Flynn [[Flynn's taxonomy|categorized]] this type of processing as an early form of [[single instruction, multiple threads]] (SIMT).
[[International Computers Limited]] sought to avoid many of the difficulties with the ILLIAC concept with its own [[Distributed Array Processor]] (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.<ref name="newscientist19760617_dap">{{ cite magazine | url=https://archive.org/details/bub_gb_m8S4bXj3dcMC/page/n11/mode/2up | title=Computers by the thousand | magazine=New Scientist | last1=Parkinson | first1=Dennis | date=17 June 1976 | access-date=7 July 2024 | pages=626–627 }}</ref>
=== GPU ===
{{See also|Single instruction, multiple threads|GPUs|Synchronous graphics random-access memory}}
Modern graphics processing units ([[GPUs]]) include an array of [[shaders|shader pipelines]] which may be driven by [[compute kernel]]s, and being usually SIMT are frequently miscategorised as vector processors.{{Citation needed|date=July 2025}} GPUs hide memory latency by keeping many threads in flight rather than by relying on deep cache hierarchies.{{Citation needed|date=July 2025}} As shown in [[Flynn's taxonomy|Flynn's 1972 paper]], the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, while the cores receiving and executing that same instruction are otherwise reasonably normal: they have their own ALUs, their own register files, their own load/store units and their own independent L1 data caches. Thus, although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This is ''significantly'' more complex and involved than [[Flynn's Taxonomy#Pipelined processor|"Packed SIMD"]], which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW<ref>[http://miaowgpu.org/ MIAOW Vertical Research Group]</ref> team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.<ref>[https://github.com/VerticalResearchGroup/miaow/wiki/Architecture-Overview MIAOW GPU]</ref>
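The distinction can be sketched as a rough conceptual model in C (this is illustrative only, not how GPUs are programmed; all names are hypothetical): one "instruction" is decoded once, but each core applies it to its own data at its own, potentially unrelated, address.

<syntaxhighlight lang="c">
#include <stdio.h>

#define NCORES 4  /* hypothetical number of lock-stepped cores */

/* Conceptual SIMT model: the single broadcast operation (multiply by
   scale) is executed by every core in lock-step, but each core works
   on its own data at its own address. Unlike packed SIMD, the data
   need not sit contiguously in one register. */
static void simt_step(double *addr[NCORES], double scale) {
    for (int core = 0; core < NCORES; core++)  /* lock-step over cores */
        *addr[core] *= scale;                  /* same op, different data */
}

int main(void) {
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
    double *percore[NCORES] = { &a, &b, &c, &d }; /* per-core addresses */
    simt_step(percore, 10.0);
    printf("%f %f %f %f\n", a, b, c, d);          /* 10 20 30 40 */
    return 0;
}
</syntaxhighlight>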
=== Recent development ===
* '''Pure (fixed) SIMD''' - also known as "Packed SIMD",<ref>{{cite conference|first1=Y.|last1=Miyaoka|first2=J.|last2=Choi|first3=N.|last3=Togawa|first4=M.|last4=Yanagisawa|first5=T.|last5=Ohtsuki|title=An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions|conference=Asia-Pacific Conference on Circuits and Systems|date=2002|pages=171–176|volume=1|doi=10.1109/APCCAS.2002.1114930|hdl=2065/10689|hdl-access=free}}</ref> [[SIMD within a register]] (SWAR), and [[Flynn's taxonomy#Pipelined processor|Pipelined Processor]] in Flynn's Taxonomy (see the C sketch following this list). Common examples using SIMD with features inspired by vector processors include: Intel x86's [[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]] and [[Advanced Vector Extensions|AVX]] instructions, AMD's [[3DNow!]] extensions, [[ARM NEON]], Sparc's [[Visual Instruction Set|VIS]] extension, [[PowerPC]]'s [[AltiVec]] and MIPS' [[MIPS_architecture#Application-specific_extensions|MSA]]. In 2000, [[IBM]], [[Toshiba]] and [[Sony]] collaborated to create the [[Cell processor]], which is also SIMD.
* '''Predicated SIMD''' - Two notable examples which have per-element [[Predication_(computer_architecture)|predication]] are [[Scalable Vector Extension|ARM SVE2]] and [[AVX-512]].
* '''Pure Vectors''' - as categorised in [[Duncan's taxonomy#Pipelined vector processors|Duncan's taxonomy]] - these include the original [[Cray-1]], [[Convex Computer|Convex C-Series]], [[NEC SX]], [[Fujitsu VP]] series, [[IBM 3090#Vector facility|IBM 3090 Vector facility]] and [[RISC-V#Vector set|RISC-V RVV]]. Although memory-based, both the [[TI_Advanced_Scientific_Computer#Architecture|TI ASC]] and the [[CDC STAR-100]] were vector processors.
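To make the "Packed SIMD" category above concrete, a minimal C sketch using SSE intrinsics (the function name is illustrative):

<syntaxhighlight lang="c">
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed SIMD: one instruction operates on four 32-bit floats packed
   into a single fixed-width (128-bit) register. The width is baked
   into the ISA, unlike a true vector ISA where the number of elements
   processed per instruction is set at runtime. */
__m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);  /* compiles to a single ADDPS instruction */
}
</syntaxhighlight>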
Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, an approach known as [[MIMD]] (Multiple Instruction, Multiple Data) and realized with [[VLIW]] (Very Long Instruction Word) and [[Explicitly parallel instruction computing|EPIC]] (Explicitly Parallel Instruction Computing). The [[Fujitsu FR-V]] low-power multimedia processor combines VLIW with [[Predication_(computer_architecture)|predication]] and Packed SIMD.
==== Vector chaining ====
Vector processors can be more resource-efficient, using slower hardware and saving power, yet still achieve high throughput and lower latency than SIMD, by means of [[Chaining (vector processing)|vector chaining]].<ref>{{Cite web|url=http://thebeardsage.com/vector-architecture/|title = Vector Architecture|date = 27 April 2020}}</ref><ref>[http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture11-vector.pdf Vector and SIMD processors, slides 12-13]</ref>
Further, both the Vector-Load and Vector-Store instructions could perform the update of the address registers, saving on explicit instructions to perform address-calculations that ''in effect'' had already been done, behind the scenes, when loading/storing each element of the Vector.
The exact same principle is used in [[Libre-SOC]], illustrated with a DAXPY (64-bit floating-point) example:<ref>{{cite web | title=Daxpy example | url=https://libre-soc.org/openpower/sv/cookbook/daxpy_example/ }}</ref>
<!-- Copyright (c) 2023,2024 Luke Kenneth Casson Leighton -->
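For reference, DAXPY is the standard [[Basic Linear Algebra Subprograms|BLAS]] kernel computing y ← a·x + y over 64-bit floats; the vectorised listings discussed in this section all compact exactly this scalar loop:

<syntaxhighlight lang="c">
#include <stddef.h>

/* Scalar reference for DAXPY: y[i] = a*x[i] + y[i] over n doubles. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
</syntaxhighlight>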
The [[Power ISA]] CTR register is used for the total count instead of a scalar register, and the {{code|sv.bc/CTR}} instruction performs the reduction of CTR by the current Vector Length (VL) followed by testing CTR for being zero and branching if it is not.
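In C terms, the strip-mined loop structure that {{code|sv.bc/CTR}} collapses into a single instruction looks roughly like the sketch below (illustrative only; {{code|MAXVL}} stands in for whatever maximum vector length the hardware supports):

<syntaxhighlight lang="c">
#include <stddef.h>

enum { MAXVL = 8 };  /* illustrative maximum vector length */

/* Strip-mined DAXPY: each pass handles up to MAXVL elements; the
   remaining count (the role played by the CTR register) is reduced
   by the vector length actually used, and the loop branches back
   while elements remain. The pointer updates model the load/store
   address auto-update described earlier. */
void daxpy_stripmined(size_t n, double a, const double *x, double *y) {
    size_t ctr = n;
    while (ctr != 0) {
        size_t vl = ctr < MAXVL ? ctr : MAXVL;  /* hardware sets VL */
        for (size_t i = 0; i < vl; i++)         /* one vector operation */
            y[i] = a * x[i] + y[i];
        x += vl;    /* address registers updated by the load/store */
        y += vl;
        ctr -= vl;  /* sv.bc/CTR: subtract VL, branch if CTR != 0 */
    }
}
</syntaxhighlight>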
The approach is different but achieves the same end-result as the IBM 3090: a significant compacting of instructions that are already considered highly compact.<ref>{{cite web | title=SIMD Instructions Considered Harmful | date=18 September 2017 | url=https://www.sigarch.org/simd-instructions-considered-harmful/ }}</ref> A [[RISC-V]] Vector DAXPY inner loop example is 10 instructions, where the Libre-SOC and IBM 370 loops shown above are both 6: a 40% reduction.
=== Vector reduction example ===
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. The problems compound as the SIMD hardware width increases (an entirely new set of instructions is introduced each time), and, worse, some horizontal-sum instructions, such as those in the AMD [[XOP instruction set]], were ''removed'' after only a few years. Other techniques rely on shuffles: detailed worked examples of how to perform a "horizontal sum" with [[AVX-512]] and [[Streaming SIMD Extensions|SSE]] can be found online.<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref>
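One common shuffle-based SSE variant, in the style of the answers cited above, reduces the four lanes with two shuffle-and-add rounds; a sketch in C intrinsics (the function name is illustrative):

<syntaxhighlight lang="c">
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the four floats in v: two shuffle+add rounds
   reduce 4 lanes to 2, then 2 to 1. */
float hsum_ps(__m128 v) {
    __m128 hi    = _mm_movehl_ps(v, v);        /* lanes 2,3 -> lanes 0,1 */
    __m128 sum2  = _mm_add_ps(v, hi);          /* [v0+v2, v1+v3, ..] */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, 0x55); /* broadcast lane 1 */
    __m128 sum1  = _mm_add_ss(sum2, lane1);    /* lane0 = (v0+v2)+(v1+v3) */
    return _mm_cvtss_f32(sum1);                /* extract scalar result */
}
</syntaxhighlight>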
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: because floating-point addition is not associative, summing the values out of strict sequential order (as four partial results) can change the rounding and therefore the final result.
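A small C demonstration of the effect (values chosen to make the discrepancy obvious):

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float sequential = (a + b) + c;  /* 0 + 1 = 1.0 */
    float reordered  = a + (b + c);  /* b+c rounds back to -1e8, so 0.0 */
    printf("sequential=%f reordered=%f\n", sequential, reordered);
    return 0;
}
</syntaxhighlight>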
Compared to any SIMD processor claiming to be a vector processor, the order of magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
# From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being successful, regardless of alignment of the start of the vector.
# Whilst from the reduction example it can be seen that, aside from permute instructions, SIMD hardware can largely avoid inter-lane operations, a vector processor must implement arbitrary cross-element operations, such as reductions, directly in hardware.
Overall then there is a choice to either have complex software and simplified hardware (SIMD), or simplified software and more complex hardware (a true vector ISA).
* '''Masked Operations''' – [[Predication (computer architecture)|predicate masks]] allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized (see the C sketch following this list).
* '''Compress and Expand''' – usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in [[AVX-512#Compress and expand|AVX-512]].
* '''Register Gather, Scatter (aka permute)'''<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions RVV register gather-scatter instructions]</ref> – a less restrictive, more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand, and, being inherently non-sequential, can interfere with [[Chaining (vector processing)|vector chaining]]. Not to be confused with [[Gather-scatter]] memory load/store modes, Gather/scatter vector operations act on the vector registers, and are often termed a permute instruction.
* '''Splat and Extract''' – useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
* '''Iota''' – a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
* '''Sub-vectors''' – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL").<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#mapping-for-lmul-1-2 LMUL > 1 in RVV]</ref> Subvectors are a critical integral part of the [[Vulkan]] [[SPIR-V]] spec.
* '''Sub-vector Swizzle''' – ubiquitous in 3D [[shader]] code, this reorders (and may duplicate) the sub-elements of a vec2/3/4 in a single operation, for example <code>v.wxyz</code> or <code>v.xxyy</code>.
* '''Transcendentals''' – [[Transcendental function|transcendental]] operations such as [[sine]], [[cosine]] and [[logarithm]] feature far more prominently in 3D than in many demanding [[High-performance computing|HPC]] workloads. Of interest, however, is that in 3D speed is far more important than accuracy for GPUs, where computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is explored in the [[MIPS-3D]] extension.
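The masked-operation, compress and register-gather primitives described in the list above can be modelled in a few lines of scalar C; a sketch (function names are illustrative) showing the semantics each vector instruction provides in a single operation:

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Masked operation: where mask[i] is set, y[i] = x[i] * 2.0; other
   elements are left untouched. Hardware applies the operation only
   in active lanes, with no per-element branch. */
void masked_double(size_t n, const uint8_t *mask,
                   const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        if (mask[i]) y[i] = x[i] * 2.0;
}

/* Compress: pack the elements whose mask bit is set into dst,
   preserving order and never duplicating; returns the packed count. */
size_t compress(size_t n, const uint8_t *mask,
                const double *src, double *dst) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) dst[j++] = src[i];
    return j;
}

/* Register gather (permute): dst[i] = src[idx[i]], where one vector
   supplies the indices used to reorder another. */
void gather(size_t n, const size_t *idx,
            const double *src, double *dst) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
</syntaxhighlight>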