Vector processor

{{Short description|Computer processor which works on arrays of several numbers at once}}
{{About|Processors (including [[GPU]]s) that were specifically designed from the ground up to handle large Vectors (Arrays)|SIMD instructions present in some general-purpose computers|Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)}}
In [[computing]], a '''vector processor''' is a [[central processing unit]] (CPU) that implements an [[instruction set]] where its [[Instruction (computer science)|instructions]] are designed to operate efficiently and effectively, [[Flynn's taxonomy#Pipelined processors|architecturally sequentially]], on large [[Array data structure|one-dimensional array]]s of data called ''vectors''. This is in contrast to [[scalar processor]]s, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional [[single instruction, multiple data]] (SIMD) or [[SIMD within a register]] (SWAR) arithmetic units. Vector processors can greatly improve performance on certain workloads, notably [[numerical simulation]], [[Data compression|compression]] and similar tasks.<ref>{{Cite web |title=asm-lessons/lesson_01/index.md at main · FFmpeg/asm-lessons |url=https://github.com/FFmpeg/asm-lessons/blob/main/lesson_01/index.md |access-date=2025-04-25 |website=GitHub |language=en}}</ref>

Vector processing techniques also operate in [[video game console|video-game console]] hardware and in [[graphics accelerator]]s, but these are invariably [[single instruction, multiple threads]] (SIMT) and only occasionally [[single instruction, multiple data]] (SIMD).
 
Vector machines appeared in the early 1970s and dominated [[supercomputer]] design through the 1970s into the 1990s, notably the various [[Cray]] platforms. The rapid fall in the [[price-to-performance ratio]] of conventional [[microprocessor]] designs led to a decline in vector supercomputers during the 1990s.
 
== History ==
{{See also|History of supercomputing}}
 
=== Early research and development ===
 
Array processing development began in the early 1960s at the [[Westinghouse Electric Corporation]] in their ''Solomon'' project. Solomon's goal was to dramatically increase math performance by using a large number of simple [[coprocessor]]s under the control of a single master [[central processing unit]] (CPU). The CPU fed a single common instruction to all of the [[arithmetic logic unit]]s (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single [[algorithm]] to a large [[data set]], fed in the form of an array,{{cn|date=July 2023}} leading it to be cited as an example of an array processor in [[Flynn's taxonomy]].
 
In 1962, Westinghouse cancelled the project, but the effort was restarted by the [[University of Illinois at Urbana–Champaign]] as the [[ILLIAC IV]]. Their version of the design originally called for a 1 [[GFLOPS]] machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as [[computational fluid dynamics]], the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs,{{Fact or opinion|date=August 2025}} and is often referred to under a separate category of [[massively parallel]] computing: around 1972 Flynn [[Flynn's taxonomy|categorized]] this type of processing as an early form of [[single instruction, multiple threads]] (SIMT).
 
[[International Computers Limited]] sought to avoid many of the difficulties with the ILLIAC concept with its own [[Distributed Array Processor]] (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.<ref name="newscientist19760617_dap">{{ cite magazine | url=https://archive.org/details/bub_gb_m8S4bXj3dcMC/page/n11/mode/2up | title=Computers by the thousand | magazine=New Scientist | last1=Parkinson | first1=Dennis | date=17 June 1976 | access-date=7 July 2024 | pages=626–627 }}</ref>
 
=== GPU ===
{{See also|Single instruction, multiple threads|Graphics processing unit|Synchronous graphics random-access memory}}
Modern graphics processing units ([[GPUs]]) include an array of [[shaders|shader pipelines]] which may be driven by [[compute kernel]]s, and being usually SIMT are frequently miscategorised as vector processors, the two being very similar categorisations of SIMD in [[Flynn's taxonomy#Single instruction stream, multiple data streams (SIMD)|Flynn's 1972 SIMD taxonomy]]. GPUs use a strategy for hiding memory latencies as they run into an extreme form of the [[Random-access memory#Memory wall|memory wall]].{{Citation needed|date=July 2025}} As shown in [[Flynn's taxonomy|Flynn's 1972 paper]], the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably conventional: they have their own ALUs, their own register files, their own load/store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This is ''significantly'' more complex and involved than [[Flynn's Taxonomy#Pipelined processor|"Packed SIMD"]], which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW<ref>[http://miaowgpu.org/ MIAOW Vertical Research Group]</ref> team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.<ref>[https://github.com/VerticalResearchGroup/miaow/wiki/Architecture-Overview MIAOW GPU]</ref>
 
 
=== Recent development ===
* '''Pure (fixed) SIMD''' - also known as "Packed SIMD",<ref>{{cite conference|first1=Y.|last1=Miyaoka|first2=J.|last2=Choi|first3=N.|last3=Togawa|first4=M.|last4=Yanagisawa|first5=T.|last5=Ohtsuki|title=An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions|conference=Asia-Pacific Conference on Circuits and Systems|date=2002|pages=171–176|volume=1|doi=10.1109/APCCAS.2002.1114930|hdl=2065/10689|hdl-access=free}}</ref> [[SIMD within a register]] (SWAR), and [[Flynn's taxonomy#Pipelined processor|Pipelined Processor]] in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's [[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]] and [[Advanced Vector Extensions|AVX]] instructions, AMD's [[3DNow!]] extensions, [[ARM NEON]], Sparc's [[Visual Instruction Set|VIS]] extension, [[PowerPC]]'s [[AltiVec]] and MIPS' [[MIPS_architecture#Application-specific_extensions|MSA]]. In 2000, [[IBM]], [[Toshiba]] and [[Sony]] collaborated to create the [[Cell processor]], which is also SIMD.
* '''Predicated SIMD''' - Two notable examples which have per-element [[Predication_(computer_architecture)|predication]] are [[Scalable Vector Extension|ARM SVE2]] and [[AVX-512]].
* '''Pure Vectors''' - as categorised in [[Duncan's taxonomy#Pipelined vector processors|Duncan's taxonomy]] - these include the original [[Cray-1]], [[Convex Computer|Convex C-Series]], [[NEC SX]], [[Fujitsu VP]] series, [[IBM 3090#Vector facility|IBM 3090 Vector facility]] and [[RISC-V#Vector set|RISC-V RVV]]. Although memory-based, both the [[TI_Advanced_Scientific_Computer#Architecture|TI ASC]] and the [[CDC STAR-100]] were also vector processors.
 
Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, typically known as [[MIMD]] (Multiple Instruction, Multiple Data) and realized with [[VLIW]] (Very Long Instruction Word) and [[Explicitly parallel instruction computing|EPIC]] (Explicitly Parallel Instruction Computing). The [[Fujitsu FR-V]] low-power multimedia processor combines VLIW with [[Predication_(computer_architecture)|predication]] and Packed SIMD.<ref>* {{Cite book
 
Pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors). Close analysis and comparison of historic and modern ISAs shows that actual vector ISAs provide a way to set the vector length at runtime, including to a non-power-of-two quantity. In other words, the number of elements is ''not'' hard-encoded in the instruction as it is in SIMD ISAs. This is explained in the "SIMD Instructions Considered Harmful" article.<ref>{{cite web | title=SIMD Instructions Considered Harmful | date=18 September 2017 | url=https://www.sigarch.org/simd-instructions-considered-harmful/ |author1=David Patterson |author1-link=David Patterson |author2=Andrew Waterman |work=Computer Architecture Today |quote=Unlike SIMD, it has a vector length register vl, which makes the vector instructions work at any value of n.}}</ref>
 
<blockquote>
Unlike SIMD, it has a vector length register vl, which makes the vector instructions work at any value of n.
</blockquote>
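As an illustration, the following is a minimal sketch of a strip-mined loop in RISC-V Vector (RVV) assembler; the RVV 1.0 syntax and register allocation shown here are illustrative assumptions rather than code taken from the cited article. The same instructions handle any element count, including non-power-of-two values, because {{code|vsetvli}} clamps the vector length for each pass through the loop.

<syntaxhighlight lang=gas>
# illustrative sketch: a0 = n (element count), a1 = x pointer, a2 = y pointer, a3 = result pointer
loop:
    vsetvli t0, a0, e32, m1, ta, ma  # VL = min(n, hardware maximum); t0 = VL
    vle32.v v0, (a1)                 # load VL elements of x
    vle32.v v1, (a2)                 # load VL elements of y
    vadd.vv v2, v0, v1               # element-wise add of VL elements
    vse32.v v2, (a3)                 # store VL results
    slli    t1, t0, 2                # bytes consumed this pass (VL * 4)
    add     a1, a1, t1               # advance x pointer
    add     a2, a2, t1               # advance y pointer
    add     a3, a3, t1               # advance result pointer
    sub     a0, a0, t0               # n -= VL
    bnez    a0, loop                 # repeat until every element is processed
</syntaxhighlight>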
 
* the original [[Cray-1]] {{code|0020}} instruction<ref>{{cite book |url=https://bitsavers.org/pdf/cray/CRAY-1/HR-0808_CRAY-1S_HWref_Nov81.pdf |title=Cray-1 S Series Hardware Reference Manual |date=November 1981 |publisher=Cray Research}}</ref>
 
==== Vector chaining ====
[[File:Comparison SIMD vs Vector.png|thumb|500px]]
 
Through [[Chaining (vector processing)|vector chaining]], vector processors can use slower hardware and save power while still achieving throughput comparable to SIMD with lower latency.<ref>{{Cite web|url=http://thebeardsage.com/vector-architecture/|title = Vector Architecture|date = 27 April 2020}}</ref><ref>[http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture11-vector.pdf Vector and SIMD processors, slides 12-13]</ref>
</syntaxhighlight>
 
This is essentially not very different from the SIMD version (processes 4 data elements per loop if MVL is 4), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is ''going'' to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
 
A number of things to note, when comparing against the Predicated SIMD assembly variant:
<syntaxhighlight lang=gas>
VLVCU GR4 # Load VCT, update GR4 & CCode
vloop:
vld32 v0, x, v0 # load vector x, update v0
vld32 v1, y # load vector y (no update v1)
vst32 v1, y, v1 # store Y, now update v1
VLVCU GR4 # Load VCT, update GR4 & CCode
BC 3,vloop # Branch back if VCT>0
</syntaxhighlight>
 
 
<syntaxhighlight lang=gas>
# the IBM 370 VLVCU GR4 instruction does the following:
setvl VL, GR4 # VL = min(MVL, n), where n is held in GR4
sub GR4, VL   # n (GR4) -= VL, also set CC
</syntaxhighlight>
 
Further, both the Vector-Load and Vector-Store instructions could perform the update of the address registers, saving on explicit instructions to perform address-calculations that ''in effect'' had already been done, behind the scenes, when loading/storing each element of the Vector.
 
The exact same principle is used in [[Libre-SOC]], illustrated with a DAXPY (64-bit floating-point) example:<ref>{{cite web | title=Daxpy example | url=https://libre-soc.org/openpower/sv/cookbook/daxpy_example/ }}</ref>
 
<!-- Copyright (c) 2023,2024 Luke Kenneth Casson Leighton -->
<!-- This assembler code is hereby CC BY-SA 4.0 Licensed by the author, [[User:Lkcl]] -->
<!-- the assembly is also released under the GFDL on https://libre-soc.org -->
<syntaxhighlight lang=gas>
# r5: n count; r6: x ptr; r7: y ptr; fp1: a
mtctr 5 # move n to CTR
vloop:
setvl MAXVL=32,VL=CTR # actually VL=MIN(MAXVL,CTR)
# DAXPY is 8-byte, hence 8(r6) and 8(r7)
sv.lfdup *32,8(r6) # load x into fp32-63, incr x
sv.lfd/els *64,8(r7) # load y into fp64-95, NO INC
sv.fmadd *64,*32,1,*64 # (*y) = (*x) * a + (*y)
sv.stfdup *64,8(r7) # store at y, post-incr y
sv.bc/ctr vloop # decr CTR by VL, jump !zero
</syntaxhighlight>
 
Post-update modes are added to Vectorised versions of Load-with-Update and Store-with-Update, again saving on the need for a separate instruction for address computation when the CPU's Load/Store internals have already done it. Note again that, as with the IBM 370 example, the vector y address is updated by the Store-with-Update, not a Load-with-Update.
The [[Power ISA]] CTR register is used for the total count instead of a scalar register, and the {{code|sv.bc/ctr}} instruction reduces CTR by the current vector length (VL), then tests CTR for zero and branches if it is not.
 
The approach is different but achieves the same end-result as the IBM 3090: a significant compacting of instructions that are already considered highly compact.<ref>{{cite web | title=SIMD Instructions Considered Harmful | date=18 September 2017 | url=https://www.sigarch.org/simd-instructions-considered-harmful/ }}</ref> A [[RISC-V]] Vector DAXPY inner loop example is 10 instructions, whereas the Libre-SOC and IBM 370 loops shown above are both 6: a 40% reduction.
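For comparison, below is a hedged sketch of what such a ten-instruction RISC-V Vector (RVV) DAXPY inner loop can look like; it uses the ratified RVV 1.0 assembler syntax with illustrative register choices and is not a verbatim copy of the code in the cited article. The extra instructions relative to the two loops above come from the explicit pointer and count bookkeeping, which the IBM 370 and Libre-SOC examples fold into post-update addressing and a combined decrement-and-branch.

<syntaxhighlight lang=gas>
# illustrative sketch: a0 = n, a1 = x pointer, a2 = y pointer, fa0 = a
loop:
    vsetvli   t0, a0, e64, m1, ta, ma  # VL = min(n, hardware maximum)
    vle64.v   v0, (a1)                 # load VL elements of x
    vle64.v   v1, (a2)                 # load VL elements of y
    vfmacc.vf v1, fa0, v0              # y[i] += a * x[i] for VL elements
    vse64.v   v1, (a2)                 # store VL elements back to y
    slli      t1, t0, 3                # bytes consumed this pass (VL * 8)
    add       a1, a1, t1               # advance x pointer
    add       a2, a2, t1               # advance y pointer
    sub       a0, a0, t0               # n -= VL
    bnez      a0, loop                 # loop until all elements are done
</syntaxhighlight>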
 
=== Vector reduction example ===
but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.
 
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. The problems compound as the SIMD hardware width increases (an entirely new set of instructions is introduced), and, worse, some instructions for doing a horizontal sum, such as those in the AMD [[XOP instruction set]], were ''removed'' after a few years. Some techniques involve shuffle: examples requiring highly detailed knowledge of [[AVX-512]] and [[Streaming SIMD Extensions|SSE]] can be found online showing how to do a "Horizontal Sum".<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref>
 
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
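A true vector ISA, by contrast, can express the entire reduction with a single reduction instruction inside the usual strip-mined loop, leaving the inter-lane summation to the hardware. The following is a minimal sketch using the RVV {{code|vredsum.vs}} instruction; integer summation is shown, and the register allocation and RVV 1.0 syntax are illustrative assumptions rather than the article's own example.

<syntaxhighlight lang=gas>
# illustrative sketch: a0 = n, a1 = x pointer; the running total is kept in element 0 of v8
    vsetvli    t0, zero, e32, m1, ta, ma # establish a non-zero VL for the initial move
    vmv.s.x    v8, zero                  # running sum = 0 (element 0 of v8)
loop:
    vsetvli    t0, a0, e32, m1, ta, ma   # VL = min(n, hardware maximum)
    vle32.v    v0, (a1)                  # load VL elements of x
    vredsum.vs v8, v0, v8                # v8[0] = sum(v0[0..VL-1]) + v8[0]
    slli       t1, t0, 2                 # bytes consumed this pass (VL * 4)
    add        a1, a1, t1                # advance x pointer
    sub        a0, a0, t0                # n -= VL
    bnez       a0, loop                  # repeat until every element is summed
    vmv.x.s    a0, v8                    # move the final scalar result to a0
</syntaxhighlight>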
Compared to any SIMD processor claiming to be a vector processor, the order of magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
# From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being successful, regardless of alignment of the start of the vector.
# Whilst from the reduction example it can be seen that, aside from permute instructions, SIMD by definition avoids inter-lane operations entirely (element 0 can only be added to another element 0), vector processors tackle this head-on. What programmers are forced to do in software (using shuffle and other tricks, to swap data into the right "lane") vector processors must do in hardware, automatically.
 
Overall then there is a choice to either have
* '''Vector Load and Store''' – Vector architectures with a register-to-register design (analogous to load&ndash;store architectures for scalar processors) have instructions for transferring multiple elements between the memory and the vector registers. Typically, multiple addressing modes are supported.
** The unit-stride addressing mode is essential: elements are contiguous.
** Modern vector architectures typically also support arbitrary constant strides (offsets),
** as well as scatter/gather (also called ''indexed'') addressing modes.
** Advanced architectures may also include support for ''segment'' loads and stores. Segment loads read a vector from memory where each element is a [[data structure]] containing multiple members; the members are extracted from each data structure (element), and each extracted member is placed into a different vector register.
** Some advanced architectures have ''fail-first'' variants of the standard vector loads and stores (explained below).
* '''Masked Operations''' – [[Predication (computer architecture)|predicate masks]] allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
* '''Compress and Expand''' – usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in [[AVX-512#Compress and expand|AVX-512]].
* '''Register Gather, Scatter (aka permute)'''<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions RVV register gather-scatter instructions]</ref> – a less restrictive more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand, and, being inherently non-sequential, can interfere with [[Chaining (vector processing)|vector chaining]]. Not to be confused with [[Gather-scatter]] Memory Load/Store modes, Gather/scatter vector operations act on the vector registers, and are often termed a [[permute instruction]] instead.
* '''Splat and Extract''' – useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
* '''Iota''' – a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
 
* '''Sub-vectors''' – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL").<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#mapping-for-lmul-1-2 LMUL > 1 in RVV]</ref> Subvectors are a critical integral part of the [[Vulkan]] [[SPIR-V]] spec.
* '''Sub-vector [[Swizzling (computer graphics)|swizzle]]''' – aka "Lane Shuffling" which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes" and also saves predicate mask bits. Effectively an in-flight [[permute instruction|mini-permute]] of the sub-vector, this heavily features in 3D shader binaries (as high as 20% of all instructions<ref>{{cite web | title=Tom Forsyth - the Lifecycle of an Instruction Set | date=22 August 2020 | url=https://vimeo.com/450406346 }}</ref>) and is sufficiently important as to be part of the Vulkan SPIR-V spec. The Broadcom [[Videocore]] IV uses the terminology "Lane rotate"<ref>[https://patents.google.com/patent/US20110227920 Abandoned US patent US20110227920-0096]</ref> where the rest of the industry uses the term "swizzle".<ref>[https://github.com/hermanhermitage/videocoreiv-qpu Videocore IV QPU]</ref>
* '''Transcendentals''' – [[Transcendental function|transcendental]] operations such as [[sine]], [[cosine]] and [[logarithm]] obviously feature much more prominently in 3D than in many demanding [[High-performance computing|HPC]] workloads. Of interest, however, is that speed is far more important than accuracy in 3D for GPUs, where computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is explored in the [[MIPS-3D]] extension.