{{short description|Microarchitecture of a microprocessor designed to serve a serial computing thread with low latency}}
'''Latency oriented processor architecture''' is the [[microarchitecture]] of a [[microprocessor]] designed to serve a serial computing [[Thread (computing)|thread]] with a low latency. This is typical of most [[central processing unit]]s (CPUs) developed since the 1970s. These architectures generally aim to execute as many instructions as possible belonging to a single serial thread in a given window of time; however, the time to execute a single instruction completely, from fetch to retire, may vary from a few cycles to a few hundred cycles in some cases.<ref name=YanSohilin2016/>
==Flynn's taxonomy==
{{Main|Flynn's taxonomy}}
Typically, latency oriented processor architectures execute a single task operating on a single data stream, and so they are [[Single instruction, single data|SISD]] under Flynn's taxonomy. Latency oriented processor architectures might also include [[Single instruction, multiple data|SIMD]] instruction set extensions such as Intel [[MMX (instruction set)|MMX]] and [[Streaming SIMD Extensions|SSE]]; even though these extensions operate on large data sets, their primary goal is to reduce overall latency.<ref name=YanSohilin2016/>
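A minimal sketch of such an extension in use (assuming an x86 compiler with SSE headers available): a single <code>ADDPS</code> instruction performs four single-precision additions for one thread, shortening that thread's critical path rather than serving additional threads.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    /* Four single-precision floats packed into one 128-bit register. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    /* One SIMD instruction (ADDPS) performs all four additions at once. */
    __m128 sum = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, sum);
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}
</syntaxhighlight>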
==Implementation techniques==
There are many architectural techniques employed to reduce the overall latency for a single computing task. These typically involve adding hardware to the [[Pipeline (computing)|pipeline]] so that instructions can be served as soon as they are fetched from [[Computer memory|memory]] or the [[CPU cache|instruction cache]].
The design space of micro-architectural techniques is very large. Below are some of the most commonly employed techniques to reduce the overall latency for a thread.
===Instruction set architecture (ISA)===
{{Main|Instruction set architecture}}
Most architectures today use shorter and simpler instructions, like the [[load/store architecture]], which help in optimizing the instruction pipeline for faster execution. Instructions are usually all of the same size, which also helps in optimizing the instruction fetch logic. Such an ISA is called a [[Reduced instruction set computing|RISC]] architecture.<ref>{{cite conference|last1=Bhandarkar|first1=Dileep|last2=Clark|first2=Douglas W.|title=Performance from architecture: comparing a RISC and a CISC with similar hardware organization|conference=Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV)|year=1991}}</ref>
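The practical consequence of a load/store ISA can be sketched as follows (the mnemonics and register names in the comments are hypothetical, for illustration only): a single C statement that adds two values in memory must be decomposed into explicit loads, a register-to-register add, and a store.

<syntaxhighlight lang="c">
/* On a load/store (RISC-style) architecture, arithmetic operates only on
   registers, so the compiler must emit explicit loads and stores. */
void add_in_memory(const int *a, const int *b, int *c) {
    *c = *a + *b;
    /* Hypothetical RISC-style lowering of the statement above:
         LW  r1, 0(a)    ; load *a into register r1
         LW  r2, 0(b)    ; load *b into register r2
         ADD r3, r1, r2  ; register-to-register add
         SW  r3, 0(c)    ; store the result to *c
       A CISC ISA might instead encode this as one variable-length
       memory-to-memory instruction. */
}
</syntaxhighlight>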
===Instruction pipelining===
{{Main|Instruction pipelining}}
Pipelining overlaps execution of multiple instructions from the same executing thread in order to increase clock frequency or to increase the number of instructions that complete per unit time, thereby reducing the overall execution time for a thread. Instead of waiting for a single instruction to complete all of its execution stages, multiple instructions are processed simultaneously, each at its respective stage inside the pipeline. {{efn|''Computer Organization and Design: The Hardware/software Interface'', Chapter 4<ref name="interface"/>}}
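A first-order timing model (illustrative only, ignoring stalls and hazards) shows why this helps: with a ''k''-stage pipeline, ''n'' instructions take about ''k'' + (''n'' − 1) cycles instead of ''n'' × ''k''.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Ideal pipeline timing: the first instruction needs k cycles to fill the
   pipeline; each later instruction completes one cycle after the previous. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k) {
    return (n == 0) ? 0 : k + (n - 1);
}

int main(void) {
    unsigned long n = 1000000, k = 5;
    printf("unpipelined: %lu cycles\n", n * k);                  /* 5000000 */
    printf("pipelined:   %lu cycles\n", pipelined_cycles(n, k)); /* 1000004 */
    return 0;
}
</syntaxhighlight>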
===Register renaming===
{{Main|Register renaming}}
This technique is used to increase the effective register file size beyond that specified by the ISA, and to eliminate false dependencies. Suppose we have two consecutive instructions which reference the same register: the first reads the register while the second writes to it. To maintain program correctness, it is essential that the second instruction does not write to the register before the first can read its original value. This is an example of a [[Hazard (computer architecture)#Write after read (WAR)|write-after-read]] (WAR) dependency. The pipeline removes it by renaming the second instruction's destination to a different physical register, so both instructions can proceed without conflict.
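A toy sketch of the bookkeeping involved (a hypothetical machine with 8 architectural and 32 physical registers; real renaming logic also tracks free lists and recycles registers at retirement):

<syntaxhighlight lang="c">
#include <stdio.h>

#define NUM_ARCH_REGS 8
#define NUM_PHYS_REGS 32

static int rename_map[NUM_ARCH_REGS];  /* architectural -> physical */
static int next_free = NUM_ARCH_REGS;  /* naive allocator: hand out in order */

/* Give a write to an architectural register a fresh physical register, so a
   later write can never clobber a value an older instruction still reads. */
int rename_dest(int arch_reg) {
    if (next_free >= NUM_PHYS_REGS) return -1;  /* out of physical registers */
    rename_map[arch_reg] = next_free++;
    return rename_map[arch_reg];
}

int main(void) {
    for (int r = 0; r < NUM_ARCH_REGS; r++) rename_map[r] = r;

    int src_of_i1 = rename_map[1];   /* I1 reads r1's current mapping */
    int dst_of_i2 = rename_dest(1);  /* I2 writes r1: new physical register */

    /* I1 reads p1 while I2 writes p8: the WAR dependency is gone. */
    printf("I1 reads p%d, I2 writes p%d\n", src_of_i1, dst_of_i2);
    return 0;
}
</syntaxhighlight>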
===Memory organization===
{{Main|Memory hierarchy}}
The different levels of memory, including [[Cache (computing)|caches]], [[main memory]] and [[non-volatile storage]] such as hard disks (where the program instructions and data reside), are designed to exploit [[Locality of reference|spatial locality]] and [[temporal locality]] to reduce the total [[memory access time]]. The less time the processor spends waiting for data to be fetched from memory, the fewer instructions consume pipeline resources while sitting idle and doing no useful work. The instruction pipeline will stall completely if all of its internal buffers (for example, [[reservation station]]s) are filled to capacity.
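The standard first-order model of this effect is the average memory access time (AMAT); a minimal sketch with made-up numbers:

<syntaxhighlight lang="c">
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty, for a single cache level. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g. a 1-cycle L1 hit, 5% miss rate, 100-cycle penalty to main memory */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));  /* 6.0 cycles */
    return 0;
}
</syntaxhighlight>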
===Speculative execution===
{{Main|Speculative execution}}
A major cause of pipeline stalls is control flow dependencies, i.e. when the outcome of a branch instruction is not known in advance (which is usually the case). Many architectures today use [[branch predictor]] components to guess the outcome of a branch. Execution continues along the predicted path of the program, but instructions are tagged as speculative. If the guess turns out to be correct, the instructions are allowed to complete successfully and to write their results back to the register file/memory. If the guess was incorrect, all speculative instructions are flushed from the pipeline and execution (re)starts along the actual correct path for the program. By maintaining a high prediction accuracy, the pipeline is able to significantly increase throughput for the executing thread. {{efn|''Computer Architecture: A Quantitative Approach'', Section 3.3<ref name="quant"/>}}
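A classic building block for such predictors is the 2-bit saturating counter, sketched below (illustrative only; real predictors index tables of such counters by branch address and history):

<syntaxhighlight lang="c">
#include <stdio.h>

/* 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
   A single mispredict does not flip a strongly biased prediction. */
typedef struct { unsigned counter; /* 0..3 */ } predictor_t;

int predict(const predictor_t *p) { return p->counter >= 2; }

void update(predictor_t *p, int taken) {
    if (taken  && p->counter < 3) p->counter++;
    if (!taken && p->counter > 0) p->counter--;
}

int main(void) {
    predictor_t p = { 2 };               /* start in "weakly taken" */
    int outcomes[] = { 1, 1, 1, 0, 1 };  /* actual branch behaviour */
    int correct = 0;
    for (int i = 0; i < 5; i++) {
        correct += (predict(&p) == outcomes[i]);
        update(&p, outcomes[i]);
    }
    printf("%d/5 predictions correct\n", correct);  /* 4/5 */
    return 0;
}
</syntaxhighlight>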
===Out-of-order execution===
Not all instructions in a thread take the same amount of time to execute. Superscalar pipelines usually have multiple possible paths for instructions depending upon the current state and the instruction type itself. Hence, to increase [[instructions per cycle]] (IPC), the pipeline allows execution of instructions out of order so that instructions later in the program are not stalled behind an instruction which will take longer to complete. All instructions are registered in a [[re-order buffer]] when they are fetched by the pipeline and allowed to retire (i.e. write back their results) in the order of the original program so as to maintain correctness. {{efn|''Computer Architecture: A Quantitative Approach'', Sections 3.4, 3.5<ref name="quant"/>}}
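The in-order-retire discipline can be sketched with a toy model (illustrative only; a real re-order buffer also tracks operands, exceptions and speculation state):

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    /* Three independent instructions issued at cycle 0 with latencies
       10, 2 and 3: I1 and I2 finish before I0, but must retire after it. */
    int finish[3] = { 10, 2, 3 };
    int retire = 0;

    for (int i = 0; i < 3; i++) {
        /* An entry retires only after it has finished AND every older entry
           has retired; assume at most one retirement per cycle. */
        retire = (finish[i] > retire) ? finish[i] : retire + 1;
        printf("I%d finished at cycle %2d, retired at cycle %2d\n",
               i, finish[i], retire);
    }
    return 0;
}
</syntaxhighlight>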
===Superscalar pipelines===
{{Main|Superscalar processor}}
A superscalar instruction pipeline fetches multiple instructions in every clock cycle, as opposed to a simple scalar pipeline. This increases [[instruction-level parallelism]] (ILP) by up to a factor of the number of instructions fetched in each cycle, except when the pipeline is stalled due to data or control flow dependencies. Even though the retire rate of superscalar pipelines is usually less than their fetch rate, the overall number of instructions executed per unit time is generally greater than one per cycle, and greater than that of a scalar pipeline. {{efn|''Computer Architecture: A Quantitative Approach'', Sections 3.6-3.8<ref name="quant"/>}}
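In the ideal, stall-free case the arithmetic is simple (a sketch with made-up numbers): a ''w''-wide front end divides the cycle count by up to ''w''.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Ideal cycles for a w-wide superscalar pipeline: ceil(n / w), no stalls. */
unsigned long superscalar_cycles(unsigned long n, unsigned long w) {
    return (n + w - 1) / w;
}

int main(void) {
    unsigned long n = 1000000;
    printf("scalar (1-wide): %lu cycles\n", superscalar_cycles(n, 1));
    printf("4-wide:          %lu cycles\n", superscalar_cycles(n, 4));
    return 0;
}
</syntaxhighlight>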
==Contrast with throughput oriented processor architectures==
In contrast, a [[throughput]] oriented processor architecture is designed to maximize the amount of 'useful work' done in a given window of time, where useful work refers to large calculations over a significant amount of data. Such architectures parallelize the workload so that many calculations can be performed simultaneously; the calculations may belong to a single task or to a limited number of tasks. The time required to complete a single calculation is significantly longer than on a latency oriented architecture, but the total time to complete a large set of calculations is significantly reduced: latency is sacrificed in order to achieve a higher overall throughput.
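The trade-off can be made concrete with deliberately invented numbers: a latency oriented core that finishes one task in 1 time unit beats a throughput oriented device that needs 4 units per task, until the number of tasks grows large enough for the device's parallel lanes to dominate.

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    /* Invented parameters for illustration only. */
    double core_per_task   = 1.0;  /* latency core: time units per task  */
    double device_per_task = 4.0;  /* throughput device: slower per task */
    int    lanes           = 64;   /* tasks the device runs in parallel  */
    int    n_tasks         = 1024;

    printf("1 task:     core %4.0f vs device %4.0f time units\n",
           core_per_task, device_per_task);
    printf("%d tasks: core %4.0f vs device %4.0f time units\n",
           n_tasks,
           n_tasks * core_per_task,
           n_tasks / (double)lanes * device_per_task);  /* 1024 vs 64 */
    return 0;
}
</syntaxhighlight>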
Latency oriented processors expend substantial chip area on sophisticated control structures like branch prediction, [[Operand forwarding|data forwarding]], the [[re-order buffer]], large register files and caches in each processor. These structures help reduce operational latency and memory-access time per instruction, and make results available as soon as possible. Throughput oriented architectures, on the other hand, usually have a multitude of processors with much smaller caches and simpler control logic. This helps to efficiently utilize the memory bandwidth and increase the total number of execution units on the same chip area.
[[Graphics processing unit|GPUs]] are a typical example of throughput oriented processor architectures.
==Notes==
{{notelist}}
==References==
{{Reflist}}
[[Category:Microprocessors]]