{{short description|Microarchitecture of a microprocessor designed to serve a serial computing thread with low latency}}
{{Orphan|date=December 2016}}
'''Latency oriented processor architecture''' is the [[microarchitecture]] of a [[microprocessor]] designed to serve a serial computing [[Thread (computing)|thread]] with a low latency. This is typical of most [[central processing unit]]s (CPU) being developed since the 1970s. These architectures generally aim to execute as many instructions as possible belonging to a single serial thread in a given window of time; however, the time to execute a single instruction completely, from the fetch to the retire stage, may vary from a few cycles to even a few hundred cycles in some cases.<ref>{{cite book|author1=John Paul Shen |author2=Mikko H. Lipasti |year=2013 |title=Modern Processor Design |publisher=McGraw-Hill Professional |isbn=978-1478607830}}</ref>{{page needed|date=November 2016}} Latency oriented processor architectures are the opposite of throughput-oriented processors, which concern themselves more with the total [[throughput]] of the system than with the service [[Latency (engineering)|latencies]] of the individual threads they work on.<ref name=YanSohilin2016>{{cite book|author=Yan Solihin |year=2016 |title=Fundamentals of Parallel Multicore Architecture |publisher=Chapman & Hall/CRC Computational Science |isbn=978-1482211184}}</ref>{{page needed|date=November 2016}}<ref name=GarlandKirk>{{cite journal|title=Understanding Throughput-Oriented Architectures |author1=Michael Garland |author2=David B. Kirk |journal=Communications of the ACM |year=2010 |volume=53 |number=11 |pages=58–66 |doi=10.1145/1839676.1839694 |doi-access=free}}</ref>
 
==Flynn's taxonomy==
{{Main|Flynn's taxonomy}}
Typically, latency oriented processor architectures execute a single task operating on a single data stream, and so they fall under the [[Single instruction, single data|SISD]] classification of Flynn's taxonomy. Latency oriented processor architectures might also include [[Single instruction, multiple data|SIMD]] instruction set extensions, such as Intel [[MMX (instruction set)|MMX]] and [[Streaming SIMD Extensions|SSE]] instructions; even though these extensions operate on large data sets, their primary goal is to reduce overall latency.<ref name=YanSohilin2016/>
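
For illustration, the following C sketch (not taken from the cited sources; the arrays and values are arbitrary examples) uses Intel's SSE intrinsics to add four pairs of floats with a single SIMD instruction, rather than four scalar additions.
<syntaxhighlight lang="c">
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 packed floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one instruction, 4 additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f\n", c[i]);
    return 0;
}
</syntaxhighlight>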
 
==Implementation techniques==
There are many architectural techniques employed to reduce the overall latency for a single computing task. These typically involve adding extra hardware to the [[Pipeline (computing)|pipeline]] to serve instructions as soon as they are fetched from [[Random-access memory|memory]] or the [[CPU cache|instruction cache]]. A notable characteristic of these architectures is that a significant area of the chip is used up in parts other than the [[Execution unit|execution units]] themselves. This is because the intent is to bring down the time required to complete a 'typical' task in a computing environment. A typical computing task is a serial set of instructions with a high dependency on results produced by the previous instructions of the same task. Hence, it makes sense for the microprocessor to spend its time on many tasks other than the calculations required by the individual instructions themselves. If the [[Hazard (computer architecture)|hazards]] encountered during computation are not resolved quickly, then the latency of the thread increases. This is because hazards stall the execution of subsequent instructions and, depending upon the pipeline implementation, may either stall progress completely until the dependency is resolved or lead to an avalanche of further hazards in future instructions, further exacerbating the execution time of the thread.<ref name="quant">{{cite book|author1=John L. Hennessy |author2=David A. Patterson |title=Computer Architecture: A Quantitative Approach |edition=Fifth |year=2013 |publisher=Morgan Kaufmann Publishers |isbn=978-0123838728}}</ref><ref name="interface">{{cite book|author1=David A. Patterson |author2=John L. Hennessy |title=Computer Organization and Design: The Hardware/software Interface |edition=Fifth |year=2013 |publisher=Morgan Kaufmann Publishers |isbn=9780124078864}}</ref>
 
The design space of micro-architectural techniques is very large. Below are some of the most commonly employed techniques to reduce the overall latency for a thread.
 
===Instruction set architecture (ISA)===
{{Main article|Instruction set}}
Most architectures today use shorter and simpler instructions, as in the [[load/store architecture]], which helps in optimizing the instruction pipeline for faster execution. Instructions are usually all of the same size, which also simplifies the instruction fetch logic. Such an ISA is called a [[Reduced instruction set computing|RISC]] architecture.<ref>{{cite conference|last1=Bhandarkar|first1=Dileep|last2=Clark|first2=Douglas W. |title=Proceedings of the fourth international conference on Architectural support for programming languages and operating systems - ASPLOS-IV |chapter=Performance from architecture: Comparing a RISC and a CISC with similar hardware organization |date=1 January 1991 |pages=310–319 |doi=10.1145/106972.107003 |url=http://dl.acm.org/citation.cfm?id=107003&CFID=860927590&CFTOKEN=39315780 |publisher=ACM |isbn=0897913809 |doi-access=free}}</ref>
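
To illustrate why fixed-width instructions simplify fetch and decode logic, the following C sketch (an illustrative example, not drawn from the cited conference paper) extracts the fields of a 32-bit MIPS R-type instruction with nothing more than shifts and masks.
<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* "add $t0, $t1, $t2" in the MIPS R-type encoding */
    uint32_t insn = 0x012A4020;

    /* fixed field positions make decoding trivial */
    uint32_t opcode = (insn >> 26) & 0x3F; /* bits 31..26 */
    uint32_t rs     = (insn >> 21) & 0x1F; /* bits 25..21 */
    uint32_t rt     = (insn >> 16) & 0x1F; /* bits 20..16 */
    uint32_t rd     = (insn >> 11) & 0x1F; /* bits 15..11 */
    uint32_t funct  =  insn        & 0x3F; /* bits  5..0  */

    printf("opcode=%u rs=%u rt=%u rd=%u funct=%u\n",
           opcode, rs, rt, rd, funct);
    return 0;
}
</syntaxhighlight>
Because every instruction is exactly 32 bits, the fetch unit can also compute the address of the next instruction without first decoding the current one.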
 
===Instruction pipelining===
{{Main article|instruction pipeline}}
Pipelining overlaps the execution of multiple instructions from the same executing thread in order to increase clock frequency or to increase the number of instructions that complete per unit time, thereby reducing the overall execution time for a thread. Instead of waiting for a single instruction to complete all of its execution stages, multiple instructions are processed simultaneously, each at a different stage inside the pipeline. {{efn|''Computer Organization and Design: The Hardware/software Interface'', Chapter 4<ref name="interface"/>}}
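
A back-of-the-envelope model illustrates the benefit (the figures below are assumptions chosen purely for illustration): with S stages and N instructions, an unpipelined processor needs roughly N×S cycles, while an ideal pipeline needs S + (N − 1) cycles, since one instruction completes per cycle once the pipeline is full.
<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    const long S = 5;        /* pipeline stages (assumed) */
    const long N = 1000000;  /* instructions in the thread (assumed) */

    long unpipelined = N * S;       /* each instruction runs start to finish */
    long pipelined   = S + (N - 1); /* fill the pipe once, then 1 per cycle */

    printf("unpipelined: %ld cycles\n", unpipelined);
    printf("pipelined:   %ld cycles (speedup ~%.2fx)\n",
           pipelined, (double)unpipelined / pipelined);
    return 0;
}
</syntaxhighlight>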
 
===Register-renaming===
{{Main article|Register renaming}}
This technique is used to make the effective register file larger than the one the ISA exposes to programmers, and to eliminate false dependencies. Suppose we have two consecutive instructions which reference the same register: the first reads the register while the second writes to it. To maintain program correctness, it is essential to make sure that the second instruction does not write to the register before the first can read its original value. This is an example of a [[Write after read|Write-After-Read (WAR)]] dependency. To eliminate this dependency, the pipeline 'renames' the instruction internally by assigning its destination to an internal physical register. The instruction is thus allowed to execute, and its results are immediately available to all subsequent instructions, even though the architectural register intended by the program will be written later. Similarly, if both instructions simply intend to write to the same register ([[data dependency|Write-After-Write (WAW)]]), the pipeline renames them and ensures that their results are available to future instructions without the need to serialize their execution. {{efn|''Computer Architecture: A Quantitative Approach'', Section 3.1<ref name="quant"/>}}
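
The C sketch below models this idea in a few lines; the table sizes and function names are hypothetical, and real rename hardware also tracks free lists and recovery state, which is omitted here. Each architectural write receives a fresh physical register, so a later write cannot disturb an earlier reader.
<syntaxhighlight lang="c">
#include <stdio.h>

#define ARCH_REGS 4

static int rename_table[ARCH_REGS]; /* arch reg -> current phys reg */
static int next_phys = ARCH_REGS;   /* next free physical register */

/* Rename a write: allocate a fresh physical destination. */
static int rename_write(int arch) {
    rename_table[arch] = next_phys++;
    return rename_table[arch];
}

/* Rename a read: use whichever physical register currently holds it. */
static int rename_read(int arch) {
    return rename_table[arch];
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++)
        rename_table[i] = i; /* identity mapping at start */

    /* r1 = r0 + 1  (reads r0) */
    int src  = rename_read(0);
    int dst1 = rename_write(1);
    /* r0 = 7  (WAR on r0: gets a NEW physical register, so the read
       above is unaffected and the two may execute out of order) */
    int dst0 = rename_write(0);

    printf("read of r0 uses p%d; later write of r0 uses p%d\n", src, dst0);
    printf("write of r1 uses p%d\n", dst1);
    return 0;
}
</syntaxhighlight>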
 
===Memory organization===
{{Main article|Memory hierarchy}}
The different levels of memory, which include [[Cache (computing)|caches]], [[main memory]] and [[non-volatile storage]] like hard disks (where the program instructions and data reside), are designed to exploit [[Locality of reference|spatial locality]] and [[temporal locality]] to reduce the total [[memory access time]]. The less time the processor spends waiting for data to be fetched from memory, the fewer instructions consume pipeline resources while sitting idle and doing no useful work. The instruction pipeline will stall completely if all of its internal buffers (for example [[reservation station]]s) are filled to capacity. Hence, if instructions consume fewer idle cycles while inside the pipeline, there is a greater chance of exploiting [[instruction level parallelism]] (ILP), as the fetch logic can pull in a greater number of instructions from the cache/memory per unit time. {{efn|''Computer Organization and Design: The Hardware/software Interface'', Chapter 5<ref name="interface"/>}}
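
The effect of spatial locality can be observed directly from software. In the C sketch below (the matrix size is an arbitrary choice for illustration), the same matrix is summed twice; the row-major walk touches consecutive addresses and therefore hits in the cache far more often than the strided column-major walk, which is typically several times slower.
<syntaxhighlight lang="c">
#include <stdio.h>
#include <time.h>

#define N 2048
static double m[N][N];

int main(void) {
    double sum = 0.0;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)   /* row-major: consecutive addresses */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)   /* column-major: large strides */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    clock_t t2 = clock();

    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("checksum: %f\n", sum); /* keeps the loops from being optimized away */
    return 0;
}
</syntaxhighlight>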
 
===Speculative execution===
{{Main article|Branch predictor}}
A major cause of pipeline stalls is control flow dependency, i.e. when the outcome of a branch instruction is not known in advance (which is usually the case). Many architectures today use branch predictor components to guess the outcome of a branch. Execution continues along the predicted path of the program, but the instructions are tagged as speculative. If the guess turns out to be correct, the instructions are allowed to complete successfully and write their results back to the register file/memory. If the guess was incorrect, all speculative instructions are flushed from the pipeline and execution (re)starts along the actual correct path of the program. By maintaining a high prediction accuracy, the pipeline is able to significantly increase throughput for the executing thread. {{efn|''Computer Architecture: A Quantitative Approach'', Section 3.3<ref name="quant"/>}}
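
Many simple predictors are built from tables of 2-bit saturating counters. The C sketch below (a minimal model, not a description of any particular processor) simulates a single such counter on a hand-picked outcome pattern: states 0-1 predict "not taken", states 2-3 predict "taken", and each actual outcome nudges the counter.
<syntaxhighlight lang="c">
#include <stdio.h>

static int counter = 2; /* start in weakly "taken" state */

static int predict(void) { return counter >= 2; } /* 1 = predict taken */

static void train(int taken) {
    if (taken  && counter < 3) counter++;  /* saturate upward */
    if (!taken && counter > 0) counter--;  /* saturate downward */
}

int main(void) {
    /* outcome history of one branch: taken 4x, not taken once, ... */
    int outcomes[] = {1, 1, 1, 1, 0, 1, 1, 1, 1, 0};
    int n = sizeof outcomes / sizeof outcomes[0];
    int correct = 0;

    for (int i = 0; i < n; i++) {
        if (predict() == outcomes[i]) correct++;
        train(outcomes[i]);
    }
    printf("predicted %d of %d outcomes correctly\n", correct, n);
    return 0;
}
</syntaxhighlight>
The 2-bit hysteresis means a single mispredicted iteration (such as a loop exit) does not immediately flip the prediction for the next run of the loop.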
 
===Out-of-order execution===
{{Main article|Out-of-order execution}}
Not all instructions in a thread take the same amount of time to execute. Superscalar pipelines usually have multiple possible paths for instructions, depending upon the current pipeline state and the instruction type itself. Hence, to increase [[instructions per cycle]] (IPC), the pipeline allows instructions to execute out of order, so that instructions later in the program are not stalled behind an instruction which will take longer to complete. All instructions are registered in a re-order buffer when they are fetched by the pipeline, and they are only allowed to retire (i.e. write back their results) in the order of the original program, so as to maintain correctness. {{efn|''Computer Architecture: A Quantitative Approach'', Sections 3.4, 3.5<ref name="quant"/>}}
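
The C sketch below models only the retire logic of a tiny four-entry re-order buffer; the structure and names are illustrative, not taken from any real design. Instructions may complete in any order, but results become architecturally visible only from the head of the buffer, in program order.
<syntaxhighlight lang="c">
#include <stdio.h>

#define ROB_SIZE 4

struct rob_entry { int id; int done; };
static struct rob_entry rob[ROB_SIZE];
static int head = 0; /* oldest instruction still in flight */

static void complete(int slot) {
    rob[slot].done = 1;
    printf("instruction %d completed (out of order)\n", rob[slot].id);
    /* retire everything that is finished, strictly in program order */
    while (rob[head].done) {
        printf("instruction %d retired (in order)\n", rob[head].id);
        rob[head].done = 0;
        head = (head + 1) % ROB_SIZE;
    }
}

int main(void) {
    for (int i = 0; i < ROB_SIZE; i++)
        rob[i] = (struct rob_entry){ .id = i, .done = 0 };

    /* long-latency instruction 0 finishes last: 2 and 1 complete first
       but must wait in the buffer until 0 retires ahead of them */
    complete(2);
    complete(1);
    complete(0);
    complete(3);
    return 0;
}
</syntaxhighlight>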
 
===Superscalar execution===
{{Main article|Superscalar}}
A superscalar instruction pipeline pulls in multiple instructions in every clock cycle, as opposed to a simple scalar pipeline. This multiplies the [[instruction level parallelism]] (ILP) available to the pipeline by the number of instructions fetched in each cycle, except when the pipeline is stalled due to data or control flow dependencies. Even though the retire rate of superscalar pipelines is usually less than their fetch rate, the overall number of instructions executed per unit time is generally greater than one per cycle, and hence greater than that of a scalar pipeline. {{efn|''Computer Architecture: A Quantitative Approach'', Sections 3.6-3.8<ref name="quant"/>}}
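
A simple throughput model (an assumption chosen for illustration, not a formula from the cited text) shows why a wide fetch helps even when stalls are frequent: a width-W front end that is fully stalled on some fraction of cycles still sustains an IPC of W on the remaining cycles.
<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    const double width = 4.0;          /* instructions fetched per cycle (assumed) */
    const double stall_fraction = 0.4; /* fraction of fully stalled cycles (assumed) */

    /* achieved IPC under this crude model */
    double ipc = width * (1.0 - stall_fraction);

    printf("scalar pipeline IPC upper bound:        1.00\n");
    printf("width-%.0f superscalar IPC (this model): %.2f\n", width, ipc);
    return 0;
}
</syntaxhighlight>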