Latency oriented processor architecture

{{Orphan|date=December 2016}}
'''Latency oriented processor architecture''' is the [[microarchitecture]] of a [[microprocessor]] designed to serve a serial computing [[Thread (computing)|thread]] with a low latency. This is typical of most [[central processing unit]]s (CPUs) developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread in a given window of time; however, the time to execute a single instruction completely, from the fetch to the retire stage, may vary from a few cycles to even a few hundred cycles in some cases.<ref>{{cite book |author1=John Paul Shen |author2=Mikko H. Lipasti |year=2013 |title=Modern Processor Design |publisher=McGraw-Hill Professional |isbn=1478607831}}</ref>{{page needed|date=November 2016}} Latency oriented processor architectures are the opposite of throughput-oriented processors, which concern themselves more with the total [[throughput]] of the system than with the service [[Latency (engineering)|latencies]] for all individual threads that they work on.<ref name=YanSohilin2016>{{cite book |author=Yan Solihin |year=2016 |title=Fundamentals of Parallel Multicore Architecture |publisher=Chapman & Hall/CRC Computational Science |isbn=978-1482211184}}</ref>{{page needed|date=November 2016}}<ref name=GarlandKirk>{{cite journal |title=Understanding Throughput-Oriented Architectures |author1=Michael Garland |author2=David B. Kirk |journal=Communications of the ACM |year=2010 |volume=53 |number=11 |pages=58–66}}</ref>
 
==[[Flynn's taxonomy]]==
Latency oriented processor architectures would normally fall into the category of [[SISD]] classification under Flynn's taxonomy. This implies that a typical characteristic of latency oriented processor architectures is to execute a single task operating on a single data stream. Some [[SIMD]] style multimedia extensions of popular instruction sets, such as Intel [[MMX (instruction set)|MMX]] and [[Streaming SIMD Extensions|SSE]] instructions, should also fall under the category of latency oriented processor architectures,<ref name=YanSohilin2016/> because, although they operate on a large data set, their primary goal is also to reduce the overall latency for the entire task at hand.
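
For illustration (a minimal sketch, not taken from the cited sources; the function name is invented for the example), the following C fragment uses SSE intrinsics to add two arrays four floats at a time. Each instruction touches multiple data elements, yet the aim is still to finish this one serial task sooner, i.e. to lower its latency, rather than to serve many independent threads:
<syntaxhighlight lang="c">
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    /* Process four floats per instruction. */
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
    }
    for (; i < n; i++)  /* scalar tail for leftover elements */
        dst[i] = a[i] + b[i];
}
</syntaxhighlight>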
 
==Implementation techniques==
There are many architectural techniques employed to reduce the overall latency for a single computing task. These typically involve adding additional hardware in the [[Pipeline (computing)|pipeline]] to serve instructions as soon as they are fetched from [[Random-access memory|memory]] or the [[CPU cache|instruction cache]]. A notable characteristic of these architectures is that a significant area of the chip is used up in parts other than the [[Execution unit|execution units]] themselves. This is because the intent is to bring down the time required to complete a 'typical' task in a computing environment. A typical computing task is a serial set of instructions with a high dependency on results produced by the previous instructions of the same task. Hence, the microprocessor spends much of its time on work other than the calculations required by the individual instructions themselves. If the [[Hazard (computer architecture)|hazards]] encountered during computation are not resolved quickly, then the latency for the thread increases. This is because hazards stall the execution of subsequent instructions and, depending upon the pipeline implementation, may either stall progress completely until the dependency is resolved or lead to an avalanche of further hazards in future instructions, further increasing the execution time for the thread.<ref name="quant">{{cite book |author1=John L. Hennessy |author2=David A. Patterson |title=Computer Architecture: A Quantitative Approach |edition=Fifth |year=2013 |publisher=Morgan Kaufmann Publishers |isbn=012383872X}}</ref><ref name="interface">{{cite book |author1=David A. Patterson |author2=John L. Hennessy |title=Computer Organization and Design: The Hardware/software Interface |edition=Fifth |year=2013 |publisher=Morgan Kaufmann Publishers |isbn=9780124078864}}</ref>
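
The effect of such dependencies can be sketched in software (a minimal illustrative example, not taken from the cited sources; note that the second function reassociates floating-point additions). In the first C function below every addition has a read-after-write dependency on the one before it, so the additions serialize, while the second uses independent accumulators that the hardware can keep in flight simultaneously:
<syntaxhighlight lang="c">
#include <stddef.h>

/* Each iteration depends on the previous result, so a
 * read-after-write hazard serializes the additions. */
double sum_chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];  /* must wait for the previous add to finish */
    return s;
}

/* Four independent accumulators break the chain; several
 * additions can be in flight at once (more instruction
 * level parallelism for the hardware to exploit). */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)  /* scalar tail */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
</syntaxhighlight>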
 
The design space of micro-architectural techniques is very large. Below are some of the most commonly employed techniques to reduce the overall latency for a thread.
===Memory Organization===
{{Main article|Memory hierarchy}}
The different levels of memory, which include [[Cache (computing)|caches]], [[main memory]] and [[non-volatile storage]] like hard disks (where the program instructions and data reside), are designed to exploit [[Locality of reference|spatial locality]] and [[temporal locality]] to reduce the total [[memory access time]]. The less time the processor spends waiting for data to be fetched from memory, the fewer instructions consume pipeline resources while just sitting idle and doing no useful work. The instruction pipeline will be completely stalled if all its internal buffers (for example [[reservation station]]s) are filled to their respective capacities. Hence, if instructions consume fewer idle cycles while inside the pipeline, there is a greater chance of exploiting [[instruction level parallelism]] (ILP), as the fetch logic can pull in a greater number of instructions from the cache/memory per unit time.{{efn|''Computer Organization and Design: The Hardware/software Interface'', Chapter 5<ref name="interface"/>}}
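
Locality can be illustrated from the software side (a minimal sketch, not taken from the cited sources). Both C functions below compute the same sum, but the first walks the matrix in the order it is laid out in memory and therefore reuses each fetched cache line, while the second touches a different cache line on almost every access and spends far more cycles waiting on the memory hierarchy:
<syntaxhighlight lang="c">
#include <stddef.h>

#define N 1024
static double a[N][N];  /* stored row by row (row-major order) */

/* Row-major traversal: consecutive accesses fall in the same
 * cache line (spatial locality), so few loads stall the pipeline. */
double sum_rows(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column traversal: each access jumps N elements ahead, landing
 * in a different cache line, so the loop misses far more often. */
double sum_cols(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
</syntaxhighlight>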
 
===Speculative Execution===
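{{Main article|Speculative execution}}
Control hazards arise from branch instructions whose outcome is not known until late in the pipeline. Rather than stall, latency oriented processors use [[branch predictor]]s to guess the outcome and speculatively fetch and execute instructions along the predicted path; if the guess turns out to be wrong, the speculative results are discarded and execution resumes on the correct path, at the cost of several wasted cycles.

Whether speculation pays off depends on how predictable the branches are. As a minimal illustrative sketch (not from the cited sources; the function name is invented), the branch in the following C loop is guessed correctly almost every time when the input data follows a stable pattern, letting speculation hide the control hazard, whereas on random data roughly half the guesses are wrong and each misprediction flushes the speculatively executed instructions:
<syntaxhighlight lang="c">
#include <stddef.h>

/* On sorted input the comparison below is false for a long run and
 * then true for a long run, so the predictor learns the pattern and
 * speculation hides the control hazard.  On random input it
 * mispredicts about half the time, and each misprediction discards
 * the speculatively executed instructions. */
long count_at_least(const int *x, size_t n, int threshold)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] >= threshold)  /* easy or hard to predict, depending on the data */
            count++;
    return count;
}
</syntaxhighlight>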
 
==Contrast with Throughput oriented processor architectures==
In contrast, a [[throughput]] oriented processor architecture is designed to maximize the amount of 'useful work' done in a significant window of time, where useful work refers to large calculations on a significant amount of data. They do this by parallelizing the workload so that many calculations can be performed simultaneously. The calculations may belong to a single task or to a limited number of multiple tasks. The total time required to complete one execution is significantly larger than that of a latency oriented processor architecture; however, the total time to complete a large set of calculations is significantly reduced. Latency is often sacrificed in order to achieve a higher throughput per cycle.<ref name=GarlandKirk/> As a result, a latency oriented processor may complete a single calculation significantly faster than a throughput-oriented processor; however, the throughput-oriented processor could be partway through hundreds of such computations by the time the latency oriented processor completes one calculation.<ref name=YanSohilin2016/>
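
The trade-off can be caricatured in software (an illustrative analogy, not an example from the cited sources; building it requires an OpenMP-capable compiler, e.g. GCC with -fopenmp). The serial C loop below delivers each individual result as early as possible, while the parallel version may start any single element later but completes the whole batch sooner:
<syntaxhighlight lang="c">
#include <stddef.h>

/* Latency-style: results are produced one at a time, each
 * available as early as possible after its predecessor. */
void scale_serial(float *y, const float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        y[i] = k * x[i];
}

/* Throughput-style: iterations are spread across many threads;
 * any one element may be computed later, but the whole batch
 * finishes sooner. */
void scale_parallel(float *y, const float *x, size_t n, float k)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = k * x[i];
}
</syntaxhighlight>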
 
Latency oriented processors expend a substantial chip area on sophisticated control structures like branch prediction, [[Operand forwarding|data forwarding]], [[re-order buffer]]s, large register files and caches in each processor. These structures help reduce operational latency and memory access time per instruction, and make results available as soon as possible. Throughput oriented architectures, on the other hand, usually have a multitude of processors with much smaller caches and simpler control logic. This helps to efficiently utilize the memory bandwidth and increase the total number of execution units on the same chip area.<ref name=GarlandKirk/>
 
[[Graphics processing unit|GPUs]] are a typical example of throughput oriented processor architectures.
==Notes==
{{notelist}}

==References==
{{Reflist}}
 
 
 
[[Category:Microprocessors]]