{{Short description|Computing paradigm to improve computational efficiency}}
{{Redirect|OOE||Ooe (disambiguation)}}
{{Use American English|date=January 2025}}
In [[computer engineering]], '''out-of-order execution''' (or more formally '''dynamic execution''') is a paradigm in which a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in the program, in order to make use of instruction cycles that would otherwise be wasted.
== History ==
Out-of-order execution is a restricted form of [[dataflow architecture|dataflow]] computation, which was a major research area in computer architecture in the 1970s and early 1980s.
=== Early use in supercomputers ===
The [[CDC 6600]] (1964) used a [[scoreboarding|scoreboard]] to let instructions execute out of order, but without register renaming it still stalled on false dependencies. About two years later, the [[IBM System/360 Model 91]] (1966) introduced [[register renaming]] with [[Tomasulo's algorithm]],<ref>{{citation |title=An Efficient Algorithm for Exploiting Multiple Arithmetic Units |journal=[[IBM Journal of Research and Development]] |volume=11 |issue=1 |pages=25–33 |date=1967 |author-first=Robert Marco |author-last=Tomasulo |author-link=Robert Marco Tomasulo |doi=10.1147/rd.111.0025 |url=https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |archive-url=https://web.archive.org/web/20180612141530/https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |url-status=dead |archive-date=2018-06-12 |citeseerx=10.1.1.639.7540|s2cid=8445049 }}</ref> which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction that writes into a register ''r<sub>n</sub>'' can be executed before an earlier instruction using the register ''r<sub>n</sub>'' is executed, by actually writing into an alternative (renamed) register ''alt-r<sub>n</sub>'', which is turned into the normal ''r<sub>n</sub>'' for all later instructions once the rename table is updated.
In the Model 91 the register renaming is implemented by a [[Operand forwarding|bypass]] termed ''Common Data Bus'' (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out of order.

=== Precise exceptions ===
To have [[precise exception]]s, the processor must be able to restore the architectural state that existed before a faulting instruction: even though instructions may execute out of order, their results must appear to be committed in program order.
In the 1980s many early [[Reduced instruction set computer|RISC]] microprocessors wrote results back to the registers out of order, resulting in imprecise exceptions. [[James E. Smith (engineer)|James E. Smith]] and Andrew Pleszkun showed how such machines can nevertheless provide precise exceptions by committing results to the architectural state strictly in program order, for example through a history buffer or a re-order buffer.
=== Decoupling ===
Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching.<ref>{{cite journal |last1=Smith |first1=James E. |author1-link=James E. Smith (engineer) |title=Decoupled Access/Execute Computer Architectures |journal=ACM Transactions on Computer Systems |date=November 1984 |volume=2 |issue=4 |pages=289–308 |doi=10.1145/357401.357403 |s2cid=13903321 |url=https://course.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?media=p289-smith.pdf}}</ref> He implemented those ideas in the [[Astronautics Corporation of America|Astronautics]] ZS-1 (1988), featuring a decoupling of the integer/load/store [[Instruction pipelining|pipeline]] from the floating-point pipeline, allowing the two pipelines to slip relative to one another.
=== Research comes to fruition ===
With the [[POWER1]] (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a separate [[reorder buffer]] that holds uncommitted result values.
=== Wide adoption ===
The first [[superscalar]] out-of-order microprocessors reached the mass market in the mid-1990s; influential early designs include the [[Pentium Pro]] (1995), the [[MIPS R10000]] (1996), and the [[PA-8000]] (1996).
The practically attainable [[instructions per cycle|per-cycle rate of execution]] rose steadily as out-of-order techniques matured over subsequent generations of designs.
Almost all processors for phones and other lower-end applications remained in-order until around 2010, when out-of-order cores such as the [[ARM Cortex-A9]] reached smartphones.
== Basic concept ==
=== Background ===
=== In-order processors ===
In earlier processors, the processing of instructions is performed in an [[instruction cycle]] normally consisting of the following steps:
# [[Instruction (computer science)|Instruction]] fetch.
# If input [[operand]]s are available (in processor registers, for instance), the instruction is dispatched to the appropriate [[functional unit]]. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the [[register file]].
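The stall described in step 2 can be illustrated with a toy cycle-count model. This is only a sketch in Python, not a description of real hardware; the program, register names, and latencies are hypothetical:

```python
# Toy in-order pipeline model (illustrative only; real hardware differs).
# Each instruction is (dest, sources, latency). An instruction cannot be
# dispatched until every source operand is available, and instructions
# dispatch strictly in program order, so one long-latency load stalls
# everything behind it -- even independent instructions.

def run_in_order(program):
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    for dest, sources, latency in program:
        # Stall until all input operands are available (step 2 above).
        start = max([cycle] + [ready.get(r, 0) for r in sources])
        ready[dest] = start + latency   # result written back after `latency`
        cycle = start + 1               # next instruction dispatches no earlier
    return max(ready.values())          # cycle when the last result is ready

# Hypothetical program: a slow load, a dependent add, and two
# instructions that do not depend on the load at all.
program = [
    ("r1", [], 10),      # load r1 (10-cycle memory latency)
    ("r2", ["r1"], 1),   # add r2 <- r1 (must wait for the load)
    ("r3", [], 1),       # independent, but still stalls behind r2
    ("r4", ["r3"], 1),   # depends only on r3
]
print(run_in_order(program))   # → 13
```

Even though the last two instructions need nothing from the load, the in-order dispatch rule makes them wait behind it, which is exactly the wasted time out-of-order execution recovers.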
=== Out-of-order processors ===
This new paradigm breaks up the processing of instructions into these steps:<ref>{{Cite journal |last1=González |first1=Antonio |last2=Latorre |first2=Fernando |last3=Magklis |first3=Grigorios |date=2011 |title=Processor Microarchitecture |url=https://link.springer.com/book/10.1007/978-3-031-01729-2 |journal=Synthesis Lectures on Computer Architecture |language=en |doi=10.1007/978-3-031-01729-2 |isbn=978-3-031-00601-2 |issn=1935-3235|url-access=subscription }}</ref>
# Instruction fetch.
# Instruction decoding.
# Instruction dispatch to an instruction queue (also called instruction buffer or [[reservation station]]s).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable.
The way the instructions are ordered in the original computer code is known as ''program order''; in the processor they are handled in ''data order'', the order in which the data (operands) become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output, so the processor appears to run the instructions in random order. The benefit of out-of-order processing grows as the [[instruction pipeline]] deepens and the speed difference between [[main memory]] (or [[cache memory|cache]]) and the processor widens.
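The contrast between program order and data order can be sketched with a toy issue model. This Python sketch is illustrative only: it assumes hypothetical latencies, unlimited issue width, no renaming, and that every source register is written by an earlier instruction in the program:

```python
# Toy "data order" issue model: each cycle, any not-yet-issued
# instruction whose operands are ready may issue, regardless of its
# position in program order. (Illustrative sketch; real hardware adds
# renaming, finite issue width, and structural hazards.)

def run_out_of_order(program):
    avail = {}                            # register -> cycle value is ready
    pending = list(range(len(program)))   # indices still waiting to issue
    cycle = 0
    while pending:
        issued = []
        for i in pending:
            dest, sources, latency = program[i]
            # Issue as soon as the data are available: data order.
            if all(avail.get(r, float("inf")) <= cycle for r in sources):
                avail[dest] = cycle + latency
                issued.append(i)
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return max(avail.values())            # cycle when the last result is ready

program = [
    ("r1", [], 10),      # slow load
    ("r2", ["r1"], 1),   # dependent on the load
    ("r3", [], 1),       # independent: issues immediately
    ("r4", ["r3"], 1),   # depends only on r3, issues the next cycle
]
print(run_out_of_order(program))   # → 11
```

Under the same assumptions, a strictly in-order machine would need 13 cycles for this program, because the two independent instructions would wait behind the stalled add; issuing in data order hides the load latency behind useful work.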
== Dispatch and issue decoupling allows out-of-order issue ==
One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage.
The [[Instruction cycle|fetch and decode stages]] are separated from the execute stage in a [[Pipeline (computing)|pipelined]] processor by using a [[Data buffer|buffer]]. The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high performance by exploiting the fine-grain [[parallel computing|parallelism]] between the two.<ref>{{cite journal |author-last=Smith |author-first1=J. E. |title=Decoupled access/execute computer architectures |journal= ACM Transactions on Computer Systems|date=1984 |volume=2 |issue=4 |pages=289–308 |citeseerx=10.1.1.127.4475 |doi=10.1145/357401.357403|s2cid=13903321 }}</ref> In doing so, it effectively hides all [[memory latency]] from the processor's perspective.

A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today favour a [[multi-threaded]] design approach.

Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well.<ref>{{cite journal |author-last1=Kurian |author-first1=L. |author-last2=Hulina |author-first2=P. T. |author-last3=Coraor |author-first3=L. D. |title=Memory latency effects in decoupled architectures |journal=[[IEEE Transactions on Computers]] |volume=43 |issue=10 |date=1994 |pages=1129–1139 |doi=10.1109/12.324539 |s2cid=6913858 |url=https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |archive-url=https://web.archive.org/web/20180612141055/https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |url-status=dead |archive-date=2018-06-12 }}</ref> Control-intensive code includes such things as nested branches that occur frequently in [[operating system]] [[kernel (operating system)|kernels]]. Decoupled architectures play an important role in scheduling in [[very long instruction word]] (VLIW) architectures.<ref>{{cite journal |author-first1=M. N. |author-last1=Dorojevets |author-first2=V. |author-last2=Oklobdzija |title=Multithreaded decoupled architecture |journal=International Journal of High Speed Computing |volume=7 |issue=3 |pages=465–480 |date=1995 |doi=10.1142/S0129053395000257 |url=https://www.researchgate.net/publication/220171480}}</ref>
== Execute and writeback decoupling allows program restart ==
The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order.
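The in-order commit rule can be sketched with a toy results queue (reorder buffer). This is a Python illustration only, with hypothetical register names and values; real hardware tracks much more state:

```python
# Toy reorder-buffer (ROB) commit sketch. Results may *complete* in any
# order, but they are *committed* to architectural state strictly in
# program order, which is what makes restart after an exception possible.

def commit(rob):
    """rob: entries in program order, each a dict with keys
    'dest', 'value', 'done', 'exception'. Commit from the head of the
    queue until an incomplete or faulting entry is reached; everything
    younger than a fault is simply discarded."""
    arch_state = {}
    for entry in rob:
        if not entry["done"]:
            break               # head not finished yet: stop committing
        if entry["exception"]:
            break               # precise exception: younger results dropped
        arch_state[entry["dest"]] = entry["value"]
    return arch_state

rob = [
    {"dest": "r1", "value": 5, "done": True, "exception": False},
    {"dest": "r2", "value": 7, "done": True, "exception": True},   # faults
    {"dest": "r3", "value": 9, "done": True, "exception": False},  # finished early, but discarded
]
print(commit(rob))   # → {'r1': 5}
```

Although the third instruction completed before the fault was handled, it never reaches architectural state, so the program can be restarted at the faulting instruction with a precise view of the registers.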
== Micro-architectural choices ==
* Are the dispatched instructions sent to a centralized queue or to multiple distributed queues?
:[[IBM]] [[PowerPC]] processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.
* Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
:Early Intel out-of-order processors use a results queue called a [[reorder buffer]],{{efn|Intel [[P6 (microarchitecture)|P6]] family microprocessors have both a reorder buffer (ROB) and a [[register renaming|register alias table]] (RAT). The ROB was motivated mainly by branch misprediction recovery. The Intel [[P6 (microarchitecture)|P6]] family is among the earliest OoOE microprocessors but was supplanted by the [[NetBurst]] architecture. Years later, NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues, and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the [[Intel Core (microarchitecture)|Core]] and [[Nehalem (microarchitecture)|Nehalem]] microarchitectures. The succeeding [[Sandy Bridge]], [[Ivy Bridge (microarchitecture)|Ivy Bridge]], and [[Haswell (microarchitecture)|Haswell]] microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the [[Alpha 21264|EV6]] and the [[Pentium 4|P4]] but with a somewhat shorter pipeline.<ref>{{cite web |author-last=Kanter |author-first=David |date=2010-09-25 |title=Intel's Sandy Bridge Microarchitecture |url=http://www.realworldtech.com/sandy-bridge/10/}}</ref><ref name="urlThe Haswell Front End - Intels Haswell Architecture Analyzed: Building a New PC and a New Intel">{{cite web |url=https://www.anandtech.com/show/6355/intels-haswell-architecture/6 |title=The Haswell Front End - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel }}</ref>}} while most later out-of-order processors use register maps.
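The register-map approach mentioned above can be sketched in a few lines. This Python sketch is illustrative only; the architectural/physical register names and the free-list policy are hypothetical, and real renamers also track mappings for recovery:

```python
# Toy register-renaming sketch: a register map (rename table) redirects
# each architectural destination to a fresh physical register, removing
# WAW/WAR (false) dependencies. (Illustrative; ignores freeing of
# physical registers and recovery on mispredict.)

def rename(program, num_arch_regs=4):
    # Initially, architectural register r<i> maps to physical p<i>.
    rename_map = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, sources in program:
        srcs = [rename_map[s] for s in sources]   # read current mappings
        new_phys = f"p{next_phys}"                # allocate a fresh register
        next_phys += 1
        rename_map[dest] = new_phys               # later readers see new_phys
        renamed.append((new_phys, srcs))
    return renamed

# WAW hazard: both instructions write r1, but after renaming they target
# different physical registers and can execute in either order.
program = [("r1", ["r2"]), ("r1", ["r3"])]
print(rename(program))   # → [('p4', ['p2']), ('p5', ['p3'])]
```

Because the two writes now go to distinct physical registers, the scheduler is free to execute them in data order; the rename map alone records which physical register holds the architecturally current value of ''r1''.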
== See also ==
{{Wikibooks | Microprocessor Design | Out Of Order Execution }}
* [[Replay system]]
* [[Shelving buffer]]
== Notes ==
{{Notelist}}
== References ==
{{Reflist}}