Content deleted Content added
start review: ce for clarity. WP:ASTONISH. rm unnec paren. |
m →Precise exceptions: Tag Bare URL PDFs using AutoWikiBrowser |
||
(40 intermediate revisions by 14 users not shown) | |||
Line 1:
{{Short description|Computing paradigm to improve computational efficiency}}
{{Redirect|OOE||Ooe (disambiguation)}}
{{Use American English|date=January 2025}}
In [[computer engineering]], '''out-of-order execution''' (or more formally '''dynamic execution''') is
== History ==
Line 8 ⟶ 9:
=== Early use in supercomputers ===
About two years later, the [[IBM System/360 Model 91]] (1966) introduced [[register renaming]] with [[Tomasulo's algorithm]],<ref>{{citation |title=An Efficient Algorithm for Exploiting Multiple Arithmetic Units |journal=[[IBM Journal of Research and Development]] |volume=11 |issue=1 |pages=25–33 |date=1967 |author-first=Robert Marco |author-last=Tomasulo |author-link=Robert Marco Tomasulo |doi=10.1147/rd.111.0025 |url=https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |archive-url=https://web.archive.org/web/20180612141530/https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |url-status=dead |archive-date=2018-06-12 |citeseerx=10.1.1.639.7540|s2cid=8445049 }}</ref> which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''r<sub>n</sub>'' can be executed before an earlier instruction using the register ''r<sub>n</sub>'' is executed, by actually writing into an alternative (renamed) register ''alt-r<sub>n</sub>'', which is turned into a normal register ''r<sub>n</sub>'' only when all the earlier instructions addressing ''r<sub>n</sub>'' have been executed, but until then ''r<sub>n</sub>'' is given for earlier instructions and ''alt-r<sub>n</sub>'' for later ones addressing ''r<sub>n</sub>''.
In the Model 91 the register renaming is implemented by a [[Operand forwarding|bypass]] termed ''Common Data Bus'' (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order in the same [[execution unit]], not just between the units like the 6600{{Disputed inline|date=July 2025}}. This is accomplished by [[reservation station]]s, from which instructions go to the execution unit when ready, as opposed to the FIFO queue{{Disputed inline|date=July 2025}} of each execution unit of the 6600. The Model 91 is also capable of
=== Precise exceptions ===
To have [[precise
In the 1980s many early [[
=== Decoupling ===
Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching.<ref>{{cite journal |last1=Smith |first1=James E. |author1-link=James E. Smith (engineer) |title=Decoupled Access/Execute Computer Architectures |journal=ACM Transactions on Computer Systems |date=November 1984 |volume=2 |issue=4 |pages=289–308 |doi=10.1145/357401.357403 |s2cid=13903321 |url=https://course.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?media=p289-smith.pdf}}</ref> He implemented those ideas in the [[Astronautics Corporation of America|Astronautics]] ZS-1 (1988), featuring a decoupling of the integer/load/store [[Instruction pipelining|pipeline]] from the floating-point pipeline, allowing inter-pipeline
=== Research comes to fruition ===
With the [[POWER1]] (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a
=== Wide adoption ===
The first [[
[[
The practically attainable [[instructions per cycle|per-cycle rate of execution]] rose
Almost all processors for phones and other lower-end applications remained in-order until
== Basic concept ==
=== Background ===
=== In-order processors ===
In earlier processors, the processing of instructions is performed in an [[instruction cycle]] normally consisting of the following steps:
# [[Instruction (computer science)|Instruction]]
# If input [[operand]]s are available (in processor registers, for instance), the instruction is dispatched to the appropriate [[functional unit]]. If one or more operands are unavailable during the current clock cycle (generally because they
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the [[register file]].
Line 48 ⟶ 49:
=== Out-of-order processors ===
This new paradigm breaks up the processing of instructions into these steps:<ref>{{Cite journal |last1=González |first1=Antonio |last2=Latorre |first2=Fernando |last3=Magklis |first3=Grigorios |date=2011 |title=Processor Microarchitecture |url=https://link.springer.com/book/10.1007/978-3-031-01729-2 |journal=Synthesis Lectures on Computer Architecture |language=en |doi=10.1007/978-3-031-01729-2 |isbn=978-3-031-00601-2 |issn=1935-3235|url-access=subscription }}</ref>
# Instruction fetch.
# Instruction decoding.
Line 58 ⟶ 59:
# Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
The key concept of
The way the instructions are ordered in the original computer code is known as ''program order'', in the processor they are handled in ''data order'', the order in which the data The benefit of
== Dispatch and issue decoupling allows out-of-order issue ==
One of the differences created by the new paradigm is the creation of queues that
The [[Instruction cycle|fetch and decode stages]] is separated from the execute stage in a [[Pipeline (computing)|pipelined]] processor by using a [[Data buffer|buffer]]. The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high
A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today
▲The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high-performance by exploiting the fine-grain [[parallel computing|parallelism]] between the two.<ref>{{cite journal |author-last=Smith |author-first1=J. E. |title=Decoupled access/execute computer architectures |journal= ACM Transactions on Computer Systems|date=1984 |volume=2 |issue=4 |pages=289–308 |citeseerx=10.1.1.127.4475 |doi=10.1145/357401.357403|s2cid=13903321 }}</ref> In doing so, it effectively hides all [[memory latency]] from the processor's perspective.
Decoupled architectures are generally thought of as not useful for general
▲A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today favour a [[multi-threaded]] design approach.
▲Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well.<ref>{{cite journal |author-last1=Kurian |author-first1=L. |author-last2=Hulina |author-first2=P. T. |author-last3=Coraor |author-first3=L. D. |title=Memory latency effects in decoupled architectures |journal=[[IEEE Transactions on Computers]] |volume=43 |issue=10 |date=1994 |pages=1129–1139 |doi=10.1109/12.324539 |s2cid=6913858 |url=https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |archive-url=https://web.archive.org/web/20180612141055/https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |url-status=dead |archive-date=2018-06-12 }}</ref> Control intensive code include such things as nested branches that occur frequently in [[operating system]] [[kernel (operating system)|kernels]]. Decoupled architectures play an important role in scheduling in [[very long instruction word]] (VLIW) architectures.<ref>{{cite journal |author-first1=M. N. |author-last1=Dorojevets |author-first2=V. |author-last2=Oklobdzija |title=Multithreaded decoupled architecture |journal=International Journal of High Speed Computing |volume=7 |issue=3 |pages=465–480 |date=1995 |doi=10.1142/S0129053395000257 |url=https://www.researchgate.net/publication/220171480}}</ref>
== Execute and writeback decoupling allows program restart ==
The queue for results is necessary to resolve issues such as branch mispredictions and exceptions
== Micro-architectural choices ==
:[[IBM
▲:[[IBM]] [[PowerPC]] processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.
* Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.▼
▲
:Early Intel out-of-order processors use a results queue called a [[reorder buffer]],{{efn|Intel [[P6 (microarchitecture)|P6]] family microprocessors have both a reorder buffer (ROB) and a [[register renaming|register alias table]] (RAT). The ROB was motivated mainly by branch misprediction recovery. The Intel [[P6 (microarchitecture)|P6]] family is among the earliest
== See also ==
{{Wikibooks | Microprocessor Design | Out Of Order Execution }}
* [[
* [[Replay system]]
* [[Shelving buffer]]
== Notes ==
{{Notelist}}
== References ==
|