{{Short description|Computing paradigm to improve computational efficiency}}
{{Redirect|OOE||Ooe (disambiguation)}}
{{Use American English|date=January 2025}}
In [[computer engineering]], '''out-of-order execution''' (or more formally '''dynamic execution''') is a paradigm in which a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in the program, in order to make use of instruction cycles that would otherwise be wasted.
== History ==
Out-of-order execution is a restricted form of [[dataflow architecture|dataflow]] computation, which was a major research area in computer architecture in the 1970s and early 1980s.
=== Early use in supercomputers ===
The [[CDC 6600]] (1964) used a [[scoreboarding|scoreboard]] to let instructions execute out of order, but without register renaming it still stalled on false dependencies. About two years later, the [[IBM System/360 Model 91]] (1966) introduced [[register renaming]] with [[Tomasulo's algorithm]],<ref>{{citation |title=An Efficient Algorithm for Exploiting Multiple Arithmetic Units |journal=[[IBM Journal of Research and Development]] |volume=11 |issue=1 |pages=25–33 |date=1967 |author-first=Robert Marco |author-last=Tomasulo |author-link=Robert Marco Tomasulo |doi=10.1147/rd.111.0025 |url=https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |archive-url=https://web.archive.org/web/20180612141530/https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf |url-status=dead |archive-date=2018-06-12 |citeseerx=10.1.1.639.7540|s2cid=8445049 }}</ref> which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction that writes into a register ''r<sub>n</sub>'' can be executed before an earlier instruction using the register ''r<sub>n</sub>'' is executed, by actually writing into an alternative (renamed) register ''alt-r<sub>n</sub>'', which is turned into the normal ''r<sub>n</sub>'' for all later instructions once the rename table is updated.
In the Model 91 the register renaming is implemented by a [[Operand forwarding|bypass]] termed ''Common Data Bus'' (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out of order.

=== Precise exceptions ===
To have [[precise exception]]s, the processor must be able to restore the architectural state that existed before a faulting instruction: even though instructions may execute out of order, their results must appear to be committed in program order.
In the 1980s many early [[Reduced instruction set computer|RISC]] microprocessors wrote results back to the registers out of order, resulting in imprecise exceptions. [[James E. Smith (engineer)|James E. Smith]] and Andrew Pleszkun showed how such machines can nevertheless provide precise exceptions by committing results to the architectural state strictly in program order, for example through a history buffer or a re-order buffer.
=== Decoupling ===
Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching.<ref>{{cite journal |last1=Smith |first1=James E. |author1-link=James E. Smith (engineer) |title=Decoupled Access/Execute Computer Architectures |journal=ACM Transactions on Computer Systems |date=November 1984 |volume=2 |issue=4 |pages=289–308 |doi=10.1145/357401.357403 |s2cid=13903321 |url=https://course.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?media=p289-smith.pdf}}</ref> He implemented those ideas in the [[Astronautics Corporation of America|Astronautics]] ZS-1 (1988), featuring a decoupling of the integer/load/store [[Instruction pipelining|pipeline]] from the floating-point pipeline, allowing the two pipelines to slip relative to one another.
=== Research comes to fruition ===
With the [[POWER1]] (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a separate [[reorder buffer]] that holds uncommitted result values.
=== Wide adoption ===
The first [[superscalar]] out-of-order microprocessors reached the mass market in the mid-1990s; influential early designs include the [[Pentium Pro]] (1995), the [[MIPS R10000]] (1996), and the [[PA-8000]] (1996).
The practically attainable [[instructions per cycle|per-cycle rate of execution]] rose steadily as out-of-order techniques matured over subsequent generations of designs.
Almost all processors for phones and other lower-end applications remained in-order until around 2010, when out-of-order cores such as the [[ARM Cortex-A9]] reached smartphones.
== Basic concept ==
=== Background ===
=== In-order processors ===
In earlier processors, the processing of instructions is performed in an [[instruction cycle]] normally consisting of the following steps:
# [[Instruction (computer science)|Instruction]] fetch.
# If input [[operand]]s are available (in processor registers, for instance), the instruction is dispatched to the appropriate [[functional unit]]. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the [[register file]].
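The stall described in step 2 can be illustrated with a toy cycle-count model. This is only a sketch in Python, not a description of real hardware; the program, register names, and latencies are hypothetical:

```python
# Toy in-order pipeline model (illustrative only; real hardware differs).
# Each instruction is (dest, sources, latency). An instruction cannot be
# dispatched until every source operand is available, and instructions
# dispatch strictly in program order, so one long-latency load stalls
# everything behind it -- even independent instructions.

def run_in_order(program):
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    for dest, sources, latency in program:
        # Stall until all input operands are available (step 2 above).
        start = max([cycle] + [ready.get(r, 0) for r in sources])
        ready[dest] = start + latency   # result written back after `latency`
        cycle = start + 1               # next instruction dispatches no earlier
    return max(ready.values())          # cycle when the last result is ready

# Hypothetical program: a slow load, a dependent add, and two
# instructions that do not depend on the load at all.
program = [
    ("r1", [], 10),      # load r1 (10-cycle memory latency)
    ("r2", ["r1"], 1),   # add r2 <- r1 (must wait for the load)
    ("r3", [], 1),       # independent, but still stalls behind r2
    ("r4", ["r3"], 1),   # depends only on r3
]
print(run_in_order(program))   # → 13
```

Even though the last two instructions need nothing from the load, the in-order dispatch rule makes them wait behind it, which is exactly the wasted time out-of-order execution recovers.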
=== Out-of-order processors ===
This new paradigm breaks up the processing of instructions into these steps:<ref>{{Cite journal |last1=González |first1=Antonio |last2=Latorre |first2=Fernando |last3=Magklis |first3=Grigorios |date=2011 |title=Processor Microarchitecture |url=https://link.springer.com/book/10.1007/978-3-031-01729-2 |journal=Synthesis Lectures on Computer Architecture |language=en |doi=10.1007/978-3-031-01729-2 |isbn=978-3-031-00601-2 |issn=1935-3235|url-access=subscription }}</ref>
# Instruction fetch.
# Instruction decoding.
# Instruction dispatch to an instruction queue (also called instruction buffer or [[reservation station]]s).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable.
The way the instructions are ordered in the original computer code is known as ''program order''; in the processor they are handled in ''data order'', the order in which the data (operands) become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output, so the processor appears to run the instructions in random order. The benefit of out-of-order processing grows as the [[instruction pipeline]] deepens and the speed difference between [[main memory]] (or [[cache memory|cache]]) and the processor widens.
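The contrast between program order and data order can be sketched with a toy issue model. This Python sketch is illustrative only: it assumes hypothetical latencies, unlimited issue width, no renaming, and that every source register is written by an earlier instruction in the program:

```python
# Toy "data order" issue model: each cycle, any not-yet-issued
# instruction whose operands are ready may issue, regardless of its
# position in program order. (Illustrative sketch; real hardware adds
# renaming, finite issue width, and structural hazards.)

def run_out_of_order(program):
    avail = {}                            # register -> cycle value is ready
    pending = list(range(len(program)))   # indices still waiting to issue
    cycle = 0
    while pending:
        issued = []
        for i in pending:
            dest, sources, latency = program[i]
            # Issue as soon as the data are available: data order.
            if all(avail.get(r, float("inf")) <= cycle for r in sources):
                avail[dest] = cycle + latency
                issued.append(i)
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return max(avail.values())            # cycle when the last result is ready

program = [
    ("r1", [], 10),      # slow load
    ("r2", ["r1"], 1),   # dependent on the load
    ("r3", [], 1),       # independent: issues immediately
    ("r4", ["r3"], 1),   # depends only on r3, issues the next cycle
]
print(run_out_of_order(program))   # → 11
```

Under the same assumptions, a strictly in-order machine would need 13 cycles for this program, because the two independent instructions would wait behind the stalled add; issuing in data order hides the load latency behind useful work.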
== Dispatch and issue decoupling allows out-of-order issue ==
One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage.
The [[Instruction cycle|fetch and decode stages]] are separated from the execute stage in a [[Pipeline (computing)|pipelined]] processor by using a [[Data buffer|buffer]]. The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high performance by exploiting the fine-grain [[parallel computing|parallelism]] between the two.<ref>{{cite journal |author-last=Smith |author-first1=J. E. |title=Decoupled access/execute computer architectures |journal= ACM Transactions on Computer Systems|date=1984 |volume=2 |issue=4 |pages=289–308 |citeseerx=10.1.1.127.4475 |doi=10.1145/357401.357403|s2cid=13903321 }}</ref> In doing so, it effectively hides all [[memory latency]] from the processor's perspective.

A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today favour a [[multi-threaded]] design approach.

Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well.<ref>{{cite journal |author-last1=Kurian |author-first1=L. |author-last2=Hulina |author-first2=P. T. |author-last3=Coraor |author-first3=L. D. |title=Memory latency effects in decoupled architectures |journal=[[IEEE Transactions on Computers]] |volume=43 |issue=10 |date=1994 |pages=1129–1139 |doi=10.1109/12.324539 |s2cid=6913858 |url=https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |archive-url=https://web.archive.org/web/20180612141055/https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |url-status=dead |archive-date=2018-06-12 }}</ref> Control-intensive code includes such things as nested branches that occur frequently in [[operating system]] [[kernel (operating system)|kernels]]. Decoupled architectures play an important role in scheduling in [[very long instruction word]] (VLIW) architectures.<ref>{{cite journal |author-first1=M. N. |author-last1=Dorojevets |author-first2=V. |author-last2=Oklobdzija |title=Multithreaded decoupled architecture |journal=International Journal of High Speed Computing |volume=7 |issue=3 |pages=465–480 |date=1995 |doi=10.1142/S0129053395000257 |url=https://www.researchgate.net/publication/220171480}}</ref>
== Execute and writeback decoupling allows program restart ==
The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order.
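The in-order commit rule can be sketched with a toy results queue (reorder buffer). This is a Python illustration only, with hypothetical register names and values; real hardware tracks much more state:

```python
# Toy reorder-buffer (ROB) commit sketch. Results may *complete* in any
# order, but they are *committed* to architectural state strictly in
# program order, which is what makes restart after an exception possible.

def commit(rob):
    """rob: entries in program order, each a dict with keys
    'dest', 'value', 'done', 'exception'. Commit from the head of the
    queue until an incomplete or faulting entry is reached; everything
    younger than a fault is simply discarded."""
    arch_state = {}
    for entry in rob:
        if not entry["done"]:
            break               # head not finished yet: stop committing
        if entry["exception"]:
            break               # precise exception: younger results dropped
        arch_state[entry["dest"]] = entry["value"]
    return arch_state

rob = [
    {"dest": "r1", "value": 5, "done": True, "exception": False},
    {"dest": "r2", "value": 7, "done": True, "exception": True},   # faults
    {"dest": "r3", "value": 9, "done": True, "exception": False},  # finished early, but discarded
]
print(commit(rob))   # → {'r1': 5}
```

Although the third instruction completed before the fault was handled, it never reaches architectural state, so the program can be restarted at the faulting instruction with a precise view of the registers.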
== Micro-architectural choices ==
* Are the dispatched instructions sent to a centralized queue or to multiple distributed queues?
:[[IBM]] [[PowerPC]] processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.
* Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
:Early Intel out-of-order processors use a results queue called a [[reorder buffer]],{{efn|Intel [[P6 (microarchitecture)|P6]] family microprocessors have both a reorder buffer (ROB) and a [[register renaming|register alias table]] (RAT). The ROB was motivated mainly by branch misprediction recovery. The Intel [[P6 (microarchitecture)|P6]] family is among the earliest OoOE microprocessors but was supplanted by the [[NetBurst]] architecture. Years later, NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues, and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the [[Intel Core (microarchitecture)|Core]] and [[Nehalem (microarchitecture)|Nehalem]] microarchitectures. The succeeding [[Sandy Bridge]], [[Ivy Bridge (microarchitecture)|Ivy Bridge]], and [[Haswell (microarchitecture)|Haswell]] microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the [[Alpha 21264|EV6]] and the [[Pentium 4|P4]] but with a somewhat shorter pipeline.<ref>{{cite web |author-last=Kanter |author-first=David |date=2010-09-25 |title=Intel's Sandy Bridge Microarchitecture |url=http://www.realworldtech.com/sandy-bridge/10/}}</ref><ref name="urlThe Haswell Front End - Intels Haswell Architecture Analyzed: Building a New PC and a New Intel">{{cite web |url=https://www.anandtech.com/show/6355/intels-haswell-architecture/6 |title=The Haswell Front End - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel }}</ref>}} while most later out-of-order processors use register maps.
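The register-map approach mentioned above can be sketched in a few lines. This Python sketch is illustrative only; the architectural/physical register names and the free-list policy are hypothetical, and real renamers also track mappings for recovery:

```python
# Toy register-renaming sketch: a register map (rename table) redirects
# each architectural destination to a fresh physical register, removing
# WAW/WAR (false) dependencies. (Illustrative; ignores freeing of
# physical registers and recovery on mispredict.)

def rename(program, num_arch_regs=4):
    # Initially, architectural register r<i> maps to physical p<i>.
    rename_map = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, sources in program:
        srcs = [rename_map[s] for s in sources]   # read current mappings
        new_phys = f"p{next_phys}"                # allocate a fresh register
        next_phys += 1
        rename_map[dest] = new_phys               # later readers see new_phys
        renamed.append((new_phys, srcs))
    return renamed

# WAW hazard: both instructions write r1, but after renaming they target
# different physical registers and can execute in either order.
program = [("r1", ["r2"]), ("r1", ["r3"])]
print(rename(program))   # → [('p4', ['p2']), ('p5', ['p3'])]
```

Because the two writes now go to distinct physical registers, the scheduler is free to execute them in data order; the rename map alone records which physical register holds the architecturally current value of ''r1''.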
== See also ==
{{Wikibooks | Microprocessor Design | Out Of Order Execution }}
* [[Replay system]]
* [[Shelving buffer]]
== Notes ==
{{Notelist}}
== References ==
{{Reflist}}