Out-of-order execution: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 04:27, 26 June 2025 edit Citation bot (talk \| contribs) Bots 5,869,758 edits Altered journal. Add: isbn, authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. \| Use this bot. Report bugs. \| Suggested by Headbomb \| #UCB_toolbar ← Previous edit		Latest revision as of 23:23, 11 August 2025 edit undo Folkezoft (talk \| contribs) Extended confirmed users 46,605 edits m →Precise exceptions: Tag Bare URL PDFs using AutoWikiBrowser Tag: AWB
(11 intermediate revisions by 5 users not shown)
Line 3: {{Use American English\|date=January 2025}} In [[computer engineering]], '''out-of-order execution''' (or more formally '''dynamic execution''') is an [[instruction scheduling]] paradigm used in high-performance [[central processing unit]]s to make use of [[instruction cycle]]s that would otherwise be wasted. In this paradigm, a processor executes [[Instruction (computing)\|instructions]] in an order governed by the availability of input data and execution units,<ref>{{cite book \|author-last=Kukunas \|author-first=Jim \|date=2015 \|title=Power and Performance: Software Analysis and Optimization \|url=https://books.google.com/books?id=X-WcBAAAQBAJ&pg=PA37 \|publisher=Morgan Kaufman \|page=37 \|isbn=9780128008140}}</ref> rather than by their original order in a program.<ref>{{cite web \|url=http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf \|title=Out-of-order execution \|date=2006 \|quote=don't wait for previous instructions to execute if this instruction does not depend on them \|access-date=2014-01-17 \|publisher=cs.washington.edu}}</ref><ref name="Regis High School 2011">{{cite web \| title=The Centennial Celebration \| website=Regis High School \| date=2011-03-14 \| url=https://www.regis.org/2014/multimedia/tomasulo.cfm \| access-date=2022-06-25\|quote=The algorithm "allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially" (also known as out of order execution).}}</ref> In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.<ref>{{cite web \|url=http://www.pcguide.com/ref/cpu/arch/int/featOOE-c.html \|url-status=dead \|archive-url=https://web.archive.org/web/20190218182056/http://www.pcguide.com/ref/cpu/arch/int/featOOE-c.html \|archive-date=2019-02-18 \|quote=This flexibility improves performance since it allows execution with less 'waiting' time. \|title=Out-of-order Execution \|publisher=~~pcguide~~The PC Guide \|first=Charles M.~~com~~ \|last=Kozierok \|date=April 17, 2001 \|access-date=2014-01-17}}</ref> == History == Line 9: === Early use in supercomputers === ~~The~~Arguably the first machine to use out-of-order execution ~~was~~is the [[CDC 6600]] (1964)~~, designed by [[James E. Thornton]]~~, which ~~uses~~used a [[Scoreboarding\|scoreboard]] to ~~avoid~~resolve conflicts. ItThe ~~permits~~6600 anhowever ~~instruction to execute if its source operand~~lacked [[Hazard_(~~read~~computer_architecture)#Write_after_write_(WAW)\|WAW ~~registers~~conflict]] ~~aren't~~handling, tochoosing ~~be written~~instead to bystall. ~~any~~This ~~unexecuted~~situation ~~earlier~~was ~~instruction~~termed ~~(true~~a ~~dependency)~~"First ~~and~~Order ~~the destination (write) register not be a register used~~Conflict" by ~~any unexecuted earlier instruction (false dependency)~~Thornton.<ref>{{harvtxt\|Thornton\|1970\|p=125}}</ref> ~~The~~ ~~6600~~Whilst ~~lacks~~it ~~the~~had ~~means to avoid~~both [[~~Pipeline~~RAW ~~stall\|stalling~~conflict]] anresolution ~~[[execution~~(termed ~~unit]]~~"Second onOrder ~~false~~Conflict"<ref>{{harvtxt\|Thornton\|1970\|p=126}}</ref>) ~~dependencies~~and ([[Hazard_(computer_architecture)#~~Write_after_write_~~Write_after_read_(~~WAW~~WAR)\|~~write~~WAR ~~after write~~conflict]] ~~(WAW) and [[write after read]]~~resolution (~~WAR) conflicts, respectively~~ termed ~~''first-order~~"Third ~~conflict'' and ''third-order conflict'' by~~Order Conflict"<ref>{{harvnb\|Thornton~~, who termed true dependencies ([[Hazard_(computer_architecture)#Read_after_write_(RAW)~~\|~~read after write]] (RAW)~~1970\|p=127}}</ref>) asall ~~second-order~~of ~~conflict)~~which ~~because~~is ~~each~~sufficient ~~address~~to ~~has only a single ___location referable by~~declare it. ~~The~~capable ~~WAW~~of isfull ~~worse~~out-of-order ~~than WAR for~~execution, the 6600, ~~because~~did ~~when~~not anhave ~~execution~~precise ~~unit~~exception ~~encounters~~handling. aAn ~~WAR,~~early ~~the~~and ~~other~~limited ~~execution~~form ~~units~~of ~~still~~Branch ~~receive~~prediction ~~and~~was ~~execute~~possible ~~instructions,~~as ~~but~~long ~~upon a WAW~~as the ~~assignment~~branch ~~of instructions~~was to ~~execution~~locations ~~units~~on ~~stops,~~what ~~and~~was ~~they~~termed ~~can~~the ~~not~~"Instruction ~~receive~~Stack" ~~any~~which ~~further~~was ~~instructions~~limited ~~until~~to ~~the~~within ~~WAW-causing~~a ~~instruction's~~depth ~~destination~~of ~~register~~seven ~~has~~words ~~been~~from ~~written to by~~the ~~earlier~~Program ~~instruction~~Counter.<ref>{{~~harvtxt~~harvnb\|Thornton\|1970\|p=~~125-127~~112,123}}</ref> About two years later, the [[IBM System/360 Model 91]] (1966) introduced [[register renaming]] with [[Tomasulo's algorithm]],<ref>{{citation \|title=An Efficient Algorithm for Exploiting Multiple Arithmetic Units \|journal=[[IBM Journal of Research and Development]] \|volume=11 \|issue=1 \|pages=25–33 \|date=1967 \|author-first=Robert Marco \|author-last=Tomasulo \|author-link=Robert Marco Tomasulo \|doi=10.1147/rd.111.0025 \|url=https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf \|archive-url=https://web.archive.org/web/20180612141530/https://pdfs.semanticscholar.org/8299/94a1340e5ecdb7fb24dad2332ccf8de0bb8b.pdf \|url-status=dead \|archive-date=2018-06-12 \|citeseerx=10.1.1.639.7540\|s2cid=8445049 }}</ref> which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''r<sub>n</sub>'' can be executed before an earlier instruction using the register ''r<sub>n</sub>'' is executed, by actually writing into an alternative (renamed) register ''alt-r<sub>n</sub>'', which is turned into a normal register ''r<sub>n</sub>'' only when all the earlier instructions addressing ''r<sub>n</sub>'' have been executed, but until then ''r<sub>n</sub>'' is given for earlier instructions and ''alt-r<sub>n</sub>'' for later ones addressing ''r<sub>n</sub>''. In the Model 91 the register renaming is implemented by a [[Operand forwarding\|bypass]] termed ''Common Data Bus'' (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order in the same [[execution unit]], not just between the units like the 6600{{Disputed inline\|date=July 2025}}. This is accomplished by [[reservation station]]s, from which instructions go to the execution unit when ready, as opposed to the FIFO queue{{Disputed inline\|date=July 2025}} of each execution unit of the 6600. The Model 91 is also capable of reordering loads and stores to execute before the preceding loads and stores,<ref name=zs1/> unlike the 6600, which only has a limited ability to move loads past loads, and stores past stores, but not loads past stores and stores past loads.<ref>{{harvtxt\|Thornton\|1970\|p=48-50}}</ref> Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The 91 and 6600 both also suffer from [[imprecise exception]]s, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers. === Precise exceptions === To have [[precise exception]]s, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by [[James E. Smith (engineer)\|James E. Smith]] and Andrew R. Pleszkun.<ref name="smith">{{cite journal \|last1=Smith \|first1=James E. \|last2=Pleszkun \|first2=Andrew R. \|author1-link=James E. Smith (engineer) \|title=Implementation of precise interrupts in pipelined processors \|journal=12th ISCA\|date=June 1985 \|url=https://dl.acm.org/doi/epdf/10.5555/327010.327125}}<br/>(Expanded version published in May 1988 as [https://www.cs.virginia.edu/~evans/greatworks/smith.pdf ''Implementing Precise Interrupts in Pipelined Processors''].)</ref> The [[CDC Cyber 205]] was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution.<ref>{{cite web \|last1=Moudgill \|first1=Mayan \|last2=Vassiliadis \|first2=Stamatis \|title=On Precise Interrupts \|page=18 \|date=January 1996 \|citeseerx=10.1.1.33.3304 \|url=https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3304&rep=rep1&type=pdf \|archive-url=https://web.archive.org/web/20221013035408/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3304&rep=rep1&type=pdf \|archive-date=13 October 2022 \|format=pdf}}</ref> However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions.<ref name="smith"/> Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to the [[Cray-1S]] would reduce the performance of executing the first 14 [[Livermore loops]] (unvectorized) by only 3%.<ref name="smith"/> Important academic research in this subject was led by [[Yale Patt]] with his [[HPSm]] simulator.<ref>{{cite book \|url=http://dl.acm.org/citation.cfm?id=17391 \|title=HPSm, a high performance restricted data flow architecture having minimal functionality \|work=ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture \|isbn=978-0-8186-0719-6 \|pages=297–306 \|date=1986 \|access-date=2013-12-06 \|author-first1=W. \|author-last1=Hwu \|author-first2=Yale N. \|author-last2=Patt \|author-link2=Yale Patt \|publisher=[[Association for Computing Machinery\|ACM]]}}</ref> In the 1980s many early [[RISC]] microprocessors, ~~like~~had out-of-order writeback to the registers, invariably resulting in imprecise exceptions. The [[Motorola 88100]], ~~had~~was one of the few early microprocessors that did not suffer from imprecise exceptions despite out-of-order ~~writeback~~writes, toalthough ~~the~~it ~~registers,~~did ~~resulting~~allow inboth precise and imprecise floating-point exceptions.<ref>http://www.bitsavers.org/components/motorola/88000/MC88100_RISC_Microprocessor_Users_Manual_2ed_1990.pdf {{Bare URL PDF\|date=August 2025}}</ref> Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600. === Decoupling === Line 33: The practically attainable [[instructions per cycle\|per-cycle rate of execution]] rose further as full out-of-order execution was further adopted by [[Silicon Graphics\|SGI]]/[[MIPS Technologies\|MIPS]] ([[R10000]]) and [[Hewlett-Packard\|HP]] [[PA-RISC]] ([[PA-8000]]) in 1996. The same year [[Cyrix 6x86]] and [[AMD K5]] brought advanced reordering techniques into mainstream personal computers. Since [[DEC Alpha]] gained out-of-order execution in 1998 ([[Alpha 21264]]), the top-performing out-of-order processor cores have been unmatched by in-order cores other than [[Hewlett-Packard\|HP]]/[[Intel]] [[Itanium 2]] and [[IBM POWER6]], though the latter had an out-of-order [[floating-point unit]].<ref>Le, Hung Q. et al. {{cite journal \|title=IBM POWER6 microarchitecture \|journal=IBM Journal of Research and Development \|date=November 2007 \|volume=51 \|issue=6 \|url=https://course.ece.cmu.edu/~ece742/f12/lib/exe/fetch.php?media=le_power6.pdf}}</ref> The other high-end in-order processors fell far behind, namely [[Sun Microsystems\|Sun]]'s [[UltraSPARC III]]/[[UltraSPARC IV\|IV]], and IBM's [[mainframe]]s which had lost the out-of-order execution capability for the second time, remaining in-order into the [[IBM z10\|z10]] generation. Later big in-order processors were focused on multithreaded performance, but eventually the [[SPARC T series]] and [[Xeon Phi]] changed to out-of-order execution in 2011 and 2016 respectively.{{cn\|reason=No mention of out-of-order execution in either linked article.\|date=July 2024}} Almost all processors for phones and other lower-end applications remained in-order until {{circa\|2010}}. First, [[Qualcomm]]'s [[Scorpion (processor)\|Scorpion]] (reordering distance of 32) shipped in [[Qualcomm Snapdragon\|Snapdragon]],<ref>{{cite web \|last1=Mallia \|first1=Lou \|title=Qualcomm High Performance Processor Core and Platform for Mobile Applications \|url=http://rtcgroup.com/arm/2007/presentations/253%20-%20ARM_DevCon_2007_Snapdragon_FINAL_20071004.pdf \|archive-url=https://web.archive.org/web/20131029193001/http://rtcgroup.com/arm/2007/presentations/253%20-%20ARM_DevCon_2007_Snapdragon_FINAL_20071004.pdf \|archive-date=29 October 2013}}</ref> and a bit later [[Arm (company)\|Arm]]'s [[ARM Cortex-A9\|A9]] succeeded [[ARM Cortex-A8\|A8]]. For low-end [[x86]] [[personal computer]]s in-order [[Bonnell microarchitecture]] in early [[Intel Atom]] processors were first challenged by [[AMD]]'s [[Bobcat microarchitecture]], and in 2013 were succeeded by an out-of-order [[Silvermont microarchitecture]].<ref>{{cite web \|url=http://www.anandtech.com/show/6936/intels-silvermont-architecture-revealed-getting-serious-about-mobile/2 \|archive-url=https://archive.today/20161222023104/http://www.anandtech.com/show/6936/intels-silvermont-architecture-revealed-getting-serious-about-mobile/2 \|url-status=dead \|archive-date=December 22, 2016 \|website=AnandTech \|title=Intel's Silvermont Architecture Revealed: Getting Serious About Mobile \|author=Anand Lal Shimpi \|date=2013-05-06}}</ref> Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in [[microcontroller]]s and [[embedded system]]s, as well as in phone-class cores such as Arm's [[ARM Cortex-A55\|A55]] and [[ARM Cortex-A510\|A510]] in [[big.LITTLE]] configurations. == Basic concept == Line 61: The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the processor avoids the stall that occurs in step 2 of the in-order processor when the instruction is not completely ready to be processed due to missing data. Out-of-order processors fill these ''slots'' in time with other instructions that ''are'' ready, then either reorder the results at the end to make it appear that the instructions were processed as normal., record and thus apply the original ''program order'', or commit in uninterruptible batches where order will not cause data corruption. The way the instructions are ordered in the original computer code is known as ''program order'', in the processor they are handled in ''data order'', the order in which the data becomes available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output. The benefit of out-of-order processing grows as the [[instruction pipeline]] deepens and the speed difference between [[main memory]] (or [[cache memory]]) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have theoretically processed a large number of instructions. Line 82 ⟶ 83: Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight. :Early Intel out-of-order processors use a results queue called a [[reorder buffer]],{{efn\|Intel [[P6 (microarchitecture)\|P6]] family microprocessors have both a reorder buffer (ROB) and a [[register renaming\|register alias table]] (RAT). The ROB was motivated mainly by branch misprediction recovery. The Intel [[P6 (microarchitecture)\|P6]] family is among the earliest out-of-order microprocessors but were supplanted by the [[NetBurst]] architecture. Years later, NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the [[Intel Core (microarchitecture)\|Core]] and [[Nehalem (microarchitecture)\|Nehalem]] microarchitectures.}} while most later out-of-order processors use register maps.{{efn\|The succeeding [[Sandy Bridge]], [[Ivy Bridge (microarchitecture)\|Ivy Bridge]], and [[Haswell (microarchitecture)\|Haswell]] microarchitectures are a departure from the reordering techniques used in P6 and employ reordering techniques from the [[Alpha 21264\|EV6]] and the [[Pentium 4\|P4]] but with a somewhat shorter pipeline.<ref>{{cite web \|author-last=Kanter \|author-first=David \|date=2010-09-25 \|title=Intel's Sandy Bridge Microarchitecture \|url=http://www.realworldtech.com/sandy-bridge/10/}}</ref><ref name="urlThe Haswell Front End - Intels Haswell Architecture Analyzed: Building a New PC and a New Intel">{{cite web \|url=https://www.anandtech.com/show/6355/intels-haswell-architecture/6 \|archive-url=https://web.archive.org/web/20121007163104/http://www.anandtech.com/show/6355/intels-haswell-architecture/6 \|url-status=dead \|archive-date=October 7, 2012 \|title=The Haswell Front End - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel }}</ref>}} == See also ==