Out-of-order execution: Difference between revisions

Content deleted Content added
section has sources
review: thin sourced. rm rep and misplaced. link improvements. tag engvar. grammar.
Line 1:
{{Short description|Computing paradigm to improve computational efficiency}}
{{Redirect|OOE||Ooe (disambiguation)}}
{{Use American English|date=January 2025}}
 
In [[computer engineering]], '''out-of-order execution''' (or more formally '''dynamic execution''') is a [[paradigm]] used in high-performance [[central processing unit]]s to make use of [[instruction cycle]]s that would otherwise be wasted. In this paradigm, a processor executes [[Instruction (computing)|instructions]] in an order governed by the availability of input data and execution units,<ref>{{cite book |author-last=Kukunas |author-first=Jim |date=2015 |title=Power and Performance: Software Analysis and Optimization |url=https://books.google.com/books?id=X-WcBAAAQBAJ&pg=PA37 |publisher=Morgan Kaufman |page=37 |isbn=9780128008140}}</ref> rather than by their original order in a program.<ref>{{cite web |url=http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf |title=Out-of-order execution |date=2006 |quote=don't wait for previous instructions to execute if this instruction does not depend on them |access-date=2014-01-17 |publisher=cs.washington.edu}}</ref><ref name="Regis High School 2011">{{cite web | title=The Centennial Celebration | website=Regis High School | date=2011-03-14 | url=https://www.regis.org/2014/multimedia/tomasulo.cfm | access-date=2022-06-25|quote=The algorithm "allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially" (also known as out of order execution).}}</ref> In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.<ref>{{cite web |url=http://www.pcguide.com/ref/cpu/arch/int/featOOE-c.html |quote=This flexibility improves performance since it allows execution with less 'waiting' time. |title=Out-of-order Execution |publisher=pcguide.com |access-date=2014-01-17}}</ref>
Line 17 ⟶ 18:
To have [[precise exception]]s, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by [[James E. Smith (engineer)|James E. Smith]] and Andrew R. Pleszkun.<ref name="smith">{{cite journal |last1=Smith |first1=James E. |last2=Pleszkun |first2=Andrew R. |author1-link=James E. Smith (engineer) |title=Implementation of precise interrupts in pipelined processors |journal=12th ISCA|date=June 1985 |url=https://dl.acm.org/doi/epdf/10.5555/327010.327125}}<br/>(Expanded version published in May 1988 as [https://www.cs.virginia.edu/~evans/greatworks/smith.pdf ''Implementing Precise Interrupts in Pipelined Processors''].)</ref> The [[CDC Cyber 205]] was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution.<ref>{{cite web |last1=Moudgill |first1=Mayan |last2=Vassiliadis |first2=Stamatis |title=On Precise Interrupts |page=18 |date=January 1996 |citeseerx=10.1.1.33.3304 |url=https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3304&rep=rep1&type=pdf |archive-url=https://web.archive.org/web/20221013035408/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3304&rep=rep1&type=pdf |archive-date=13 October 2022 |format=pdf}}</ref> However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions.<ref name="smith"/> Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to the [[Cray-1S]] would reduce the performance of executing the first 14 [[Livermore loops]] (unvectorized) by only 3%.<ref name="smith"/> Important academic research in this subject was led by [[Yale Patt]] with his [[HPSm]] simulator.<ref>{{cite book |url=http://dl.acm.org/citation.cfm?id=17391 |title=HPSm, a high performance restricted data flow architecture having minimal functionality |work=ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture |isbn=978-0-8186-0719-6 |pages=297–306 |date=1986 |access-date=2013-12-06 |author-first1=W. |author-last1=Hwu |author-first2=Yale N. |author-last2=Patt |author-link2=Yale Patt |publisher=[[Association for Computing Machinery|ACM]]}}</ref>
 
In the 1980s many early [[RISC]] microprocessors, like the [[Motorola 88100]], had out-of-order writeback to the registers, resulting in imprecise exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.
 
=== Decoupling ===
Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching.<ref>{{cite journal |last1=Smith |first1=James E. |author1-link=James E. Smith (engineer) |title=Decoupled Access/Execute Computer Architectures |journal=ACM Transactions on Computer Systems |date=November 1984 |volume=2 |issue=4 |pages=289–308 |doi=10.1145/357401.357403 |s2cid=13903321 |url=https://course.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?media=p289-smith.pdf}}</ref> He implemented those ideas in the [[Astronautics Corporation of America|Astronautics]] ZS-1 (1988), featuring a decoupling of the integer/load/store [[Instruction pipelining|pipeline]] from the floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper, he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits [[virtual memory]]. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable [[FIFO (computing and electronics)|queues]] that effectively performed limited register renaming.<ref name=zs1>{{cite journal |last1=Smith |first1=James E. |author1-link=James E. Smith (engineer) |title=Dynamic Instruction Scheduling and the Astronautics ZS-1 |journal=Computer |url=https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=00030730.pdf |pages=21–35 |doi=10.1109/2.30730 |date=July 1989 |volume=22 |issue=7 |s2cid=329170 }}</ref> A similar decoupled architecture had been used a bit earlier in the Culler 7.<ref>{{cite web |last1=Smotherman |first1=Mark |title=Culler-7 |url=https://people.computing.clemson.edu/~mark/culler.html |website=[[Clemson University]]}}</ref> The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.
 
=== Research comes to fruition ===
Line 62 ⟶ 63:
Out-of-order processors fill these ''slots'' in time with other instructions that ''are'' ready, then reorder the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order'', in the processor they are handled in ''data order'', the order in which the data becomes available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output.
 
The benefit of out-of-order processing grows as the [[instruction pipeline]] deepens and the speed difference between [[main memory]] (or [[cache memory]]) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have theoretically processed a large number of instructions.<!--[[User:Kvng/RTH]]-->
 
== Dispatch and issue decoupling allows out-of-order issue ==
One of the differences created by the new paradigm is the creation of queues that allowsallow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly [[Lockstep (computing)|lock-step]], pipelined fashion.
 
The [[Instruction cycle|fetch and decode stages]] is separated from the execute stage in a [[Pipeline (computing)|pipelined]] processor by using a [[Data buffer|buffer]]. The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high- performance by exploiting the fine-grain [[parallel computing|parallelism]] between the two.<ref>{{cite journal |author-last=Smith |author-first1=J. E. |title=Decoupled access/execute computer architectures |journal= ACM Transactions on Computer Systems|date=1984 |volume=2 |issue=4 |pages=289–308 |citeseerx=10.1.1.127.4475 |doi=10.1145/357401.357403|s2cid=13903321 }}</ref> In doing so, it effectively hides all [[memory latency]] from the processor's perspective.
The instructions of the program may not be run in the originally specified order, as long as the end result is correct. It separates the [[Instruction cycle|fetch and decode stages]] from the execute stage in a [[Pipeline (computing)|pipelined]] processor by using a [[Data buffer|buffer]].
 
A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today favourfavor a [[multi-threaded]] design approach.
The buffer's purpose is to partition the [[Memory access pattern|memory access]] and execute functions in a computer program and achieve high-performance by exploiting the fine-grain [[parallel computing|parallelism]] between the two.<ref>{{cite journal |author-last=Smith |author-first1=J. E. |title=Decoupled access/execute computer architectures |journal= ACM Transactions on Computer Systems|date=1984 |volume=2 |issue=4 |pages=289–308 |citeseerx=10.1.1.127.4475 |doi=10.1145/357401.357403|s2cid=13903321 }}</ref> In doing so, it effectively hides all [[memory latency]] from the processor's perspective.
 
Decoupled architectures are generally thought of as not useful for general -purpose computing as they do not handle control -intensive code well.<ref>{{cite journal |author-last1=Kurian |author-first1=L. |author-last2=Hulina |author-first2=P. T. |author-last3=Coraor |author-first3=L. D. |title=Memory latency effects in decoupled architectures |journal=[[IEEE Transactions on Computers]] |volume=43 |issue=10 |date=1994 |pages=1129–1139 |doi=10.1109/12.324539 |s2cid=6913858 |url=https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |archive-url=https://web.archive.org/web/20180612141055/https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |url-status=dead |archive-date=2018-06-12 }}</ref> Control intensive code include such things as nested branches that occur frequently in [[operating system]] [[kernel (operating system)|kernels]]s. Decoupled architectures play an important role in scheduling in [[very long instruction word]] (VLIW) architectures.<ref>{{cite journal |author-first1=M. N. |author-last1=Dorojevets |author-first2=V. |author-last2=Oklobdzija |title=Multithreaded decoupled architecture |journal=International Journal of High Speed Computing |volume=7 |issue=3 |pages=465–480 |date=1995 |doi=10.1142/S0129053395000257 |url=https://www.researchgate.net/publication/220171480}}</ref><!--[[User:Kvng/RTH]]-->
A larger buffer can, in theory, increase throughput. However, if the processor has a [[branch misprediction]] then the entire buffer may need to be flushed, wasting a lot of [[clock cycle]]s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more [[Die (integrated circuit)|die]] space. For this reason processor designers today favour a [[multi-threaded]] design approach.
 
Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well.<ref>{{cite journal |author-last1=Kurian |author-first1=L. |author-last2=Hulina |author-first2=P. T. |author-last3=Coraor |author-first3=L. D. |title=Memory latency effects in decoupled architectures |journal=[[IEEE Transactions on Computers]] |volume=43 |issue=10 |date=1994 |pages=1129–1139 |doi=10.1109/12.324539 |s2cid=6913858 |url=https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |archive-url=https://web.archive.org/web/20180612141055/https://pdfs.semanticscholar.org/6aa3/18cce633e3c2d86d970d6d50104d818d9407.pdf |url-status=dead |archive-date=2018-06-12 }}</ref> Control intensive code include such things as nested branches that occur frequently in [[operating system]] [[kernel (operating system)|kernels]]. Decoupled architectures play an important role in scheduling in [[very long instruction word]] (VLIW) architectures.<ref>{{cite journal |author-first1=M. N. |author-last1=Dorojevets |author-first2=V. |author-last2=Oklobdzija |title=Multithreaded decoupled architecture |journal=International Journal of High Speed Computing |volume=7 |issue=3 |pages=465–480 |date=1995 |doi=10.1142/S0129053395000257 |url=https://www.researchgate.net/publication/220171480}}</ref>
 
To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called [[register renaming]] is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.
 
== Execute and writeback decoupling allows program restart ==