Tomasulo's algorithm: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 02:17, 4 January 2018 edit Paul Foxworthy (talk \| contribs) Extended confirmed users 2,731 edits m →Applications and legacy: Links to Intel and x86-64 ← Previous edit		Latest revision as of 07:23, 10 August 2024 edit undo 142.113.140.146 (talk) →Applications and legacy: c/e Tag: Visual edit
(43 intermediate revisions by 30 users not shown)
Line 1: {{Short description\|Computer architecture hardware algorithm}} '''~~Tomasulo’s~~Tomasulo's algorithm''' is a [[computer architecture]] hardware [[algorithm]] for dynamic scheduling of instructions that allows [[out-of-order execution]] and enables more efficient use of multiple execution units. It was developed by [[Robert Tomasulo]] at [[IBM]] in 1967 and was first implemented in the [[IBM System/360 Model 91]]’s [[floating point unit]].<ref name="tomasulo"/>{{rp}} The major innovations of Tomasulo’s algorithm include [[register renaming]] in hardware, [[reservation station~~\|reservation stations~~]]s for all execution units, and a common data bus (CDB) on which computed values broadcast to all reservation stations that may need them. These developments allow for improved [[parallel computing\|parallel execution]] of instructions that would otherwise stall under the use of [[scoreboarding]] or other earlier algorithms. [[Robert Tomasulo]] received the [[~~Eckert-Mauchly~~Eckert–Mauchly Award]] in 1997 for his work on the algorithm.<ref>{{cite web \| url =http://awards.acm.org/award_winners/tomasulo_4008463.cfm \| title =Robert Tomasulo – Award Winner \| website =ACM Awards \| publisher =ACM \| ~~accessdate~~access-date =8 December 2014 }}</ref> ==Implementation concepts== [[File:Tomasulo Architecture.png\|thumb\|right\|700px\|Tomasulo's floating point unit]] The following are the concepts necessary to the implementation of Tomasulo's ~~Algorithm~~algorithm: ===Common data bus=== The Common Data Bus (CDB) connects reservation stations directly to functional units. According to Tomasulo it "preserves precedence while encouraging concurrency".<ref name="tomasulo"> {{cite journal \| last1 = Tomasulo \| first1 = Robert M.Marco \| author1-link = Robert Marco Tomasulo \| title = An Efficient Algorithm for Exploiting Multiple Arithmetic Units \| journal = IBM Journal of Research and Development Line 28 ⟶ 30: \| date = Jan 1967 \| issn = 0018-8646 \| doi = 10.1147/rd.111.0025 \| s2cid = 8445049 }}</ref>{{rp\|33}} This has two important effects: #Functional units can access the result of any operation without involving a floating-point-register, allowing multiple units waiting on a result to proceed without waiting to resolve contention for access to register file read ports. #Hazard Detection and control execution are distributed. The reservation stations control when an instruction can execute, rather than a single dedicated hazard unit. ===Instruction order=== {{Disputed section\|Imprecise exceptions\|date=December 2023}} Instructions are issued sequentially so that the effects of a sequence of instructions, such as [[exception (computing)\|exception]]s raised by these instructions, occur in the same order as they would on an in-order processor, regardless of the fact that they are being executed out-of-order (i.e. non-sequentially). ===Register renaming=== Tomasulo's ~~Algorithm~~algorithm uses [[register renaming]] to correctly perform out-of-order execution. All general-purpose and reservation station registers hold either a real value or a placeholder value. If a real value is unavailable to a destination register during the issue stage, a placeholder value is initially used. The placeholder value is a tag indicating which reservation station will produce the real value. When the unit finishes and broadcasts the result on the CDB, the placeholder will be replaced with the real value. Each functional unit has a single reservation station. Reservation stations hold information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real. ===Exceptions=== Practically speaking, there may be exceptions for which not enough status information about an exception is available, in which case the processor may raise a special exception, called an "'''imprecise" exception'''. Imprecise exceptions cannot occur in ~~non-~~[[Out-of-order execution\|~~OoOE~~in-order]] implementations, as processor state is changed only in program order (see ~~[[Classic_RISC_pipeline#Exceptions~~{{slink\|Classic RISC ~~Pipeline~~ pipeline\|Exceptions]]}}). Programs that experience "'''precise" exceptions''', where the specific instruction that took the exception can be determined, can restart or re-execute at the point of the exception. However, those that experience "imprecise" exceptions generally cannot restart or re-execute, as the system cannot determine the specific instruction that took the exception. ==Instruction lifecycle== The three stages listed below are the stages through which each instruction passes from the time it is issued to the time its execution is complete. ===~~Register legend~~Legend=== RS - Reservation Status RegisterStat - Register Status; contains information about the registers. regs[x] - Value of register x Mem[A] - Value of memory at address A rd - destination register number rs, rt - source registration numbers imm - sign extended immediate field r - reservation station or buffer that the instruction is assigned to ====Reservation Station Fields==== Op - represents the operation being performed on operands Line 55 ⟶ 71: A - used to hold the memory address information for a load or store Busy - 1 if occupied, 0 if not occupied Qi - (Only register unit) the reservation station whose result should be stored in this register (if blank or 0, no values are destined for this register)▼ ====Register Status Fields==== ▲Qi - ~~(Only register unit)~~ the reservation station whose result should be stored in this register (if blank or 0, no values are destined for this register) ===Stage 1: issue=== Line 65 ⟶ 83: Otherwise, we can assume the operands are not in the registers, and so use virtual values. The functional unit must calculate the real value to keep track of the functional units that produce the operand. ~~====Pseudocode====~~ {\|class="wikitable" \|}+ Pseudocode<ref name="hennessy">▼ {{cite book▼ \| last1 = Hennessy \| first1 = John L.▼ \| first2 = David A.\| last2 = Patterson▼ \| title = [[Computer Architecture: A Quantitative Approach]]▼ \| ___location = Waltham, MA▼ \| publisher = [[Elsevier]]▼ \| date = 2012▼ \| isbn = 978-0123838728 \| title-link = Computer Architecture: A Quantitative Approach }} </ref>{{rp\|180}}▼ \|- !Instruction state Line 72 ⟶ 100: !Action or bookkeeping \|- \|FP operation ~~\|Issue~~ \|Station {{mono\|r}} empty \|<~~source~~syntaxhighlight lang="c"> if (RegisterStat[rs].Qi¦0) { RS[r].Qj ← RegisterStat[rs].Qi } else { Line 86 ⟶ 114: } else { RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0; } RS[r].Busy ← yes; RegisterStat[rd].QQi ← r; </syntaxhighlight> ~~</source>~~ \|- \|Load or Store \|Buffer {{mono\|r}} empty \|<~~source~~syntaxhighlight lang="c"> if (RegisterStat[rs].Qi¦0) { RS[r].Qj ← RegisterStat[rs].Qi; Line 104 ⟶ 133: RS[r].A ← imm; RS[r].Busy ← yes; </syntaxhighlight> ~~</source>~~ \|- \|Load only \| \|<~~source~~syntaxhighlight lang="c"> RegisterStat[rt].Qi ← r; </syntaxhighlight> ~~</source>~~ \|- \|Store only \| \|<~~source~~syntaxhighlight lang="c"> if (RegisterStat[rt].Qi¦0) { RS[r].Qk ← RegisterStat[rsrt].Qi; } else { Line 122 ⟶ 151: RS[r].Qk ← 0 }; </syntaxhighlight> ~~</source>~~ \|} ▲\|} <ref name="hennessy"> ▲{{cite book ▲ \| last1 = Hennessy \| first1 = John L. ▲ \| first2 = David A.\| last2 = Patterson ▲ \| title = [[Computer Architecture: A Quantitative Approach]] ▲ \| ___location = Waltham, MA ▲ \| publisher = [[Elsevier]] ▲ \| date = 2012 ~~\| isbn = 978-0123838728 }}~~ ▲</ref>{{rp\|180}} [[File:Example of Tomasulo's Algorithm.gif\|thumb\|Example of Tomasulo's ~~Algorithm~~algorithm<ref>{{cite web \|url=http://courses.cs.washington.edu/courses/csep548/06au/lectures/tomasulo.pdf \|title=CSE P548 - Tomasulo \|date= 2006 \|website=washington.edu \|publisher=Washington University \|~~accessdate~~access-date=8 December 2014}}</ref>]] ===Stage 2: execute=== In the execute stage, the instruction operations are carried out. Instructions are delayed in this step until all of their operands are available, eliminating RAW hazards. Program correctness is maintained through effective address calculation to prevent hazards through memory. If one or more of the operands is not yet available then: wait for operand to become available on the CDB. Line 147 ⟶ 167: Else, the instruction is an [[arithmetic logic unit]] (ALU) operation then: execute the instruction at the corresponding functional unit ~~====Pseudocode====~~ {\|class="wikitable" \|}+ Pseudocode<ref name="hennessy" /> {{rp\|180}}▼ \|- !Instruction state Line 155 ⟶ 175: \|- \|FP operation \| <~~source~~syntaxhighlight lang="ctext"> (RS[r].Qj = 0) and (RS[r].Qk = 0) </syntaxhighlight> ~~</source>~~ \| ~~<source lang="c">~~ Compute result: operands are in Vj and Vk ~~</source>~~ \|- \|Load/store step 1 \|<code>RS[r].Qj = 0</code> & r is head of load-store queue \|<~~source~~syntaxhighlight lang="ctext"> RS[r].A ← RS[r].Vj + RS[r].A; </syntaxhighlight> ~~</source>~~ \|- \|Load step 2 \|Load step 1 complete \| ~~\|<source lang="c">~~ Read from <code>Mem[RS[r].A]</code> \|} ~~</source>~~ ▲\|} <ref name="hennessy" /> {{rp\|180}} ===Stage 3: write result=== Line 182 ⟶ 199: * Else, if the instruction was a store then: write the data to memory during this step ~~====Pseudocode====~~ {\|class="wikitable" \|}+ Pseudocode<ref name="hennessy" /> {{rp\|180}}▼ \|- !Instruction state Line 190 ⟶ 207: \|- \|FP operation or load \|Execution complete at {{mono\|r}} & CDB available \|<~~source~~syntaxhighlight lang="c"> ∀x(if (RegisterStat[x].Qi = r) { regs[x] ← result; Line 205 ⟶ 222: }); RS[r].Busy ← no; </syntaxhighlight> ~~</source>~~ \|- \|Store \|Execution complete at {{mono\|1=r & RS[r].Qk = 0}} \|<~~source~~syntaxhighlight lang="c"> Mem[RS[r].A] ← RS[r].Vk; RS[r].Busy ← no; </syntaxhighlight> ~~</source>~~ \|} ▲\|}<ref name="hennessy" /> {{rp\|180}} ==Algorithm improvements== Line 220 ⟶ 237: Reservation stations take on the responsibility of waiting for operands in the presence of data dependencies and other inconsistencies such as varying storage access time and circuit speeds, thus freeing up the functional units. This improvement overcomes long floating point delays and memory accesses. In particular the algorithm is more tolerant of cache misses. Additionally, programmers are freed from implementing optimized code. This is a result of the common data bus and reservation station working together to preserve dependencies as well as encouraging concurrency.<ref name="tomasulo"/>{{rp\|33}} By tracking operands for instructions in the reservation stations and register renaming in hardware the algorithm minimizes [[Read after write (Hazard)\|read-after-write]] (RAW) and eliminates [[Write after write (hazard)#~~Write_after_write_~~Write after write .28WAW.29\|write-after-write]] (WAW) and [[Write after read (hazard)#~~Write_after_read_~~Write after read .28WAR.29\|Write-after-Read]] (WAR) [[computer architecture]] [[hazard (computer architecture)\|hazard]]s. This improves performance by reducing wasted time that would otherwise be required for stalls.<ref name="tomasulo"/>{{rp\|33}} An equally important improvement in the algorithm is the design is not limited to a specific pipeline structure. This improvement allows the algorithm to be more widely adopted by [[multiple issue\|multiple-issue]] processors. Additionally, the algorithm is easily extended to enable branch speculation.<ref name="hennessy" /> {{rp\|182}} == Applications and legacy == Tomasulo's algorithm, ~~outside~~was ofimplemented ~~IBM,~~in ~~was~~the ~~unused~~System/360 ~~for~~Model ~~several~~91 ~~years~~architecture. ~~after~~Outside ~~its~~of ~~implementation~~IBM, init ~~the~~went ~~System/360~~unused ~~Model~~for 91several ~~architecture~~years. However, it saw a vast increase in usage during the 1990s for 3 reasons: # Once caches became commonplace, ~~Tomasulo's~~the algorithm's ability to maintain concurrency during unpredictable load times caused by cache misses became valuable in processors. # Dynamic scheduling and ~~the~~ branch speculation ~~that~~from the algorithm enables ~~helped~~improved ~~performances~~performance as processors issued more and more instructions. # Proliferation of mass-market software meant that programmers would not want to compile for a specific pipeline structure. The algorithm can function with any pipeline architecture and thus software requires few architecture-specific modifications.<ref name="hennessy" /> {{rp\|183}} Many modern processors implement dynamic scheduling schemes that are ~~derivative~~variants of ~~Tomasulo’s~~Tomasulo's original algorithm, including popular [[Intel]] [[x86-64]] chips.<ref name="intel" >{{Cite report \| date = September 2014 \| title = Intel® 64 and IA-32 Architectures Software ~~Developer’s~~Developer's Manual \| url = http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html \| publisher = Intel \| ~~accessdate~~access-date = 8 December 2014 }}</ref>{{Failed verification\|date=February 2017\|reason=The word Tomasulo isn't even mentioned?}}<ref name="adusan">{{cite web\|last1=Yoga\|first1=Adarsh\|title=Differences between Tomasulo's algorithm and dynamic scheduling in Intel Core microarchitecture\|url=http://adusan.blogspot.com.au/2010/11/differences-between-tomasulos-algorithm.html\|website=The boozier\|~~accessdate~~access-date=4 April 2016}}</ref> == See also == * [[Re-order buffer]] (ROB) * [[Instruction -level parallelism]] (ILP) * [[Out-of-order execution]] == External links ==▼ * [http://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/tomasulo.html Dynamic Scheduling - Tomasulo's Algorithm]▼ * [http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html HASE Java applet simulation of the Tomasulo's Algorithm]▼ == References == {{Reflist}} == Further reading == * {{cite web \|title=Pipelined and Out-of-Order Execution \|author-first=John J. G. \|author-last=Savard \|date=2018 \|orig-year=2014 \|work=quadibloc \|url=http://www.quadibloc.com/comp/cp07.htm \|access-date=2018-07-16 \|url-status=live \|archive-url=https://web.archive.org/web/20180703004239/http://www.quadibloc.com/comp/cp07.htm \|archive-date=2018-07-03}} ▲== External links == ▲* [{{webarchive\|url=https://web.archive.org/web/20171225215322/http://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/tomasulo.html \|title=Dynamic Scheduling - Tomasulo's Algorithm]\|date=December 25, 2017}} ▲* [http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html HASE Java applet simulation of the Tomasulo's ~~Algorithm~~algorithm] [[Category:Algorithms]]