Classic RISC pipeline

{{Short description|Instruction pipeline}}
{{Use American English|date = March 2019}}
{{More footnotes|date=December 2012}}
 
In the [[history of computing hardware|history of computer hardware]], some early [[reduced instruction set computer]] [[central processing unit]]s (RISC CPUs) used a very similar architectural solution, now called a '''classic RISC pipeline'''. Those CPUs were: [[MIPS architecture|MIPS]], [[SPARC]], Motorola [[Motorola 88000|88000]], and later the notional CPU [[DLX]] invented for education.
 
Each of these classic scalar RISC designs fetches and tries to execute [[Instructions per cycle|one instruction per cycle]]. The main common concept of each design is a five-stage execution [[instruction pipeline]]. During operation, each pipeline stage works on one instruction at a time. Each of these stages consists of a set of [[flip-flop (electronics)|flip-flops]] to hold state, and [[combinational logic]] that operates on the outputs of those flip-flops.
 
==The classic five-stage RISC pipeline==
The five stages are instruction fetch, instruction decode, execute, memory access, and writeback.
 
===Instruction fetch===
The instructions reside in memory that takes one cycle to read. This memory can be dedicated SRAM, or an Instruction [[Cache (computing)|Cache]]. The term "latency" is often used in computer science and means the time from when an operation starts until it completes. Thus, instruction fetch has a latency of one [[clock cycle]] (if using single-cycle SRAM, or if the instruction was in the cache). During the [[Instruction fetch|Instruction Fetch]] stage, a 32-bit instruction is fetched from the instruction memory.
 
The [[program counter]], or PC, is a register that holds the address that is presented to the instruction memory. The address is presented to instruction memory at the start of a cycle. Then, during the cycle, the instruction is read out of instruction memory, and at the same time, a calculation is done to determine the next PC. The next PC is calculated by incrementing the PC by 4, and by choosing whether to take that as the next PC or to take the result of a branch/jump calculation as the next PC. Note that in classic RISC, all instructions have the same length. (This is one thing that separates RISC from CISC.<ref name="Patterson1981">{{cite web |first=David |last=Patterson |title=RISC I: A Reduced Instruction Set VLSI Computer |series=ISCA '81 |date=12 May 1981 |pages=443–457 |url=https://dl.acm.org/doi/10.5555/800052.801895}}</ref>) In the original RISC designs, the size of an instruction is 4 bytes, so the fetch stage always adds 4 to the instruction address, but does not use PC + 4 for the case of a taken branch, jump, or exception (see '''delayed branches''', below). (Note that some modern machines use more complicated and accurate algorithms ([[branch prediction]] and [[branch target predictor|branch target prediction]]) to guess the next instruction address.)
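
This next-PC choice amounts to a simple selection, sketched below in C (a minimal sketch; the signal names are illustrative, and a real design computes this with combinational logic rather than software):
<syntaxhighlight lang="c">
#include <stdint.h>

/* Fetch-stage next-PC selection (illustrative names). */
uint32_t next_pc(uint32_t pc,
                 int take_branch, uint32_t branch_target,
                 int take_exception, uint32_t exception_vector)
{
    if (take_exception)       /* exceptions override normal sequencing */
        return exception_vector;
    if (take_branch)          /* a resolved, taken branch or a jump */
        return branch_target;
    return pc + 4;            /* every classic RISC instruction is 4 bytes */
}
</syntaxhighlight>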
 
===Instruction decode===
Another thing that separates the first RISC machines from earlier CISC machines is that RISC has no [[microcode]].<ref name="Patterson1981"/> In the case of CISC micro-coded instructions, once fetched from the instruction cache, the instruction bits are shifted down the pipeline, where simple combinational logic in each pipeline stage produces control signals for the datapath directly from the instruction bits. In those CISC designs, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding is that more instruction bits have to be used to specify what the instruction does. That leaves fewer bits for things like register indices.
 
All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, the indexes of these two registers are identified within the instruction, and the indexes are presented to the register memory as the address. Thus the two registers named are read from the [[register file]]. In the MIPS design, the register file has 32 entries.
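
In the MIPS encoding, the register indexes sit at fixed bit positions, so extracting them needs almost no logic. A minimal C sketch (the function and variable names are illustrative; the bit positions are the MIPS R-type fields):
<syntaxhighlight lang="c">
#include <stdint.h>

/* Extract the register indexes from a 32-bit MIPS instruction. */
void decode_register_indexes(uint32_t insn,
                             unsigned *rs, unsigned *rt, unsigned *rd)
{
    *rs = (insn >> 21) & 0x1f;  /* first source register        */
    *rt = (insn >> 16) & 0x1f;  /* second source register       */
    *rd = (insn >> 11) & 0x1f;  /* destination (R-type formats) */
}
</syntaxhighlight>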
 
At the same time the register file is read, instruction issue logic in this stage determines if the pipeline is ready to execute the instruction in this stage. If not, the issue logic causes both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the input flip-flops do not accept new bits, thus no new calculations take place during that cycle.
 
If the instruction decoded is a branch or jump, the target address of the branch or jump is computed in parallel with reading the register file. The branch condition is computed in the following cycle (after the register file is read), and if the branch is taken or if the instruction is a jump, the PC predictor in the first stage is assigned the branch target, rather than the incremented PC that has been computed. Some architectures made use of the [[Arithmetic logic unit|ALU]] in the Execute stage, at the cost of slightly decreased instruction throughput.
 
The decode stage ended up with quite a lot of hardware: MIPS has the possibility of branching if two registers are equal, so a 32-bit-wide AND tree runs in series after the register file read, making a very long critical path through this stage (which means fewer cycles per second). Also, the branch target computation generally required a 16-bit add and a 14-bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mis-predict penalty. Since branches were very often taken (and thus mis-predicted), it was very important to keep this penalty low.
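
Both decode-stage computations can be sketched in C (a minimal sketch assuming MIPS-style encodings; the names are illustrative):
<syntaxhighlight lang="c">
#include <stdint.h>

/* Branch target: PC + 4 plus the sign-extended 16-bit offset,
   shifted left 2 because instructions are word-aligned. */
uint32_t branch_target(uint32_t pc, uint32_t insn)
{
    int32_t offset = (int32_t)(int16_t)(insn & 0xffff); /* sign-extend */
    return (pc + 4) + ((uint32_t)offset << 2);
}

/* Branch-if-equal condition: in hardware, this equality compare is
   the 32-bit-wide AND tree described above. */
int beq_taken(uint32_t reg_a, uint32_t reg_b)
{
    return reg_a == reg_b;
}
</syntaxhighlight>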
 
===Execute===
The Execute stage is where the actual computation occurs. Typically this stage consists of an ALU, and also a bit shifter. It may also include a multi-cycle multiplier and divider.
 
The ALU is responsible for performing Boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or if an overflow occurred.
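
The status-bit computation can be sketched in C (a minimal sketch; the struct and names are illustrative, not from any particular implementation):
<syntaxhighlight lang="c">
#include <stdint.h>

typedef struct {
    uint32_t result;
    int zero;       /* set when the result is 0         */
    int overflow;   /* set when a signed add overflowed */
} alu_out;

/* Execute-stage addition with the status bits described above. */
alu_out alu_add(uint32_t a, uint32_t b)
{
    alu_out o;
    o.result   = a + b;              /* wraps modulo 2^32 */
    o.zero     = (o.result == 0);
    /* Signed overflow: both inputs have the same sign, and the
       result's sign differs from it. */
    o.overflow = (int)((~(a ^ b) & (a ^ o.result)) >> 31);
    return o;
}
</syntaxhighlight>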
 
The bit shifter is responsible for shifts and rotations.
Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:
 
* Register-Register Operation (Single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments are fed to a simple [[Arithmetic logic unit|ALU]], which generates the result by the end of the execute stage.
* Memory Reference (Two-cycle latency): All loads from memory. During the execute stage, the ALU adds the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
* [[Cycles per instruction|Multi-cycle Instructions]] (Many-cycle latency): Integer multiply and divide and all [[floating-point]] operations. During the execute stage, the operands to these operations are fed to the multi-cycle multiply/divide unit. The rest of the pipeline is free to continue execution while the multiply/divide unit does its work. To avoid complicating the writeback stage and issue logic, multi-cycle instructions write their results to a separate set of registers, as sketched below.
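
On MIPS, for example, the separate result registers are HI and LO. A minimal C sketch of this arrangement (the structure and names are illustrative):
<syntaxhighlight lang="c">
#include <stdint.h>

typedef struct { uint32_t hi, lo; } muldiv_unit;

/* A multi-cycle signed multiply deposits its 64-bit product into
   the separate HI/LO registers rather than the general register
   file, so the single register-file write port is left alone. */
void mult(muldiv_unit *mu, int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    mu->hi = (uint32_t)((uint64_t)product >> 32);
    mu->lo = (uint32_t)product;
}
</syntaxhighlight>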
===Memory access===
 
If data memory needs to be accessed, it is done in this stage.
 
During this stage, single-cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both one- and two-cycle instructions always write their results in the same stage of the pipeline so that just one write port to the register file can be used, and it is always available.
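
A minimal C sketch of this stage's behavior for the two latency classes (the names and structure are illustrative):
<syntaxhighlight lang="c">
#include <stdint.h>

typedef struct {
    int is_load;          /* 1 for a load, 0 for a register-register op */
    uint32_t alu_result;  /* for a load, this is the computed address   */
} mem_input;

/* Loads read the data memory; every other result simply rides
   through unchanged, so writeback sees one result per instruction. */
uint32_t mem_stage(mem_input in, const uint32_t *data_mem)
{
    if (in.is_load)
        return data_mem[in.alu_result / 4];  /* word-aligned read */
    return in.alu_result;
}
</syntaxhighlight>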
===Writeback===
 
During this stage, both single-cycle and two-cycle instructions write their results into the register file.
Note that two different stages are accessing the register file at the same time—the decode stage is reading two source registers, at the same time that the writeback stage is writing a previous instruction's destination register.
On real silicon, this can be a hazard (see below for more on hazards). That is because one of the source registers being read in decode might be the same as the destination register being written in writeback. When that happens, the same memory cells in the register file are being both read and written at the same time. On silicon, many implementations of memory cells will not operate correctly when read and written at the same time.
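
One common fix is to bypass the value being written directly to the read port when the indexes match, so decode sees the new value even though the underlying cells are busy being written. A minimal C sketch of such an internally bypassed read (the names are illustrative):
<syntaxhighlight lang="c">
#include <stdint.h>

typedef struct { uint32_t regs[32]; } regfile;

/* Read one source register, forwarding the in-flight writeback
   value when it targets the same register this cycle. */
uint32_t regfile_read(const regfile *rf, unsigned src,
                      int wb_write, unsigned wb_dest, uint32_t wb_value)
{
    if (wb_write && wb_dest == src && src != 0)  /* MIPS r0 is always 0 */
        return wb_value;                         /* internal bypass     */
    return rf->regs[src];
}
</syntaxhighlight>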
 
==Hazards==
 
Suppose the CPU is executing the following piece of code:
<syntaxhighlight lang="nasm">
SUB r3,r4 -> r10 ; Writes r3 - r4 to r10
AND r10,r3 -> r11 ; Writes r10 & r3 to r11
</syntaxhighlight>
The instruction fetch and decode stages send the second instruction one cycle after the first, so the two instructions flow down the pipeline one stage apart.
 
 
However, consider the following instructions:
<syntaxhighlight lang="nasm">
LD adr -> r10 ; Loads the word at address adr into r10
AND r10,r3 -> r11 ; Writes r10 & r3 to r11
</syntaxhighlight>
The data read from the address <code>adr</code> is not present in the data cache until after the Memory Access stage of the <code>LD</code> instruction. By this time, the <code>AND</code> instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the <code>AND</code> instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are '''stalled''': they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the <code>LD</code> and <code>AND</code> instructions.
 
This NOP is termed a pipeline ''[[bubble (computing)|bubble]]'' since it floats in the pipeline, like an air bubble in a water pipe, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a '''pipeline interlock'''.
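
The interlock check itself is small. A minimal C sketch of the decode-stage test for this load-use hazard (the field names are illustrative):
<syntaxhighlight lang="c">
/* Stall when the instruction now in execute is a load whose
   destination matches either source register being decoded. */
typedef struct {
    int is_load;
    unsigned dest;
} ex_stage_state;

int must_stall(ex_stage_state ex, unsigned src1, unsigned src2)
{
    return ex.is_load && (ex.dest == src1 || ex.dest == src2);
}
</syntaxhighlight>
When <code>must_stall</code> is true, the fetch and decode flip-flops hold their state and a bubble is sent down the pipeline.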
 
Control hazards are caused by conditional and unconditional branching. There are four schemes to solve this problem:
* Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle's opportunity to finish an instruction is lost.
* Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind.
* [[Branch delay slot|Branch Delay Slot]]: Always fetch the instruction after the branch from the instruction cache; whether that instruction is executed when the branch is taken depends on the design of the delayed branch and the branch conditions. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty for those branches into which the compiler could not schedule the branch delay slot. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.
* [[Branch Prediction]]: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.
 
Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new ___location happens ''after'' the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.
 
Delayed branches have been criticized{{By whom|date=May 2012}} as a poor short-term choice in ISA design:
==Exceptions==
 
Suppose a 32-bit RISC processes an ADD instruction that adds two large numbers, and the result does not fit in 32 bits. What happens?
 
The simplest solution, provided by most architectures, is wrapping arithmetic. Numbers greater than the maximum possible encoded value have their most significant bits chopped off until they fit. In the usual integer number system, 3000000000+3000000000=6000000000. With unsigned 32-bit wrapping arithmetic, 3000000000+3000000000=1705032704 (6000000000 mod 2^32). This may not seem terribly useful. The largest benefit of wrapping arithmetic is that every operation has a well-defined result.
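
This behavior is easy to demonstrate, since unsigned arithmetic in C is defined to wrap the same way:
<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 3000000000u;
    uint32_t b = 3000000000u;
    /* Unsigned 32-bit addition wraps modulo 2^32, so this
       prints 1705032704 rather than 6000000000. */
    printf("%u\n", (unsigned)(a + b));
    return 0;
}
</syntaxhighlight>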
But the programmer, especially if programming in a language supporting [[large numbers|large integers]] (e.g. [[Lisp (programming language)|Lisp]] or [[Scheme (programming language)|Scheme]]), may not want wrapping arithmetic. Some architectures (e.g. MIPS) define special addition operations that branch to special locations on overflow, rather than wrapping the result. Software at the target ___location is responsible for fixing the problem. This special branch is called an exception. Exceptions differ from regular branches in that the target address is not specified by the instruction itself, and the branch decision is dependent on the outcome of the instruction.
 
The most common kind of software-visible exception on one of the classic RISC machines is a [[Translation lookaside buffer#TLB-miss handling|''TLB miss'']] (see [[virtual memory]]).
 
Exceptions are different from branches and jumps, because those other control flow changes are resolved in the decode stage. Exceptions are resolved in the writeback stage. When an exception is detected, the following instructions (earlier in the pipeline) are marked as invalid, and as they flow to the end of the pipe their results are discarded. The program counter is set to the address of a special exception handler, and special registers are written with the exception ___location and cause.
 
To make it easy (and fast) for the software to fix the problem and restart the program, the CPU must take a precise exception. A precise exception means that all instructions up to the excepting instruction have been executed, and the excepting instruction and everything afterwards have not been executed.
==Cache miss handling==
 
Occasionally, either the data or instruction cache does not contain a required datum or instruction. In these cases, the CPU must suspend operation until the cache can be filled with the necessary data, and then must resume execution. The problem of filling the cache with the required data (and potentially writing back to memory the evicted cache line) is not specific to the pipeline organization, and is not discussed here.
 
There are two strategies to handle the suspend/resume problem. The first is a global stall signal. This signal, when activated, prevents instructions from advancing down the pipeline, generally by gating off the clock to the flip-flops at the start of each stage. The disadvantage of this strategy is that there are a large number of flip-flops, so the global stall signal takes a long time to propagate. Since the machine generally has to stall in the same cycle that it identifies the condition requiring the stall, the stall signal becomes a speed-limiting critical path.
 
Another strategy to handle suspend/resume is to reuse the exception logic. The machine takes an exception on the offending instruction, and all further instructions are invalidated. When the cache has been filled with the necessary data, the instruction that caused the cache miss restarts. To expedite data cache miss handling, the instruction can be restarted so that its access cycle happens one cycle after the data cache is filled.
 
== See also ==
 
* [[Iron law of processor performance]]
 
== References ==
{{Reflist}}
{{refbegin}}
* {{cite book |first1=John L. |last1=Hennessy |first2=David A. |last2=Patterson |title=Computer Architecture, A Quantitative Approach |publisher=Morgan Kaufmann |edition=5th |year=2011 |isbn=978-0123838728 |url=https://books.google.com/books?id=v3-1hVwHnHwC&printsec=frontcover}}
{{refend}}