Revision as of 17:32, 11 May 2020 by DannyS712 bot (Task 70: Update syntaxhighlight tags - remove use of deprecated <source> tags), compared with revision as of 11:43, 16 May 2020 by Kshalle (Changed from past tense "was" to definite "is" to remove unnecessary confusion. These 5 stage pipelines still exist. Mentally difficult to parse the "was" form. Added important details about hazards to demystify them. Removed incorrect "PC predictor")
In the [[history of computing hardware\|history of computer hardware]], some early [[reduced instruction set computer]] [[central processing unit]]s (RISC CPUs) used a very similar architectural solution, now called a '''classic RISC pipeline'''. Those CPUs were: [[MIPS architecture\|MIPS]], [[SPARC]], Motorola [[Motorola 88000\|88000]], and later the notional CPU [[DLX]] invented for education.

Each of these classic scalar RISC designs fetches and tries to execute [[Instructions per cycle\|one instruction per cycle]]. The main common concept of each design is a five-stage execution [[instruction pipeline]]. During operation, each pipeline stage works on one instruction at a time. Each of these stages consists of a set of [[flip-flop (electronics)\|flip-flops]] to hold state, and [[combinational logic]] that operates on the outputs of those flip-flops.

==The classic five stage RISC pipeline==

===Instruction fetch===
The instructions reside in memory that takes one cycle to read. This memory can be dedicated SRAM, or an Instruction [[Cache (computing)\|Cache]]. The term "latency" is often used in computer science, and means the time from when an operation starts until it completes. Thus, instruction fetch has a latency of one [[clock cycle]] (if using single cycle SRAM or if the instruction is in the cache). During the [[Instruction fetch\|Instruction Fetch]] stage, a 32-bit instruction is fetched from the instruction memory. The [[Program Counter]], or PC, is a register that holds the address that is presented to the instruction memory.
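As a rough illustration of the fetch stage, the sketch below models the instruction memory as a Python dictionary keyed by byte address and applies the next-PC rule this section describes: increment by 4 unless a taken branch or jump supplies a target. The names <code>fetch</code> and <code>next_pc</code>, the <code>redirect</code> flag, and the dictionary model are assumptions made for illustration only; they are not taken from any particular design.

```python
INSTRUCTION_BYTES = 4      # classic RISC: every instruction is 4 bytes long
ADDRESS_MASK = 0xFFFFFFFF  # assumption: 32-bit addresses


def fetch(instruction_memory, pc):
    """Read the 32-bit instruction word stored at byte address pc."""
    return instruction_memory[pc]


def next_pc(pc, redirect=False, target=0):
    """Choose the next PC: PC + 4, or the target of a taken branch or jump."""
    incremented = (pc + INSTRUCTION_BYTES) & ADDRESS_MASK
    return (target & ADDRESS_MASK) if redirect else incremented


# Hypothetical two-word program; the instruction words are placeholders.
program = {0x100: 0x012A4020, 0x104: 0x00000000}
```

In hardware both steps happen during the same cycle; the sketch separates them into two functions only for readability.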
At the start of a cycle, the address is presented to instruction memory. Then during the cycle, the instruction is read out of instruction memory, and at the same time, a calculation is done to determine the next PC. The next PC is calculated by incrementing the PC by 4, and by choosing whether to take that as the next PC or alternatively to take the result of a branch / jump calculation as the next PC. Note that in classic RISC, all instructions have the same length. (This is one thing that separates RISC from CISC.<ref name="risc1">{{cite paper \|first=David \|last=Patterson\| title=RISC I: A Reduced Instruction Set VLSI Computer \|url=https://dl.acm.org/doi/10.5555/800052.801895}}</ref>) In the original RISC designs, the size of an instruction is 4 bytes, so the fetch stage always adds 4 to the instruction address, but does not use PC + 4 in the case of a taken branch, jump, or exception (see '''delayed branches''', below). (Note that some modern machines use more complicated algorithms ([[branch prediction]] and [[branch target predictor\|branch target prediction]]) to guess the next instruction address.)

===Instruction decode===
Another thing that separates the first RISC machines from earlier CISC machines is that RISC has no [[microcode]].<ref name="risc1" />
Instead, once an instruction is fetched from the instruction memory, its bits are shifted down the pipeline, where simple combinational logic in each pipeline stage produces control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding is that more instruction bits have to be used to specify what the instruction does, which leaves fewer bits for things like register indices.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, the indexes of these two registers are identified within the instruction, and the indexes are presented to the register memory as the address. Thus the two registers named are read from the [[register file]]. In the MIPS design, the register file has 32 entries.

At the same time the register file is read, instruction issue logic in this stage determines if the pipeline is ready to execute the instruction in this stage. If not, the issue logic causes both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the input flip-flops do not accept new bits, thus no new calculations take place during that cycle.

If the instruction decoded is a branch or jump, the target address of the branch or jump is computed in parallel with reading the register file.
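The decode-stage work described above (picking two register indexes out of the instruction, reading the register file, and computing a branch target in parallel) can be sketched as follows. The sketch assumes the MIPS field layout, with the first source index in bits 25:21, the second in bits 20:16, and a 16-bit branch offset in the low bits that is counted in instructions relative to PC + 4; none of these bit positions is stated in the text above. The function names are hypothetical.

```python
def decode_sources(instruction):
    """Extract the two 5-bit source register indexes (assumed MIPS layout)."""
    rs = (instruction >> 21) & 0x1F
    rt = (instruction >> 16) & 0x1F
    return rs, rt


def read_registers(register_file, instruction):
    """Use the two indexes as addresses into a 32-entry register file."""
    rs, rt = decode_sources(instruction)
    return register_file[rs], register_file[rt]


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, instruction):
    """Compute the branch target as PC + 4 + (signed offset * 4)."""
    offset_bytes = sign_extend16(instruction) << 2
    return (pc + 4 + offset_bytes) & 0xFFFFFFFF
```

For example, the word <code>0x012A4020</code> names registers 9 and 10 as its sources under this layout, so <code>decode_sources(0x012A4020)</code> returns <code>(9, 10)</code>.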
The branch condition is computed in the following cycle (after the register file is read), and if the branch is taken or if the instruction is a jump, the PC in the first stage is assigned the branch target, rather than the incremented PC that has been computed. Some architectures make use of the [[Arithmetic logic unit\|ALU]] in the Execute stage, at the cost of slightly decreased instruction throughput.

The decode stage ends up with quite a lot of hardware: MIPS has the possibility of branching if two registers are equal, so a 32-bit-wide AND tree runs in series after the register file read, making a very long critical path through this stage (which lowers the achievable clock frequency). Also, the branch target computation generally requires a 16-bit add and a 14-bit incrementer. Resolving the branch in the decode stage makes it possible to have just a single-cycle branch mis-predict penalty. Since branches are very often taken (and thus mis-predicted), it is very important to keep this penalty low.

===Execute===

===Writeback===
During this stage, both single cycle and two cycle instructions write their results into the register file. Note that two different stages are accessing the register file at the same time: the decode stage is reading two source registers while the writeback stage is writing a previous instruction's destination register. On real silicon, this can be a hazard (see below for more on hazards), because one of the source registers being read in decode might be the same as the destination register being written in writeback. When that happens, the same memory cells in the register file are being both read and written at the same time. On silicon, many implementations of memory cells will not operate correctly when read and written at the same time.

==Hazards==
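The register-file conflict described in the writeback discussion above amounts to a simple comparison: the index being written matches one of the two indexes being read in the same cycle. A minimal sketch of that check follows; the function name and the <code>write_enable</code> flag are assumptions for illustration, and the sketch only detects the conflict without saying how a real design resolves it.

```python
def regfile_conflict(read_a, read_b, write_index, write_enable):
    """Return True when decode reads a register that writeback is writing.

    read_a, read_b: register indexes read by the decode stage this cycle.
    write_index:    register index written by the writeback stage this cycle.
    write_enable:   False when the instruction in writeback writes no register.
    """
    return bool(write_enable) and write_index in (read_a, read_b)
```

With decode reading registers 9 and 10, a writeback to register 10 in the same cycle is reported as a conflict, while a writeback to register 8 is not.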

Classic RISC pipeline: Difference between revisions