首页 lecture05

lecture05

举报
开通vip

lecture05 Asanovic/Devadas Spring 2002 6.823 Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring 2002Processor Performance 6.823 Equation Time = Instructions * Cyc...

lecture05
Asanovic/Devadas Spring 2002 6.823 Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring 2002Processor Performance 6.823 Equation Time = Instructions * Cycles * Time Program Program Instruction Cycle �‰ Instructions per program depends on source code, compiler technology, and ISA �‰ Microcoded DLX from last lecture had cycles per instruction (CPI) of around 7 minimum �‰ Time per cycle for microcoded DLX fixed by microcode cycle time — mostly ROM access + next µPC select logic Asanovic/Devadas Spring 2002 6.823 Pipelined DLX To pipeline DLX: �‰ First build unpipelined DLX with CPI=1 �‰ Next, add pipeline registers to reduce cycle time while maintaining CPI=1 Asanovic/Devadas Spring 2002 6.823 A Simple Memory Model WriteEnable Address ReadData WriteData MAGIC RAM Clock Reads and writes are always completed in one cycle �ƒ a Read can be done any time (i.e. combinational) �ƒ a Write is performed at the rising clock edge if it is enabled ⇒ the write address, data, and enable must be stable at the clock edge Asanovic/Devadas Spring 2002 6.823 Datapath for ALU Instructions OpCode rd1 rs1 rs2 rd2 inst<25:21> inst<20:16> GPRs z ALU inst<15:0> inst<31:26> <5:0> ExtSel OpSel ALU Control Imm Ext BSrc clk inst<15:11> RegWrite ws wd we RegDst 0x4 Add clk addr inst Inst. Memory PC 0x4 Add clk addr inst Inst. Memory PC rf2 / rf3 Reg / Imm 6 5 5 5 5 6 rf3 ← (rf1) func (rf2) opcode rf1 rf2 immediate rf2 ← (rf1) op immediate 0 rf1 rf2 func 0 rf3 Asanovic/DevadasDatapath for Memory Spring 2002 6.823 Instructions Should program and data memory be separate? Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory at some level the two memories have to be the same Princeton style: the same (von Neumann’s influence) - A Load or Store instruction requires accessing the memory more than once during its execution Asanovic/DevadasLoad/Store Instructions: Spring 2002 6.823 Harvard-Style Datapath RegWrite MemWrite 0x4 Add clk addr inst Inst. Memory PC WBSrc ALU / Mem “base” disp ALU Control z ALU clk addr wdata rdataData Memory we clk rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext OpCode RegDst ExtSel OpSel BSrc 6 5 5 16 addressing mode (rf1) + displacementopcode rf1 rf2 ent displacem 31 26 25 21 20 16 15 0 rf1 is the base register rf2 is the destination of a Load or the source for a Store Asanovic/Devadas Spring 2002 Memory Hierarchy c.2002 6.823 Desktop & Small Server Proc I$ D$ 0.5-2ns 2-3 clk 8~64KB On-chip Caches L2 <10ns 5~15 clk 0.25-2MB SRAM/ eDRAM Interleaved Banks of DRAM Hard DiskOff-chip L3 Cache < 25ns 15~50 clk 1~8MB ~150ns 100~300 clk 64M~1GB ~10ms seek time ~107 clk 20~100GB Our memory model is a good approximation of the hierarchical memory system when we hit in the on-chip cache Asanovic/Devadas Spring 2002Conditional Branches 6.823 PCSrc ( ~j / j ) RegWrite MemWrite WBSrc 0x4 clk RegDst BSrcExtSelOpCode Add z Add OpSel clk zero? clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Data Memory we ALU ALU Control Asanovic/Devadas Spring 2002Register-Indirect Jumps 6.823 PCSrc ( ~j / j RInd / j PCR ) RegWrite MemWrite WBSrc 0x4 clk Add z Add clk Jump & Link? clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdataData Memory we ALU ALU Control OpCode RegDst ExtSel OpSel BSrc zero? Asanovic/Devadas Spring 2002 Jump & Link 6.823 0x4 clk RegDst RegWrite BSrc zero? WBSrc ALU / Mem / PC 31 ExtSelOpCode Add z Add OpSel clk MemWritePCSrc clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdataData Memory we ALU ALU Control rf3 / rf2 / R31 Asanovic/Devadas Spring 2002PC-Relative Jumps 6.823 RegWrite MemWrite WBSrc 0x4 clk RegDst 31 OpCode Add z Add clk PCSrc clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdataData Memory we ALU No new datapath required ALU Control ExtSel OpSel BSrc zero?Ext16 / Ext26 Asanovic/Devadas Spring 2002 6.823 Hardwired Control is pure Combinational Logic: Unpipelined DLX op code zero? combinational logic ExtSel BSrc OpSel MemWrite WBSrc RegDst RegWrite PCSrc Asanovic/Devadas Spring 2002ALU Control & Immediate 6.823 Extension Inst<31:26> (Opcode) Decode Map Inst<5:0> (Func) ALUop 0? + OpSel ( Func, Op, +, 0? ) ExtSel ( sExt16, uExt16, sExt26, High16) Asanovic/Devadas Spring 2002Hardwired Control worksheet 6.823 PCSrc RegWrite MemWrite WBSrc 0x4 clk PCR / RInd / ~j ALU / Mem / PC inst<25:21> inst<20:16> inst<15:11> inst<25:0> inst<31:26><5:0> 31 0x4 Add z Add RegDst BSrc Reg / Imm zero? OpCode clk Add clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdataData Memory we ALU ALU Control ExtSel OpSelrf2 / rf3 / sExt16/uExt16/ Func/R31 sExt26/High16 Op/+ / 0? Asanovic/DevadasHardwired Control Table Spring 2002 6.823 BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rf2 / rf3 / R31 PCSrc1 = j / ~j PCSrc2 = PCR / RInd JR JALR JAL J BEQZz=1 BEQZz=0 SW LW ALUiu ALUi ALUu ALU PC Src Reg Dest WB Src Reg Write Mem Write Op Sel BSrcExt Sel Opcode * * o yes RIndPC R31 RInd* * o no * * PCRsExt26 * * no yes PC R31 PCRsExt26 * * no no * * ~jsExt16 * ? no no * * PCRsExt16 * ? no no * * ~jsExt16 Imm + yes no * * ~jImm Op no yes ALU rf2 ~j* Reg Func no yes ALU rf3 ~j* Reg Func no yes ALU rf3 ~jsExt16 Imm Op no yes ALU rf2 ~jsExt16 Imm + no yes Mem rf2 uExt16 * n * n 0 0 Asanovic/Devadas Spring 2002 6.823 Hardwired Unpipelined Machine �‰ Simple �‰ One instruction per cycle �‰ Why wasn’t this a popular machine style? Asanovic/Devadas Spring 2002 6.823 Unpipelined DLX Clock period must be sufficiently long for all of the following steps to be “completed”: 1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time ⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB �‰ At the rising edge of the following clock, the PC, the register file and the memory are updated Asanovic/Devadas Spring 2002Pipelined DLX Datapath 6.823 addr wdata rdata Data Memory we ALU Imm Ext PC 0x4 Add addr rdata Inst. Memory rd1 GPRs rs1 rs2 ws wd rd2 we write -back phase fetch phase execute phase decode & Reg-fetch phase memory phase IR PC Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably) However, CPI will increase unless instructions are pipelined Asanovic/Devadas Spring 2002An Ideal Pipeline 6.823 stage stage stage 2 3 4 stage 1 �‰ All objects go through the same stages �‰ No sharing of resources between any two stages �‰ Propagation delay through all pipeline stages is equal �‰ The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why? Asanovic/Devadas Spring 2002 6.823 Pipelining History �‰ Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII) �ƒ Usually overlap fetch of next instruction with current execution �‰ IBM Stretch first major “supercomputer” incorporating extensive pipelining, result bypassing, and branch prediction �ƒ project started in 1954, delivered in 1961 �ƒ didn’t meet initial performance goal of 100x faster with 10x faster circuits �ƒ up to 11 macroinstructions in pipeline at same time �ƒ microcode engine highly pipelined also (up to 6 microinstructions in pipeline at same time) �ƒ Stretch was origin of 8-bit byte and lower case characters, carried on into IBM 360 Asanovic/Devadas Spring 2002 6.823 How to divide the datapath into stages Suppose memory is significantly slower than other stages. In particular, suppose tIM = tDM = 10 units tALU = 5 units tRF = tRW = 1 unit Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance Asanovic/Devadas Spring 2002 6.823 Minimizing Critical Path write -back phase fetch phase decode & Reg-fetch phase memory phase addr wdata rdata Data Memory we ALU Imm Ext PC 0 x4 Add addr rdata Inst. Memory rd1 GPRs rs1 rs2 ws wd rd2 we IR & execute tC > max {tIM, tRF + tALU, tDM, tRW} Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase ⇒ increase the critical path by 10% Asanovic/DevadasMaximum Speedup by Spring 2002 6.823 Pipelining For the 4-stage pipeline, given tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit tC could be reduced from 27 units to 10 units ⇒ speedup = 2.7 However, if tIM = tDM = tALU = tRF = tRW = 5 units The same 4-stage pipeline can reduce tC from 25 units to 10 units ⇒ speedup = 2.5 But, since tIM = tDM = tALU = tRF = tRW, it is possible to achieve higher speedup with more stages in the pipeline. A 5-stage pipeline can reduce tC from 25 units to 5 units ⇒ speedup = 5 Asanovic/Devadas Spring 2002 6.823 Technology Assumptions We will assume • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!). It makes the following timing assumption valid tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823 fetch PC 0x4 Add addr wdata rdata Memory we executedecode & Reg-fetch memory IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Memory we ALU write -back phase phase phase phase phase (IF) (ID) (EX) (MA) (WB) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . instruction1 IF1 ID1 EX1 MA1 WB1 instruction2 IF2 ID2 EX2 MA2 WB2 instruction3 IF3 ID3 EX3 MA3 WB3 instruction4 IF4 ID4 EX4 MA4 WB4 instruction5 IF5 ID5 EX5 MA5 WB5 Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823 Resource Usage Diagram fetch PC 0x4 Add addr wdata rdata Memory we executedecode & Reg-fetch memory IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Memory we ALU write -back phase phase phase phase phase (IF) (ID) (EX) (MA) (WB) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5 ID I1 I2 I3 I4 I5 EX I1 I2 I3 I4 I5 MA I1 I2 I3 I4 I5 WB I1 I2 I3 I4 I5 R e s o u r c e s Asanovic/DevadasPipelined Execution: Spring 2002 6.823 ALU Instructions not quite correct! 31PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Data Memory we ALU Asanovic/DevadasPipelined Execution: Spring 2002 6.823 Need for Several IR’s 31 IR IR IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Data Memory we ALU Asanovic/Devadas Spring 2002 6.823 IRs and Control points 31 IR IR IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Data Memory we ALU Are control points connected properly? - Load/Store instructions - ALU instructions Asanovic/Devadas Spring 2002 Pipelined DLX Datapath 6.823 without jumps 31 IR IR IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext addr wdata rdata Data Memory we ALU ExtSel OpSel BSrc RegDst WBSrc MemWrite RegWrite Asanovic/Devadas Spring 2002 6.823 How Instructions can Interact with each other in a Pipeline • An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard • An instruction may produce data that is needed by a later instruction data hazard • In the extreme case, an instruction may determine the next instruction to be executed control hazard (branches, interrupts,...) Asanovic/Devadas Spring 2002 6.823 Feedback to Resolve Hazards stage 1 stage 2 stage 3 stage 4 FB1 FB2 FB3 FB4 Controlling pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) Feedback to previous stages is used to stall or kill instructions
本文档为【lecture05】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑, 图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。
下载需要: 免费 已有0 人下载
最新资料
资料动态
专题动态
is_098539
暂无简介~
格式:pdf
大小:319KB
软件:PDF阅读器
页数:33
分类:互联网
上传时间:2011-03-18
浏览量:20