Asanovic/Devadas
Spring 2002
6.823
Simple Instruction
Pipelining
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Asanovic/Devadas
Spring 2002Processor Performance 6.823
Equation
Time = Instructions * Cycles * Time
Program Program Instruction Cycle
� Instructions per program depends on source code,
compiler technology, and ISA
� Microcoded DLX from last lecture had cycles per
instruction (CPI) of around 7 minimum
� Time per cycle for microcoded DLX fixed by
microcode cycle time
— mostly ROM access + next µPC select logic
Asanovic/Devadas
Spring 2002
6.823
Pipelined DLX
To pipeline DLX:
� First build unpipelined DLX with CPI=1
� Next, add pipeline registers to reduce
cycle time while maintaining CPI=1
Asanovic/Devadas
Spring 2002
6.823
A Simple Memory Model
WriteEnable
Address
ReadData
WriteData
MAGIC
RAM
Clock
Reads and writes are always completed in one cycle
� a Read can be done any time (i.e. combinational)
� a Write is performed at the rising clock edge
if it is enabled
⇒ the write address, data, and enable
must be stable at the clock edge
Asanovic/Devadas
Spring 2002
6.823
Datapath for ALU Instructions
OpCode
rd1
rs1
rs2
rd2
inst<25:21>
inst<20:16>
GPRs
z
ALU
inst<15:0>
inst<31:26> <5:0>
ExtSel OpSel
ALU
Control
Imm
Ext
BSrc
clk
inst<15:11>
RegWrite
ws
wd
we
RegDst
0x4
Add
clk
addr
inst
Inst.
Memory
PC
0x4
Add
clk
addr
inst
Inst.
Memory
PC
rf2 / rf3 Reg / Imm
6 5 5 5 5 6
rf3 ← (rf1) func (rf2)
opcode rf1 rf2 immediate rf2 ← (rf1) op immediate
0 rf1 rf2 func 0 rf3
Asanovic/DevadasDatapath for Memory Spring 2002 6.823
Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
at some level the two memories have
to be the same
Princeton style: the same (von Neumann’s influence)
- A Load or Store instruction requires
accessing the memory more than once
during its execution
Asanovic/DevadasLoad/Store Instructions: Spring 2002 6.823
Harvard-Style Datapath
RegWrite MemWrite
0x4
Add
clk
addr
inst
Inst.
Memory
PC
WBSrc
ALU / Mem
“base”
disp
ALU
Control
z
ALU
clk
addr
wdata
rdataData
Memory
we
clk
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
OpCode RegDst ExtSel OpSel BSrc
6 5 5 16 addressing mode
(rf1) + displacementopcode rf1 rf2 ent displacem
31 26 25 21 20 16 15 0
rf1 is the base register
rf2 is the destination of a Load or the source for a Store
Asanovic/Devadas
Spring 2002
Memory Hierarchy c.2002 6.823
Desktop & Small Server
Proc
I$
D$
0.5-2ns
2-3 clk
8~64KB
On-chip Caches
L2
<10ns
5~15 clk
0.25-2MB
SRAM/
eDRAM
Interleaved
Banks of DRAM
Hard DiskOff-chip
L3 Cache
< 25ns
15~50 clk
1~8MB
~150ns
100~300 clk
64M~1GB
~10ms seek time
~107 clk
20~100GB
Our memory model is a good approximation of
the hierarchical memory system when we hit in
the on-chip cache
Asanovic/Devadas
Spring 2002Conditional Branches 6.823
PCSrc ( ~j / j ) RegWrite MemWrite WBSrc
0x4
clk
RegDst BSrcExtSelOpCode
Add
z
Add
OpSel
clk
zero?
clk
addr
inst
Inst.
Memory
PC rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Data
Memory
we
ALU
ALU
Control
Asanovic/Devadas
Spring 2002Register-Indirect Jumps 6.823
PCSrc ( ~j / j RInd / j PCR ) RegWrite MemWrite WBSrc
0x4
clk
Add
z
Add
clk
Jump & Link?
clk
addr
inst
Inst.
Memory
PC rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdataData
Memory
we
ALU
ALU
Control
OpCode RegDst ExtSel OpSel BSrc zero?
Asanovic/Devadas
Spring 2002
Jump & Link 6.823
0x4
clk
RegDst
RegWrite
BSrc zero?
WBSrc
ALU / Mem / PC
31
ExtSelOpCode
Add
z
Add
OpSel
clk
MemWritePCSrc
clk
addr
inst
Inst.
Memory
PC rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdataData
Memory
we
ALU
ALU
Control
rf3 / rf2 / R31
Asanovic/Devadas
Spring 2002PC-Relative Jumps 6.823
RegWrite MemWrite WBSrc
0x4
clk
RegDst
31
OpCode
Add
z
Add
clk
PCSrc
clk
addr
inst
Inst.
Memory
PC rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdataData
Memory
we
ALU
No new
datapath
required
ALU
Control
ExtSel OpSel
BSrc zero?Ext16 / Ext26
Asanovic/Devadas
Spring 2002
6.823
Hardwired Control is pure
Combinational Logic:
Unpipelined DLX
op code
zero?
combinational
logic
ExtSel
BSrc
OpSel
MemWrite
WBSrc
RegDst
RegWrite
PCSrc
Asanovic/Devadas
Spring 2002ALU Control & Immediate 6.823
Extension
Inst<31:26> (Opcode)
Decode Map
Inst<5:0> (Func)
ALUop
0?
+
OpSel
( Func, Op, +, 0? )
ExtSel
( sExt16, uExt16,
sExt26, High16)
Asanovic/Devadas
Spring 2002Hardwired Control worksheet 6.823
PCSrc RegWrite MemWrite WBSrc
0x4
clk
PCR / RInd / ~j ALU / Mem
/ PC
inst<25:21>
inst<20:16>
inst<15:11>
inst<25:0>
inst<31:26><5:0>
31
0x4
Add
z
Add
RegDst
BSrc
Reg / Imm
zero?
OpCode
clk
Add
clk
addr
inst
Inst.
Memory
PC rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdataData
Memory
we
ALU
ALU
Control
ExtSel OpSelrf2 / rf3 / sExt16/uExt16/ Func/R31 sExt26/High16 Op/+ / 0?
Asanovic/DevadasHardwired Control Table Spring 2002 6.823
BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rf2 / rf3 / R31
PCSrc1 = j / ~j PCSrc2 = PCR / RInd
JR
JALR
JAL
J
BEQZz=1
BEQZz=0
SW
LW
ALUiu
ALUi
ALUu
ALU
PC
Src
Reg
Dest
WB
Src
Reg
Write
Mem
Write
Op
Sel
BSrcExt
Sel
Opcode
* * o yes RIndPC R31
RInd* * o no * *
PCRsExt26 * * no yes PC R31
PCRsExt26 * * no no * *
~jsExt16 * ? no no * *
PCRsExt16 * ? no no * *
~jsExt16 Imm + yes no * *
~jImm Op no yes ALU rf2
~j* Reg Func no yes ALU rf3
~j* Reg Func no yes ALU rf3
~jsExt16 Imm Op no yes ALU rf2
~jsExt16 Imm + no yes Mem rf2
uExt16
* n
* n
0
0
Asanovic/Devadas
Spring 2002
6.823
Hardwired Unpipelined Machine
� Simple
� One instruction per cycle
� Why wasn’t this a popular machine style?
Asanovic/Devadas
Spring 2002
6.823
Unpipelined DLX
Clock period must be sufficiently long for all of the
following steps to be “completed”:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB
� At the rising edge of the following clock, the
PC, the register file and the memory are updated
Asanovic/Devadas
Spring 2002Pipelined DLX Datapath 6.823
addr
wdata
rdata
Data
Memory
we
ALU
Imm
Ext
PC
0x4
Add
addr
rdata
Inst.
Memory
rd1
GPRs
rs1
rs2
ws
wd rd2
we
write
-back
phase
fetch
phase
execute
phase
decode & Reg-fetch
phase
memory
phase
IR
PC
Clock period can be reduced by dividing the execution
of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably)
However, CPI will increase unless instructions
are pipelined
Asanovic/Devadas
Spring 2002An Ideal Pipeline 6.823
stage stage stage
2 3 4
stage
1
� All objects go through the same stages
� No sharing of resources between any two stages
� Propagation delay through all pipeline stages is equal
� The scheduling of an object entering the pipeline is
not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines. An instruction pipeline, however,
cannot satisfy the last condition. Why?
Asanovic/Devadas
Spring 2002
6.823
Pipelining History
� Some very early machines had limited pipelined
execution (e.g., Zuse Z4, WWII)
� Usually overlap fetch of next instruction with current execution
� IBM Stretch first major “supercomputer”
incorporating extensive pipelining, result
bypassing, and branch prediction
� project started in 1954, delivered in 1961
� didn’t meet initial performance goal of 100x faster with 10x
faster circuits
� up to 11 macroinstructions in pipeline at same time
� microcode engine highly pipelined also (up to 6
microinstructions in pipeline at same time)
� Stretch was origin of 8-bit byte and lower case characters,
carried on into IBM 360
Asanovic/Devadas
Spring 2002
6.823
How to divide the datapath
into stages
Suppose memory is significantly slower than other
stages. In particular, suppose
tIM = tDM = 10 units
tALU = 5 units
tRF = tRW = 1 unit
Since the slowest stage determines the clock, it may
be possible to combine some stages without any loss
of performance
Asanovic/Devadas
Spring 2002
6.823
Minimizing Critical Path
write
-back
phase
fetch
phase
decode & Reg-fetch
phase
memory
phase
addr
wdata
rdata
Data
Memory
we
ALU
Imm
Ext
PC
0 x4
Add
addr
rdata
Inst.
Memory
rd1
GPRs
rs1
rs2
ws
wd rd2
we
IR
& execute
tC > max {tIM, tRF + tALU, tDM, tRW}
Write-back stage takes much less time than other stages.
Suppose we combined it with the memory phase
⇒ increase the critical path by 10%
Asanovic/DevadasMaximum Speedup by Spring 2002 6.823
Pipelining
For the 4-stage pipeline, given
tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit
tC could be reduced from 27 units to 10 units ⇒ speedup = 2.7
However, if tIM = tDM = tALU = tRF = tRW = 5 units
The same 4-stage pipeline can reduce tC from 25 units to
10 units
⇒ speedup = 2.5
But, since tIM = tDM = tALU = tRF = tRW, it is possible to
achieve higher speedup with more stages in the pipeline.
A 5-stage pipeline can reduce tC from 25 units to
5 units
⇒ speedup = 5
Asanovic/Devadas
Spring 2002
6.823
Technology Assumptions
We will assume
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!).
It makes the following timing assumption valid
tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW
A 5-stage pipelined Harvard-style architecture will
be the focus of our detailed design
Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823
fetch
PC
0x4
Add
addr
wdata
rdata
Memory
we
executedecode & Reg-fetch memory
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Memory
we
ALU
write
-back
phase phase phase phase phase
(IF) (ID) (EX) (MA) (WB)
time t0 t1 t2 t3 t4 t5 t6 t7 . . . .
instruction1 IF1 ID1 EX1 MA1 WB1
instruction2 IF2 ID2 EX2 MA2 WB2
instruction3 IF3 ID3 EX3 MA3 WB3
instruction4 IF4 ID4 EX4 MA4 WB4
instruction5 IF5 ID5 EX5 MA5 WB5
Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823
Resource Usage Diagram
fetch
PC
0x4
Add
addr
wdata
rdata
Memory
we
executedecode & Reg-fetch memory
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Memory
we
ALU
write
-back
phase phase phase phase phase
(IF) (ID) (EX) (MA) (WB)
time t0 t1 t2 t3 t4 t5 t6 t7 . . . .
IF I1 I2 I3 I4 I5
ID I1 I2 I3 I4 I5
EX I1 I2 I3 I4 I5
MA I1 I2 I3 I4 I5
WB I1 I2 I3 I4 I5
R
e
s
o
u
r
c
e
s
Asanovic/DevadasPipelined Execution: Spring 2002 6.823
ALU Instructions
not quite correct!
31PC A
B
Y
R
MD1 MD2
addr
inst
Inst
Memory
0x4
Add
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Data
Memory
we
ALU
Asanovic/DevadasPipelined Execution: Spring 2002 6.823
Need for Several IR’s
31
IR IR IR
PC A
B
Y
R
MD1 MD2
addr
inst
Inst
Memory
0x4
Add
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Data
Memory
we
ALU
Asanovic/Devadas
Spring 2002
6.823
IRs and Control points
31
IR IR IR
PC A
B
Y
R
MD1 MD2
addr
inst
Inst
Memory
0x4
Add
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Data
Memory
we
ALU
Are control points connected properly?
- Load/Store instructions
- ALU instructions
Asanovic/Devadas
Spring 2002
Pipelined DLX Datapath 6.823
without jumps
31
IR IR IR
PC A
B
Y
R
MD1 MD2
addr
inst
Inst
Memory
0x4
Add
IR
rd1
GPRs
rs1
rs2
ws
wd rd2
we
Imm
Ext
addr
wdata
rdata
Data
Memory
we
ALU
ExtSel
OpSel
BSrc
RegDst
WBSrc
MemWrite
RegWrite
Asanovic/Devadas
Spring 2002
6.823
How Instructions can Interact
with each other in a Pipeline
• An instruction in the pipeline may need a resource
being used by another instruction in the pipeline
structural hazard
• An instruction may produce data that is needed by
a later instruction
data hazard
• In the extreme case, an instruction may determine
the next instruction to be executed
control hazard (branches, interrupts,...)
Asanovic/Devadas
Spring 2002
6.823
Feedback to Resolve Hazards
stage
1
stage
2
stage
3
stage
4
FB1 FB2 FB3 FB4
Controlling pipeline in this manner works provided
the instruction at stage i+1 can complete without
any interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
Feedback to previous stages is used to stall or kill
instructions
本文档为【lecture05】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。