page: https://www.student.cs.uwaterloo.ca/~cs251/S20/index.shtml
Instructor: Rosina Kharal, Mohammad Mazaheri Kalahrody
Q1-Q12, A0-A6, T1-T3
defn. assembly language is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions.
compiler(cs241) assembler(cs251)
high level language -> assembly -> binary machine code
+---------------------+
| main program |
+------------------------------------------------------+ | |
| CPU main processor | | |
| | | |
| +----------+ +-----------+ | |+-------------------+|
| | | | | | ||my program code data|
| | | | | | || ||
| |instruction +---------------+ |data | | ||000 ... ||
| |memory | | register file | |memory | | ||008 ... ||
| | | |X0 2.fetch | | | | || ||
| |1.all code| |X1 LDUR and | |1.all data | | || ||
| |is copied | | execute.. | |is copied | | || ||
| |here: | | 4.fetch | |here | | || ||
| | | | ADD and exec. | | | || ||
| |000 LDUR .. | 5.fetch STUR| |200 arr[0] | | || ||
| |004 ADD ..| | and exec | | | | || ||
| | | | | |3.data | | || ||
| | | |X31/XZR | |accessed | | || ||
| | | +---------------+ |here | | || ||
| | | |6.data | | || ||
| | | |stored here| | || ||
| +----------+ +-----------+ | || ||
| | |+-------------------+|
+------------------------------------------------------+ | |
| |
+---------------------+
fetch-execute cycle:
each register holds max 64 bits. there are 32 registers: X0,X1,...,X31/XZR
eg. exec in register
// assume values are preloaded into X... registers
// X1: f, X2: g, X3: h, X4: i, X5: j
// f = (g + h) - (i + j)
// equivalent ARM:
// dest lhs rhs
ADD X6, X2, X3 ;; X2, X3 not changed
ADD X7, X4, X5 ;; X6, X7 hold temp value
SUB X1, X6, X7 ;; X1 has the final value
// then put value in X1 to f
types of arm instructions:
ADD X1, X2, X3
adds content of X2 to contents of X3; store result in X1LDUR X1, [X2, #20]
load data from address X2+20 into X1STUR X1, [X2, #30]
store data from X1 into address X2+30ADDI X1, X2, #100
adds immediate value 100 to contents of X2; store result in X1B #28
go to 28 instructions laterCBZ X1, #8
if X1 is 0, then go to 8 instructions later, else next instructionthere are instruction memory and data memory.
each instruction requires 32 bits (4 bytes)
eg. program in memory
// PC instruction
000: ADD X1,X2,X3
004: SUB X1,X3,X5
008: ADDI X2,X12,#16
first instruction line takes 32 bits to specify; start byte is 0; when i request byte 0, i also get 4 bytes in total (0-3)
100: B #3 // go to 100+(12)=112
104: ...
108: ...
112: ...
CBZ
CBNZ
similar but if X1 is zeroPC += PC + offset*4
eg.
// multiply 5 six times
100: ADD X1, XZR, XZR // XZR(X31) is 0. clear the content of X1
104: ADDI X2, XZR, #6 // store 6 to X2
108: ADDI X1, X1, #5 // X1 += 5
112: SUBI X2, X2, #1 // X2 -= 1
116: CBZ X2, #-2 // if X2 is 0, go to 116+(-8)=108, otherwise go to 120
120: ...
M
is data memory100: LDUR X1, [X2,#100]
read from M[100+X2]
store in X1100: STUR X1, [X2,#100]
write X1 value to M[100+X2]
eg.
// f = (g + h) - (i + j)
// assume X10 contains the start address in mem of the data value f, and
// g, h, i, j are consecutive. each variable is 64-bit integer
LDUR X2, [X10, #8] // g into X2
LDUR X3, [X10, #16] // h into X3
LDUR X4, [X10, #24] // i into X4
LDUR X5, [X10, #32] // j into X5
ADD X6, X2, X3
ADD X7, X4, X5
SUB X1, X6, X7
STUR X1, [X10, #0] // X1 into f
eg. read data located in mem at address 240, take data and write it to register 7
LDUR X7, [X31, #240]
review electricity:
V = IR
value
+
| 1
+-------+ +----------+
| | | |
| | | |
| | 0 | |
| +---------+ +-----------+
|
+----------------------------------------+
time
A +--------+
+---> | | F
B | +-------->
+---> | |
C | |
+---> +--------+
OR, AND, NOT, NOR, NAND, XOR:
A | B | A+B | AB | A^B | |||
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
0 | 1 | 1 | 0 | 0 | 1 | 1 | |
1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 0 | 0 | 0 |
eg. conclude a formula from a truth table ((A,B,C)->F)
look at F's that have 1, use the product of A,B,C,~A,~B,~C to assemble 1
A | B | C | F | ||||||
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | |||||
0 | 0 | 1 | 1 | 1 | 1 | ||||
0 | 1 | 0 | 0 | 0 | |||||
0 | 1 | 1 | 0 | 0 | |||||
1 | 0 | 0 | 0 | 0 | |||||
1 | 0 | 1 | 1 | 1 | 1 | ||||
1 | 1 | 0 | 0 | 0 | |||||
1 | 1 | 1 | 1 | 1 | 1 |
(empty cells are 0)
we can say the formula that represents blackbox (A,B,C)->F is .
eg.
A | B | C | F |
---|---|---|---|
... | ... | ... | ... |
1 | 0 | 1 | 1 |
1 | 1 | 0 | 1 |
1 | 1 | 1 | 1 |
simplified:
A | B | C | F |
---|---|---|---|
... | ... | ... | ... |
1 | 0 | 1 | 1 |
1 | 1 | X | 1 |
eg.
A | B | C | F | |||||
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | ||||
0 | 0 | 1 | 1 | 1 | 1 | |||
0 | 1 | X | 0 | 0 | ||||
1 | X | 0 | 0 | 0 | ||||
1 | X | 1 | 1 | 1 | 1 |
identity | ||
zero/one | ||
absorption | ||
inverse | ||
commutative | ||
associative | ||
distributive | ||
DM |
eg. simplify a formula
q2: C
eg.
nice clean circuit: straight lines
we have
eg. deriving truth table from circuit
X | Y | A | B | C | F | ||
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 1 | 0 | ||
1 | 0 | 1 | 0 | 1 | 1 | ||
0 | 1 | 1 | 1 | 0 | 1 | ||
0 | 0 | 1 | 1 | 1 | 0 |
eg.
input | resistance Q1 | F |
---|---|---|
0 | high | 1 |
1 | low | 0 |
input | resistance Q1 | F |
---|---|---|
0 | low | 0 |
1 | high | 1 |
A=1 is 5V, A=0 is 0.8V.
A | Q1 | Q2 | F |
---|---|---|---|
0 | low | high | 1 |
1 | high | low | 0 |
eg. what will the signal be at B?
the only way B is 0 is both Q1,Q2 are low, which is not possible in the configuration. so B is always 1.
in other cases B and F are in parallel, so voltages are equal
eg. what will the signal be at C?
C is always 0 because it is always grounded.
A | B | Q1 | Q2 | Q3 | Q4 | Z | |
---|---|---|---|---|---|---|---|
0 | 0 | low | low | high | high | 1 | |
0 | 1 | low | high | low | high | 1 | |
1 | 0 | high | low | low | high | 1 | |
1 | 1 | high | high | low | low | 0 | |
h | h | h | h | float | bad state | ||
l | l | l | l | short circuit | bad state |
eg. which resistances must be low to have a short circuit?
one of Q1,Q2, both of Q3,Q4
eg. which resistances must be high to have a float
both of Q1,Q2, one of Q3,Q4
eg. 3-to-8 (or 3-bit) decoder
dec | A2 | A1 | A0 | D7 | D6 | D5 | D4 | D3 | D2 | D1 | D0 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ||
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ||
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ||
3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ||
4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ||
5 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ||
6 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ||
7 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
we can draw the decoder using symbol
implementation using only AND gates
q1 2
eg. 4-1 multiplexer
implementation:
or
eg. realize truth table using one mux
X---X
+X X
|X X
| + X Y Z |F
Z +---+D0 | 0 0 0 |0
| | 0 0 1 |1
| | 0 1 0 |0
0 +---+D1 | 0 1 1 |0
| +------+F 1 0 1 |0
| | 1 0 0 |1
~Z +---+D2 | 1 1 0 |1
| | 1 1 1 |1
| |
1 +---+D3 |
| X (when X=Y=0, F is decided
+X X by D0, we see Z behaves
X ---X the same as F...)
+ + when X=0,Y=1,...
| |
+ +
S1 S0
X Y
q1 d
rise fall 1
+-------+ +-------+
| | | | 0
+-----+ +-------+ +------+
<--------------->
clock cycle
latch store S=0, R=0. for any previous value (Q, ~Q), the outputs (Q, ~Q) don't change
latch set S=1, R=0. output becomes (Q, ~Q) = (1, 0)
latch reset S=0, R=1. output becomes (Q, ~Q) = (0, 1)
once set or reset completes, the system will return to steady state S=0, R=0. so new value stored.
latch oscillate S=1, R=1. undesirable behavior. should be avoided.
C | D | next state of Q |
---|---|---|
0 | x | no change |
1 | 0 | Q=0 (reset) |
1 | 1 | Q=1 (set) |
register: an array of flip-flops
register file: a way to organize registers
add x1,x2,x3
) (two 5bit numbers to specify because we have 31 registers) 5bit +---------------------------------+
+-------> read read data | 64bit
| register #1 output #1 +--------->
5 | |
+-------> read |
| register #2 |
5 | |
+-------> write |
| register # |
64 | read data | 64
+-------> write output #2 +--------->
| data |
| write |
+--------------^------------------+
|control bit
|
+
control bit is 1 if one wants to write stuff (add
, ldur
), it is 0 if not (b
, cbz
).
note the number of register does not depend on architecture.
eg. 64bit architecture, 16 register in total. then
the only difference is the size of decoder (input/output).
we read two registers in parallel.
we never do both read and write at the same time
eg. traffic lights
+ N +
| |
| |
+-----+
NS light+-----+
+--------+ +-------+
||
W || S
||
+--------+ +-------+
| |SW light
| |
| |
+ S +
one bit is used to indicate green/red.
eg. extend the traffic light to handle yellow light
state transition table:
current state | inputs | next state | |
---|---|---|---|
S2 S1 S0 | NScar EWcar TS | S2' S1' S0' | |
000 | XX0 | 000 | |
000 | XX1 | 001 | |
001 | X0X | 001 | |
001 | X1X | 010 | |
010 | XXX | 100 | |
011 | XXX | XXX | don't care |
100 | XX0 | 100 | |
100 | XX1 | 101 | |
101 | 0XX | 101 | |
101 | 1XX | 110 | |
110 | XXX | 000 | |
111 | XXX | XXX | don't care |
use the min terms (output bit to be 1), we can conclude formula for each bit
output table:
S2 S1 S0 | NSg NSy NSr EWg EWy EWr TR | |
---|---|---|
000 | 1000010 | NSg,EWr |
001 | 1000010 | NSg,EWr |
010 | 0100011 | NSy,EWr,TR |
011 | XXXXXXX | |
100 | 0011000 | NSr,EWg |
101 | 0011000 | NSr,EWg |
110 | 0010101 | NSr,EWy,TR |
111 | XXXXXXX |
eg. can draw a circuit for one state and one output of them
(triangle means clocked)
armv8 word:
...
1 0011011
see cs241 note
eg. with 4 bits, 1110
is -6. what's this number with 8 bits?
copy old MSB to the new positions, answer is 1111 1110
.
1101 | ||
arithmetic shift right | 1110 | |
logical shift right | 0110 | |
shift left | 1010 | |
circular shift right | 1110 | use overflown bit for void space |
circular shift left | 1011 | |
bitwise and, or, xor, nor |
A | B | CarryIn | CarryOut | Sum | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | |
0 | 0 | 1 | 0 | 1 | |
0 | 1 | 0 | 0 | 1 | |
0 | 1 | 1 | 1 | 0 | |
1 | 0 | 0 | 0 | 1 | |
1 | 0 | 1 | 1 | 0 | |
1 | 1 | 0 | 1 | 0 | |
1 | 1 | 1 | 1 | 1 |
operation is the input to the mux (0,1,2,3)
ALU operation | function | |
---|---|---|
Ainv Binv Op1 Op0 | ||
0000 | AND | |
0001 | OR | |
0010 | add | |
0110 | sub | |
1100 | NOR | |
0011 | Pass | result=B |
note Binv is connected to the first CarryIn
eg. to compute subtraction a-b=a+(-b), we invert b, add to a, then add 1.
eg. to compute A NOR B, we invert both a and b, and perform A AND B. (since )
eg. it can also do NAND.
eg. overflow happens only if A, B have same sign and the result has different sign, iff the final sign bit's CarryIn and CarryOut are different.
if two operands have different sign, the final CarryOut is ignored if there is.
we use the following notation to represent ALU:
procedure for doing 64bit multiplication -> result is 128 bit:
+-----+
|start|
+--+--+
|
| <--------------------+
v |
Multiplier0=1 +----+-----+ Multiplier0=0 |
+------+ |1.test |-----+ |
| |Multiplier0 | |
| +----------+ | |
v | |
+----+-------+ | |
|1a.add | | |
|multiplicand| | |
|to product; | | |
|place result| | |
|to Product +---+ <---+ |
|register | | | |
+------------+ | | |
v v |
+----+---------+-----+ |
|2.shift multiplicand| |
|register left 1 bit | |
+---------+----------+ |
| |
+---------v-----------+ |
|2.shift multiplier | |
|register right 1 bit | |
+--------+------------+ |
| |
+-----v-----+ |
| 64 reps? +--------------+
| | No
+-----------+
|Yes
v
+--+-+
|done|
+----+
the hardware:
to represent a fractional number
eg. convert a binary point number to decimal
10.10101 -> 1*2^1 + 0*2^0 + 1*2^-1 + 0*2^-2 1*2^-3 + +0*2^-4 + 1*2^-5
-> 2.65625
31 30 22 0
+---+-------------+-----------------------------+
| S | exponent | significand |
+---+-------------+-----------------------------+
1bit 8bits 23bits
interpreted value from the hardware is
where bias=127 for single precision.
eg. i want an exponent of -50: 2^-50. what is my biased 8 bit representation?
77-127=-50 so it is 77.
eg. i want an exponent of -50: 2^50. what is my biased 8 bit representation?
177-127=50 so it is 177.
eg. convert 0.625 decimal to binary
multiply fraction by 2 repeatedly:
0.625*2=1.25 | keep 1 as first digit | 0.1 |
0.25*2=0.5 | keep 0 as next digit | 0.10 |
0.5*2=1.0 | keep 1 as next digit | 0.101 |
rest digit is 0 so it is done. the result is precisely 0.101.
eg. convert 2.625 to binary
2 is 10 as binary,
.625 is .101 as binary.
so 2.625 is 10.101 * 2^0 as binary.
need to normalize, so it should be 1.0101 * 2^1.
SIGN BIT: 0 (positive)
exponent: x-bias should give 1+127=128
significant bits: 0101
hence answer is 0 10000000 01010000000000000000000.
eg. convert 0.1 to floating point number
0.1*2=0.2 | keep 0 as first digit | 0.0 |
0.2*2=0.4 | keep 0 as next digit | 0.00 |
0.4*2=0.8 | keep 0 as next digit | 0.000 |
0.8*2=1.6 | keep 1 as next digit | 0.0001 |
0.6*2=1.2 | keep 1 as next digit | 0.00011 |
0.2*2=0.4 | repeat |
repeating pattern is which is infinite. have to truncate.
eg. what is binary reading from 0 01111111 10101000000000000000000
exponent bit is 01111111 -> 127, so real exponent is 127-127=0. the binary number is 1.10101 * 2^0.
eg. overflow is detected using exponent. 1.010101010101010101010101010101010101010101 * 2^-129 has overflow, but that is determined by "-129" not the precision bits (which are truncated anyways).
+---------------------------------------+ +------+
| +------+ +-----------------+ +-----+ | | |
| | | | | | | | | DRAM |
| | SRAM | | REGISTER FILE | |SRAM | | | |
| | | | | | | | | | DISK
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| +------+ +-----------------+ +-----+ | | |
+---------------------------------------+ +------+
21bits
address +----> +-------------+
| |
| |
chip select +----> | |
| SRAM | 16
output +----> | 2M x 16 +------>Dout[15-0]
| |
write enable+----> | |
| |
16bits | |
Din[15-0]+----> +-------------+
address is 21 bits and data is 16 bits in this eg.
C | X | F | |
---|---|---|---|
1 | 1 | 1 | both low resistance, transmit as-is |
1 | 0 | 1 | both low resistance, transmit as-is |
0 | x | - | both high resistance, both transistors off |
eg. 3-state buffer is depicted by triangle. use them to select a data
structure of SRAM
better SRAM uses min 6 transistors / bit, register file 46 transistors / bit.
by using two level encoding, we split the address, half go to the decoder to select word, half go to the mux to select a particular bit out of the word.
important:
CBZ
).
ADD
, two registers X2,X3 go to ALU, result goes back to register file as data, the specified register X1 is written.an instruction is specified by 32 bits
R-format: ADD X1, X2, X3 => ADD Rd, Rn, Rm
opcode (31-21)
Rm (20-16)
shamt (15-10)
Rn (9-5)
Rd (4-0)
I-format: ADDI X1, X2, #4 => ADDI Rd, Rn, immediate
opcode (31-22)
ALU immediate (21-10) (unsigned 12 bit)
Rn (9-5)
Rd (4-0)
D-format: LDUR X1, [X2, #200] => LDUR Rd, [Rn, immediate]
opcode (31-21)
DT address (20-12) (signed 9 bit)
op (11-10) (not used)
Rn (9-5)
Rd (4-0)
B-format: B #3000
opcode (31-26)
BR address (25-0) (signed 26 bit)
CB-format: CBZ X1, #3000 => CBZ Rt, immediate
opcode (31-24)
COND BR address (23-5) (signed 19 bit)
Rt (4-0)
instruction | opcode | format |
---|---|---|
B | 0001 01 | B |
ADD | 1000 1011 000 | R |
ADDI | 1001 0001 00 | I |
CBZ | 1011 0100 | CB |
CBNZ | 1011 0101 | CB |
SUB | 1100 1011 000 | R |
SUBI | 1101 0001 00 | I |
STUR | 1111 1000 000 | D |
LDUR | 1111 1000 010 | D |
Control bits:
type | Reg2Loc | ALUSrc | MemToReg | RegWrite | MemRead | MemWrite | Branch | ALUop1 | ALUop0 |
---|---|---|---|---|---|---|---|---|---|
R-format | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
LDUR | x | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
STUR | 1 | 1 | x | 0 | 0 | 1 | 0 | 0 | 0 |
CBZ | 1 | 0 | x | 0 | 0 | 0 | 1 | 0 | 1 |
eg. ADD X1, X2, X3
is R-format
suppose current PC is 100. first these happen in parallel:
X2
) goes to Read register 1
X3
) goes to Mux, 0 is chosen, so it goes to Read register 2
X1
) goes to Write register; also goes to Mux but is not chosento the right:
to the right:
at the top:
ALUop has four choices: 00,01,10,11. if it is 10, that means instruction is R format. we will use function bit from instruction[31-21] to determine what to do on ALU. otherwise function bits are useless, we forward the ALUop to ALU (ie 00 -> add, 11 -> sub, 01 -> passb). in this case ADD
is R-format, so funct bits are used.
eg. ADDI X1, X2, #4
is I-format
PC operations are same as above. first
X2
)ADDI
or SUBI
, then instruction[21-10] are unsigned, 12 bit immediate value i care about. Sign-extend generates a 64 bit number by padding 0's even if top bit may be 1. (note if other instructions where immediates are signed, it will pad 1's correctly for negative values, such as LDUR
)to the right:
eg. LDUR X1, [X2, #400]
is D-format
eg. CBZ X1, #100
is CB-format
at the top:
at same time:
at the right:
overview:
operation | ALUop | opcode | ALU control input | action |
---|---|---|---|---|
LDUR | 00 | 0010 | add | |
STUR | 00 | 0010 | add | |
CBZ | 01 | 0011 | passb | |
ADDI | 00 | 0010 | add | |
SUBI | 11 | 0110 | sub | |
ADD | 10 | 100010 11000 | 0010 | add |
SUB | 10 | 110010 11000 | 0110 | sub |
AND | 10 | 100010 10000 | 0000 | and |
ORR | 10 | 101010 10000 | 0001 | or |
eg.
eg. now assume two top PC adders also take 200ps. timings stay the same because top two adders run in parallel with bottom.
eg. now assume Control mux takes 10ps. then R-formats takes extra 10ps (reads both reg, have to wait for mux); LDUR does not (it only reads one reg); STUR takes extra 10ps; CBZ takes extra 10ps (its reg goes to Mux)
eg. B #400
31-26 is opcode, 25-0 is BR address. also compute PC + 4*offset.
type | Reg2Loc | ALUSrc | MemToReg | RegWrite | MemRead | MemWrite | Branch | ALUop1 | ALUop0 | Uncond branch |
---|---|---|---|---|---|---|---|---|---|---|
B | x | x | x | 0 | 0 | 0 | x | x | x | 1 |
Uncond branch is 0 for other instructions.
needs to take 400ps in above settings.
memread, memwrite, regwrite must not be don't care
eg. at the end of every clock cycle, PC is updated to PC+4 instantly
pipelined control
hazard: event that blocks normal flow of instructions through pipeline
eg.
000 SUB X2,X1,X3
004 NOP
008 NOP
012 AND X12,X2,X5
016 ORR X13,X6,X2
020 ADD X14,X2,X2
024 STUR X15,[X2,#100]
NOP
has its own line number, uses one clock cyclesolution2: in SUB, X2 is computed in ALU in CC3, in AND, X2 is an input of ALU in CC4. we can forward the computed value directly.
eg. have two R-formats. if destination register of one instruction is the first source of the next instruction, then the condition is
where
eg.
ADD X2, X6, X5
AND X12, X2, X5
eg. load-use hazard: this does not work if the first instruction is LDUR X2, [...] since the value is not ready. we have to delay it (by inserting a NOP between LDUR and AND), and do the forwarding next stage
LDUR X2, [X5, #100] // NOP should be followed
AND X12, X2, X5
eg. consider code
1. LDUR X1,[X2,#10]
2. LDUR X3,[X5,#10] // load-use hazard
3. ADD X3,X5,X3
4. STUR X3,[X5,#10]
5. LDUR X1,[X6,#10] // load-use hazard
6. ADD X3,X1,X4
7. STUR X4,[X3,#10]
8. ADDI X6,X4,X4
conditions of forwarding
EX/MEM.RegWrite and EX/MEM.RegisterRd != 31 and EX/MEM.RegisterRd == ID/EX.RegisterRn1
, then forwardA = 10MEM/WB.RegWrite and MEM/WB.RegisterRd != 31 and not (EX/MEM.RegWrite and EX/MEM.Rd != 31 and EX/MEM.RegisterRd == ID/EX.RegisterRn1) and MEM/WB.RegisterRd == ID/EX.RegisterRn1
, then ForwardA = 01forwarding helps everything except LDUR.
conditions
ID/EX.MemRead and (ID/EX.RegisterRd == IF/ID.RegisterRn1 or ID/EX.RegisterRd == IF/ID.RegisterRm2)
eg. insert line 5 between 2 and 3 in the last eg to avoid NOP.
eg. in the last eg, it takes 8 (each) + 2 (two load-use stalling) + 4 (start) = 14 cycles
eg.
before the branching decision is made in the branch in mem stage, three instructions (in IF, ID, EX stages) already began, we might want to cancel them if the branch condition is met.
CBZ X1,9
right after it goes out of register fileeg. what is the running time for
100 ADDI X1,XZR,#20
104 ADDI X2,XZR,#0
108 LDUR X3,[X4,#0] // loop
112 ADD X2,X2,X3 // hazard <- insert stalling
116 ADDI X4,X4,#8
120 SUBI X1,X1,#1
124 CBNZ X1,#-4 // every time this jumps, add flushing
128 SUB X5,X5,X2
132 LDUR X3,[X6,#100]
136 ADDI X29,XZR,#100
assuming forwarding, stalling, and branching in MEM
if there is no flushing, we have to insert 3 NOPs after line 124.
assuming branching in ID
if there is no flushing, we have to insert 1 NOP after line 124.
if we swap line 112 and 120, both data hazard in 112 and branch data hazard in 124 are also resolved -> time: 168-20(2)=128
a branch data hazard can exist in any pipelined datapath when forwarding is not present.
in the IF stage
eg. strategies:
100 ADDI X4,X31,#6
...
120 SUBI X1,X1,#1
124 CBNZ X4,#-5
128 ADD X1,X2,X3
eg. for the above eg
principles of locality:
speed | CPU | size | cost | current technology |
---|---|---|---|---|
fastest | memory | smallest | highest | SRAM |
memory | DRAM | |||
slowest | memory | biggest | lowest | magnetic disk |
ignore L2, L3; imagine we only have L1 cache (instruction memory/cache, data memory) and main memory.
eg. one way associative: items map to one and only one location
when CPU needs memory access at address abcde
, cde
called lower order 3 bits, ab
called tag bits.
cde
is populated with data at mem[abcde]
copied from memory, and has a tag bit ab
. the valid bit is updated to 1.cde
is indexed in cache, if V is 1, check the tag bit, if it is ab
, get the data there.
V: valid bit: cache is meaningful or empty cache
eg. 2-way set associative cache: 2 rows
memory address being abcde
, for cache, index bit is de
, have two tag fields, each 3 bits (abc
)
de
, then find empty slots in this row and store abc
as tag bits and copy mem data.
generally increase number of hits
eg. 4-way set associative cache: 2x4 configuration
eg. fully associative cache: 1x8 configuration, only one unique addr, 8 tag fields, 5 tag bits
eg. implementation of 4-way associative cache
note all 4 columns are compared against tag bits in parallel.
eg. as associativity (column) increase, lookup time remains the same but hardware increases, number of possible address in cache increases.
eg.
eg.
eg.
run time: 104*3 (misses) + 104*1 (writeback) + 6(instr), assuming no hazard
LDUR X1, [X2,#0]
is virtual addresswhen program 1 loaded:
remark. all pages are always on disk.
page replacement policy:
if Ref bit is on, that means the address translation exists in both page table and TLB. if we kick out the page from RAM (thus from page table), we release both page table and TLB references (set valid to 0, if dirty is 1, write back). we also remove cache entry.
remark. there are two levels of dirty bits: page level and cache level. for STUR, we update cache, page, then to disk.