page: https://www.student.cs.uwaterloo.ca/~cs251/S20/index.shtml

Instructor: Rosina Kharal, Mohammad Mazaheri Kalahrody

Week 1. May 11

Q1-Q12, A0-A6, T1-T3

defn. assembly language is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions.

compiler(cs241) assembler(cs251) high level language -> assembly -> binary machine code

ARM overview

+---------------------+ | main program | +------------------------------------------------------+ | | | CPU main processor | | | | | | | | +----------+ +-----------+ | |+-------------------+| | | | | | | ||my program code data| | | | | | | || || | |instruction +---------------+ |data | | ||000 ... || | |memory | | register file | |memory | | ||008 ... || | | | |X0 2.fetch | | | | || || | |1.all code| |X1 LDUR and | |1.all data | | || || | |is copied | | execute.. | |is copied | | || || | |here: | | 4.fetch | |here | | || || | | | | ADD and exec. | | | || || | |000 LDUR .. | 5.fetch STUR| |200 arr[0] | | || || | |004 ADD ..| | and exec | | | | || || | | | | | |3.data | | || || | | | |X31/XZR | |accessed | | || || | | | +---------------+ |here | | || || | | | |6.data | | || || | | | |stored here| | || || | +----------+ +-----------+ | || || | | |+-------------------+| +------------------------------------------------------+ | | | | +---------------------+

fetch-execute cycle:

register

each register holds max 64 bits. there are 32 registers: X0,X1,...,X31/XZR

eg. exec in register

// assume values are preloaded into X... registers // X1: f, X2: g, X3: h, X4: i, X5: j // f = (g + h) - (i + j) // equivalent ARM: // dest lhs rhs ADD X6, X2, X3 ;; X2, X3 not changed ADD X7, X4, X5 ;; X6, X7 hold temp value SUB X1, X6, X7 ;; X1 has the final value // then put value in X1 to f

types of arm instructions:

there are instruction memory and data memory.

each instruction requires 32 bits (4 bytes)

eg. program in memory

// PC instruction 000: ADD X1,X2,X3 004: SUB X1,X3,X5 008: ADDI X2,X12,#16

first instruction line takes 32 bits to specify; start byte is 0; when i request byte 0, i also get 4 bytes in total (0-3)

branch

100: B #3 // go to 100+(12)=112 104: ... 108: ... 112: ...

conditional branch CBZ

eg.

// multiply 5 six times 100: ADD X1, XZR, XZR // XZR(X31) is 0. clear the content of X1 104: ADDI X2, XZR, #6 // store 6 to X2 108: ADDI X1, X1, #5 // X1 += 5 112: SUBI X2, X2, #1 // X2 -= 1 116: CBZ X2, #-2 // if X2 is 0, go to 116+(-8)=108, otherwise go to 120 120: ...

memory access

eg.

// f = (g + h) - (i + j) // assume X10 contains the start address in mem of the data value f, and // g, h, i, j are consecutive. each variable is 64-bit integer LDUR X2, [X10, #8] // g into X2 LDUR X3, [X10, #16] // h into X3 LDUR X4, [X10, #24] // i into X4 LDUR X5, [X10, #32] // j into X5 ADD X6, X2, X3 ADD X7, X4, X5 SUB X1, X6, X7 STUR X1, [X10, #0] // X1 into f

eg. read data located in mem at address 240, take data and write it to register 7

LDUR X7, [X31, #240]

Week 2. May 18

logic blocks

review electricity: V = IR

digitizing

discrete signal

value + | 1 +-------+ +----------+ | | | | | | | | | | 0 | | | +---------+ +-----------+ | +----------------------------------------+ time
A +--------+ +---> | | F B | +--------> +---> | | C | | +---> +--------+

specifying input/output behaviors

truth tables

OR, AND, NOT, NOR, NAND, XOR:

A B A+B AB A\overline{A} A+B\overline{A+B} AB\overline{AB} A^B
0 0 0 0 1 1 1 0
0 1 1 0 0 1 1
1 0 1 0 0 0 1 1
1 1 1 1 0 0 0

eg. conclude a formula from a truth table ((A,B,C)->F)
look at F's that have 1, use the product of A,B,C,~A,~B,~C to assemble 1

A B C F ABC\overline{A}\cdot\overline{B}C ABCA\overline{B}C ABCABC ABC+ABC+ABC\overline{A}\cdot\overline{B}C+A\overline{B}C+ABC
0 0 0 0 0
0 0 1 1 1 1
0 1 0 0 0
0 1 1 0 0
1 0 0 0 0
1 0 1 1 1 1
1 1 0 0 0
1 1 1 1 1 1

(empty cells are 0)
we can say the formula that represents blackbox (A,B,C)->F is ABC+ABC+ABC\overline{A}\cdot\overline{B}C+A\overline{B}C+ABC.

don't cares in truth tables

eg.

A B C F
... ... ... ...
1 0 1 1
1 1 0 1
1 1 1 1

simplified:

A B C F
... ... ... ...
1 0 1 1
1 1 X 1

eg.

A B C F ABC\overline{A}\cdot\overline{B}C ACAC ABC+AC\overline{A}\cdot\overline{B}C+AC
0 0 0 0 0
0 0 1 1 1 1
0 1 X 0 0
1 X 0 0 0
1 X 1 1 1 1

laws of boolean algebra

X=X\overline{\overline{X}}=X
X+0=XX+0=X X1=XX\cdot1=X identity
X+1=1X+1=1 X0=0X\cdot 0=0 zero/one
X+X=XX+X=X XX=XXX=X absorption
X+X=1X+\overline{X}=1 XX=0X\overline{X}=0 inverse
X+Y=Y+XX+Y=Y+X XY=YXXY=YX commutative
X+(Y+Z)=(X+Y)+ZX+(Y+Z)=(X+Y)+Z X(YZ)=(XY)ZX(YZ)=(XY)Z associative
X(Y+Z)=XY+XZX(Y+Z)=XY+XZ X+YZ=(X+Y)(X+Z)X+YZ=(X+Y)(X+Z) distributive
X+Y=XY\overline{X+Y}=\overline{X}\cdot\overline{Y} XY=X+Y\overline{XY}=\overline{X}+\overline{Y} DM

eg. simplify a formula

F=XˉYˉZ+XYˉZ+XYZˉ+XYZ=YˉZ(Xˉ+X)+XY(Zˉ+Z)=YˉZ(1)+XY(1)=YˉZ+XY\begin{aligned} F &=\bar{X} \bar{Y} Z+X \bar{Y} Z+X Y \bar{Z}+X Y Z \\ &=\bar{Y} Z(\bar{X}+X)+X Y(\bar{Z}+Z) \\ &=\bar{Y} Z(1)+X Y(1) \\ &=\bar{Y}Z+XY \end{aligned}

q2: C

gates

img

img

eg.
nice clean circuit: straight lines
img

we have F=XYZ+XYZ+XYZ\mathrm{F}=\mathrm{X} \overline{\mathrm{Y}}\cdot \overline{\mathrm{Z}}+\mathrm{XY} \overline{\mathrm{Z}}+\overline{\mathrm{X}} \mathrm{YZ}

eg. deriving truth table from circuit

xor

X Y A B C F
1 1 0 1 1 0
1 0 1 0 1 1
0 1 1 1 0 1
0 0 1 1 1 0

transistors

implementing gates with transistors

eg.
img

NMOS as Not gate

img

input resistance Q1 F
0 high 1
1 low 0

PMOS

img

input resistance Q1 F
0 low 0
1 high 1

A=1 is 5V, A=0 is 0.8V.

CMOS (complementary MOS transistor)

img

A Q1 Q2 F
0 low high 1
1 high low 0

eg. what will the signal be at B?
the only way B is 0 is both Q1,Q2 are low, which is not possible in the configuration. so B is always 1.
in other cases B and F are in parallel, so voltages are equal

eg. what will the signal be at C?
C is always 0 because it is always grounded.

CMOS NAND

img

A B Q1 Q2 Q3 Q4 Z
0 0 low low high high 1
0 1 low high low high 1
1 0 high low low high 1
1 1 high high low low 0
h h h h float bad state
l l l l short circuit bad state

eg. which resistances must be low to have a short circuit?
one of Q1,Q2, both of Q3,Q4

eg. which resistances must be high to have a float
both of Q1,Q2, one of Q3,Q4

2-input AND

img

3-input NAND

img

Week 3. May 25

decoders

eg. 3-to-8 (or 3-bit) decoder

dec A2 A1 A0 D7 D6 D5 D4 D3 D2 D1 D0
0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0 1 0
2 0 1 0 0 0 0 0 0 1 0 0
3 0 1 1 0 0 0 0 1 0 0 0
4 1 0 0 0 0 0 1 0 0 0 0
5 1 0 1 0 0 1 0 0 0 0 0
6 1 1 0 0 1 0 0 0 0 0 0
7 1 1 1 1 0 0 0 0 0 0 0

we can draw the decoder using symbol
img

implementation using only AND gates
img

memory expansion

multiplexer

q1 2

eg. 4-1 multiplexer
img

implementation:
img

or

img

eg. realize truth table using one mux

X---X +X X |X X | + X Y Z |F Z +---+D0 | 0 0 0 |0 | | 0 0 1 |1 | | 0 1 0 |0 0 +---+D1 | 0 1 1 |0 | +------+F 1 0 1 |0 | | 1 0 0 |1 ~Z +---+D2 | 1 1 0 |1 | | 1 1 1 |1 | | 1 +---+D3 | | X (when X=Y=0, F is decided +X X by D0, we see Z behaves X ---X the same as F...) + + when X=0,Y=1,... | | + + S1 S0 X Y

array

img

q1 d

clocks and sequential circuits

rise fall 1 +-------+ +-------+ | | | | 0 +-----+ +-------+ +------+ <---------------> clock cycle

SR latch

img

latch store S=0, R=0. for any previous value (Q, ~Q), the outputs (Q, ~Q) don't change

latch set S=1, R=0. output becomes (Q, ~Q) = (1, 0)

latch reset S=0, R=1. output becomes (Q, ~Q) = (0, 1)

once set or reset completes, the system will return to steady state S=0, R=0. so new value stored.

latch oscillate S=1, R=1. undesirable behavior. should be avoided.

D-latch

img

C D next state of Q
0 x no change
1 0 Q=0 (reset)
1 1 Q=1 (set)

img

D Flip-Flop

img

img

Week 4. June 1

register file

register: an array of flip-flops

register file: a way to organize registers

5bit +---------------------------------+ +-------> read read data | 64bit | register #1 output #1 +---------> 5 | | +-------> read | | register #2 | 5 | | +-------> write | | register # | 64 | read data | 64 +-------> write output #2 +---------> | data | | write | +--------------^------------------+ |control bit | +

control bit is 1 if one wants to write stuff (add, ldur), it is 0 if not (b, cbz).

write operation

img

note the number of register does not depend on architecture.

eg. 64bit architecture, 16 register in total. then

the only difference is the size of decoder (input/output).

read operation

we read two registers in parallel.

img

we never do both read and write at the same time

finite state machine (Moore)

eg. traffic lights

+ N + | | | | +-----+ NS light+-----+ +--------+ +-------+ || W || S || +--------+ +-------+ | |SW light | | | | + S +

img

one bit is used to indicate green/red.

eg. extend the traffic light to handle yellow light

img

state transition table:

current state inputs next state
S2 S1 S0 NScar EWcar TS S2' S1' S0'
000 XX0 000
000 XX1 001
001 X0X 001
001 X1X 010
010 XXX 100
011 XXX XXX don't care
100 XX0 100
100 XX1 101
101 0XX 101
101 1XX 110
110 XXX 000
111 XXX XXX don't care

use the min terms (output bit to be 1), we can conclude formula for each bit

output table:

S2 S1 S0 NSg NSy NSr EWg EWy EWr TR
000 1000010 NSg,EWr
001 1000010 NSg,EWr
010 0100011 NSy,EWr,TR
011 XXXXXXX
100 0011000 NSr,EWg
101 0011000 NSr,EWg
110 0010101 NSr,EWy,TR
111 XXXXXXX

eg. can draw a circuit for one state and one output of them

img

(triangle means clocked)

Week 5. June 8

armv8 word:

ASCII

...

unsigned and signed binary numbers

see cs241 note

eg. with 4 bits, 1110 is -6. what's this number with 8 bits?
copy old MSB to the new positions, answer is 1111 1110.

logic operations

1101
arithmetic shift right 1110
logical shift right 0110
shift left 1010
circular shift right 1110 use overflown bit for void space
circular shift left 1011
bitwise and, or, xor, nor

full adder

img

A B CarryIn CarryOut Sum
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1

4-bit ripple-carry adder

img

arithmetic logic unit

1-bit ALU

img

operation is the input to the mux (0,1,2,3)

ALU operation function
Ainv Binv Op1 Op0
0000 AND
0001 OR
0010 add
0110 sub
1100 NOR
0011 Pass result=B

note Binv is connected to the first CarryIn

eg. to compute subtraction a-b=a+(-b), we invert b, add to a, then add 1.

eg. to compute A NOR B, we invert both a and b, and perform A AND B. (since A+B=AB\overline{A+B}=\overline{A}\cdot\overline{B})

eg. it can also do NAND.

64-bit ALU

img

eg. overflow happens only if A, B have same sign and the result has different sign, iff the final sign bit's CarryIn and CarryOut are different.
if two operands have different sign, the final CarryOut is ignored if there is.

we use the following notation to represent ALU:

img

multiplication

procedure for doing 64bit multiplication -> result is 128 bit:

+-----+ |start| +--+--+ | | <--------------------+ v | Multiplier0=1 +----+-----+ Multiplier0=0 | +------+ |1.test |-----+ | | |Multiplier0 | | | +----------+ | | v | | +----+-------+ | | |1a.add | | | |multiplicand| | | |to product; | | | |place result| | | |to Product +---+ <---+ | |register | | | | +------------+ | | | v v | +----+---------+-----+ | |2.shift multiplicand| | |register left 1 bit | | +---------+----------+ | | | +---------v-----------+ | |2.shift multiplier | | |register right 1 bit | | +--------+------------+ | | | +-----v-----+ | | 64 reps? +--------------+ | | No +-----------+ |Yes v +--+-+ |done| +----+

the hardware:
img

floating point number

to represent a fractional number

eg. convert a binary point number to decimal

10.10101 -> 1*2^1 + 0*2^0 + 1*2^-1 + 0*2^-2 1*2^-3 + +0*2^-4 + 1*2^-5 -> 2.65625

Floating Point registers

IEEE 754

31 30 22 0 +---+-------------+-----------------------------+ | S | exponent | significand | +---+-------------+-----------------------------+ 1bit 8bits 23bits
  1. normalized numbers: always assume there exists a leading 1 in the number (save one bit). the only exception is when the number is 0
  2. biased exponent: x-127 (x is the value in the exponent field)
  3. sign: 1 means negative, 0 means positive

interpreted value from the hardware is

(1)Sign×(1+significand)×2exponent - bias(-1)^\text{Sign}\times(1+\text{significand})\times 2^\text{exponent - bias}

where bias=127 for single precision.

eg. i want an exponent of -50: 2^-50. what is my biased 8 bit representation?
77-127=-50 so it is 77.

eg. i want an exponent of -50: 2^50. what is my biased 8 bit representation?
177-127=50 so it is 177.

eg. convert 0.625 decimal to binary
multiply fraction by 2 repeatedly:

0.625*2=1.25 keep 1 as first digit 0.1
0.25*2=0.5 keep 0 as next digit 0.10
0.5*2=1.0 keep 1 as next digit 0.101

rest digit is 0 so it is done. the result is precisely 0.101.

eg. convert 2.625 to binary
2 is 10 as binary,
.625 is .101 as binary.
so 2.625 is 10.101 * 2^0 as binary.
need to normalize, so it should be 1.0101 * 2^1.
SIGN BIT: 0 (positive)
exponent: x-bias should give 1+127=128
significant bits: 0101
hence answer is 0 10000000 01010000000000000000000.

eg. convert 0.1 to floating point number

0.1*2=0.2 keep 0 as first digit 0.0
0.2*2=0.4 keep 0 as next digit 0.00
0.4*2=0.8 keep 0 as next digit 0.000
0.8*2=1.6 keep 1 as next digit 0.0001
0.6*2=1.2 keep 1 as next digit 0.00011
0.2*2=0.4 repeat

repeating pattern is 0.000110.0\overline{0011} which is infinite. have to truncate.

eg. what is binary reading from 0 01111111 10101000000000000000000
exponent bit is 01111111 -> 127, so real exponent is 127-127=0. the binary number is 1.10101 * 2^0.

eg. overflow is detected using exponent. 1.010101010101010101010101010101010101010101 * 2^-129 has overflow, but that is determined by "-129" not the precision bits (which are truncated anyways).

addition

multiplication

Week 6. June 15

memory hierarchy

+---------------------------------------+ +------+ | +------+ +-----------------+ +-----+ | | | | | | | | | | | | DRAM | | | SRAM | | REGISTER FILE | |SRAM | | | | | | | | | | | | | | DISK | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +------+ +-----------------+ +-----+ | | | +---------------------------------------+ +------+

static random access memory (SRAM)

21bits address +----> +-------------+ | | | | chip select +----> | | | SRAM | 16 output +----> | 2M x 16 +------>Dout[15-0] | | write enable+----> | | | | 16bits | | Din[15-0]+----> +-------------+

address is 21 bits and data is 16 bits in this eg.

three-state buffer

img

C X F
1 1 1 both low resistance, transmit as-is
1 0 1 both low resistance, transmit as-is
0 x - both high resistance, both transistors off

eg. 3-state buffer is depicted by triangle. use them to select a data

img

structure of SRAM

img

better SRAM uses min 6 transistors / bit, register file 46 transistors / bit.

dynamic RAM (DRAM)

img

by using two level encoding, we split the address, half go to the decoder to select word, half go to the mux to select a particular bit out of the word.

important:

single cycle data path

high level view

img

instruction formats

an instruction is specified by 32 bits

R-format: ADD X1, X2, X3 => ADD Rd, Rn, Rm opcode (31-21) Rm (20-16) shamt (15-10) Rn (9-5) Rd (4-0) I-format: ADDI X1, X2, #4 => ADDI Rd, Rn, immediate opcode (31-22) ALU immediate (21-10) (unsigned 12 bit) Rn (9-5) Rd (4-0) D-format: LDUR X1, [X2, #200] => LDUR Rd, [Rn, immediate] opcode (31-21) DT address (20-12) (signed 9 bit) op (11-10) (not used) Rn (9-5) Rd (4-0) B-format: B #3000 opcode (31-26) BR address (25-0) (signed 26 bit) CB-format: CBZ X1, #3000 => CBZ Rt, immediate opcode (31-24) COND BR address (23-5) (signed 19 bit) Rt (4-0)

opcodes

instruction opcode format
B 0001 01 B
ADD 1000 1011 000 R
ADDI 1001 0001 00 I
CBZ 1011 0100 CB
CBNZ 1011 0101 CB
SUB 1100 1011 000 R
SUBI 1101 0001 00 I
STUR 1111 1000 000 D
LDUR 1111 1000 010 D

full data path

img

walking through data path

Control bits:

type Reg2Loc ALUSrc MemToReg RegWrite MemRead MemWrite Branch ALUop1 ALUop0
R-format 0 0 0 1 0 0 0 1 0
LDUR x 1 1 1 1 0 0 0 0
STUR 1 1 x 0 0 1 0 0 0
CBZ 1 0 x 0 0 0 1 0 1

eg. ADD X1, X2, X3 is R-format

img

suppose current PC is 100. first these happen in parallel:

to the right:

to the right:

at the top:

ALUop has four choices: 00,01,10,11. if it is 10, that means instruction is R format. we will use function bit from instruction[31-21] to determine what to do on ALU. otherwise function bits are useless, we forward the ALUop to ALU (ie 00 -> add, 11 -> sub, 01 -> passb). in this case ADD is R-format, so funct bits are used.

eg. ADDI X1, X2, #4 is I-format
PC operations are same as above. first

to the right:

eg. LDUR X1, [X2, #400] is D-format

img

Week 7. June 22

eg. CBZ X1, #100 is CB-format

img

at the top:

at same time:

at the right:

single-cycle control

overview:

img

operation ALUop opcode ALU control input action
LDUR 00 0010 add
STUR 00 0010 add
CBZ 01 0011 passb
ADDI 00 0010 add
SUBI 11 0110 sub
ADD 10 100010 11000 0010 add
SUB 10 110010 11000 0110 sub
AND 10 100010 10000 0000 and
ORR 10 101010 10000 0001 or

timing

eg.

img

eg. now assume two top PC adders also take 200ps. timings stay the same because top two adders run in parallel with bottom.

eg. now assume Control mux takes 10ps. then R-formats takes extra 10ps (reads both reg, have to wait for mux); LDUR does not (it only reads one reg); STUR takes extra 10ps; CBZ takes extra 10ps (its reg goes to Mux)

modifying data path

eg. B #400

31-26 is opcode, 25-0 is BR address. also compute PC + 4*offset.

type Reg2Loc ALUSrc MemToReg RegWrite MemRead MemWrite Branch ALUop1 ALUop0 Uncond branch
B x x x 0 0 0 x x x 1

Uncond branch is 0 for other instructions.

img

needs to take 400ps in above settings.

memread, memwrite, regwrite must not be don't care

Week 8. June 29

multicycle datapath

img

pipelined datapath

eg. at the end of every clock cycle, PC is updated to PC+4 instantly

img

pipelined control

pipeline hazards

hazard: event that blocks normal flow of instructions through pipeline

data hazard

eg.
img

solution2: in SUB, X2 is computed in ALU in CC3, in AND, X2 is an input of ALU in CC4. we can forward the computed value directly.
img

forwarding datapath

img

eg. have two R-formats. if destination register of one instruction is the first source of the next instruction, then the condition is

  1. EX/MEM.RegisterRd = ID/EX.RegisterRn1
  2. EX/MEM.RegWrite = 1

where

eg.

ADD X2, X6, X5 AND X12, X2, X5

img

eg. load-use hazard: this does not work if the first instruction is LDUR X2, [...] since the value is not ready. we have to delay it (by inserting a NOP between LDUR and AND), and do the forwarding next stage

LDUR X2, [X5, #100] // NOP should be followed AND X12, X2, X5

img

eg. consider code

1. LDUR X1,[X2,#10] 2. LDUR X3,[X5,#10] // load-use hazard 3. ADD X3,X5,X3 4. STUR X3,[X5,#10] 5. LDUR X1,[X6,#10] // load-use hazard 6. ADD X3,X1,X4 7. STUR X4,[X3,#10] 8. ADDI X6,X4,X4

Week 9. July 6

conditions of forwarding

forwarding helps everything except LDUR.

load-use stalling

img

conditions

code rearrangement with data forwarding

eg. insert line 5 between 2 and 3 in the last eg to avoid NOP.

timing

eg. in the last eg, it takes 8 (each) + 2 (two load-use stalling) + 4 (start) = 14 cycles

control hazard (CBZ)

eg.
img

before the branching decision is made in the branch in mem stage, three instructions (in IF, ID, EX stages) already began, we might want to cancel them if the branch condition is met.

branch in ID stage

img

eg. what is the running time for

100 ADDI X1,XZR,#20 104 ADDI X2,XZR,#0 108 LDUR X3,[X4,#0] // loop 112 ADD X2,X2,X3 // hazard <- insert stalling 116 ADDI X4,X4,#8 120 SUBI X1,X1,#1 124 CBNZ X1,#-4 // every time this jumps, add flushing 128 SUB X5,X5,X2 132 LDUR X3,[X6,#100] 136 ADDI X29,XZR,#100

assuming forwarding, stalling, and branching in MEM

if there is no flushing, we have to insert 3 NOPs after line 124.

assuming branching in ID

if there is no flushing, we have to insert 1 NOP after line 124.
if we swap line 112 and 120, both data hazard in 112 and branch data hazard in 124 are also resolved -> time: 168-20(2)=128

a branch data hazard can exist in any pipelined datapath when forwarding is not present.

branch prediction

in the IF stage

img

eg. strategies:

100 ADDI X4,X31,#6 ... 120 SUBI X1,X1,#1 124 CBNZ X4,#-5 128 ADD X1,X2,X3

2-bit prediction

img

eg. for the above eg

performance

Week 10. July 13

memory hierarchies

principles of locality:

speed CPU size cost current technology
fastest memory smallest highest SRAM
memory DRAM
slowest memory biggest lowest magnetic disk

cache associativity

ignore L2, L3; imagine we only have L1 cache (instruction memory/cache, data memory) and main memory.

direct mapped cache

eg. one way associative: items map to one and only one location

when CPU needs memory access at address abcde, cde called lower order 3 bits, ab called tag bits.

img
V: valid bit: cache is meaningful or empty cache

eg. 2-way set associative cache: 2 rows

memory address being abcde, for cache, index bit is de, have two tag fields, each 3 bits (abc)

img
generally increase number of hits

eg. 4-way set associative cache: 2x4 configuration

eg. fully associative cache: 1x8 configuration, only one unique addr, 8 tag fields, 5 tag bits

eg. implementation of 4-way associative cache
img
note all 4 columns are compared against tag bits in parallel.

eg. as associativity (column) increase, lookup time remains the same but hardware increases, number of possible address in cache increases.

replacement scheme

eg.
img

larger block size

eg.
img

eg.
img
run time: 104*3 (misses) + 104*1 (writeback) + 6(instr), assuming no hazard

Week 11. July 20

virtual memory

img

page table

img

when program 1 loaded:

remark. all pages are always on disk.

page replacement policy:

translation look aside buffer (TLB)

img

if Ref bit is on, that means the address translation exists in both page table and TLB. if we kick out the page from RAM (thus from page table), we release both page table and TLB references (set valid to 0, if dirty is 1, write back). we also remove cache entry.

img

img

remark. there are two levels of dirty bits: page level and cache level. for STUR, we update cache, page, then to disk.

swapping between two processes