page: https://www.student.cs.uwaterloo.ca/~cs251/S20/index.shtml

Instructor: Rosina Kharal, Mohammad Mazaheri Kalahrody

Week 1. May 11
Week 2. May 18
Week 3. May 25
Week 4. June 1
Week 5. June 8
Week 6. June 15
Week 7. June 22
Week 8. June 29
Week 9. July 6
Week 10. July 13
Week 11. July 20

Week 1. May 11

Q1-Q12, A0-A6, T1-T3

defn. assembly language is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions.

                  compiler(cs241)     assembler(cs251)
high level language    ->    assembly    ->    binary machine code

ARM overview

computers execute assembly instructions
- in binary on computer, but text for people
only simple operations: add, sub, goto, conditional goto
instructions operate on two types of data:
- register: high speed access
- ram: slow to access

                                                                        +---------------------+
                                                                        |  main program       |
      +------------------------------------------------------+          |                     |
      | CPU              main processor                      |          |                     |
      |                                                      |          |                     |
      |    +----------+                      +-----------+   |          |+-------------------+|
      |    |          |                      |           |   |          ||my program code data|
      |    |          |                      |           |   |          ||                   ||
      |    |instruction  +---------------+   |data       |   |          ||000 ...            ||
      |    |memory    |  | register file |   |memory     |   |          ||008 ...            ||
      |    |          |  |X0  2.fetch    |   |           |   |          ||                   ||
      |    |1.all code|  |X1  LDUR and   |   |1.all data |   |          ||                   ||
      |    |is copied |  |    execute..  |   |is copied  |   |          ||                   ||
      |    |here:     |  |    4.fetch    |   |here       |   |          ||                   ||
      |    |          |  |   ADD and exec.   |           |   |          ||                   ||
      |    |000 LDUR ..  |   5.fetch STUR|   |200 arr[0] |   |          ||                   ||
      |    |004 ADD ..|  |   and exec    |   |           |   |          ||                   ||
      |    |          |  |               |   |3.data     |   |          ||                   ||
      |    |          |  |X31/XZR        |   |accessed   |   |          ||                   ||
      |    |          |  +---------------+   |here       |   |          ||                   ||
      |    |          |                      |6.data     |   |          ||                   ||
      |    |          |                      |stored here|   |          ||                   ||
      |    +----------+                      +-----------+   |          ||                   ||
      |                                                      |          |+-------------------+|
      +------------------------------------------------------+          |                     |
                                                                        |                     |
                                                                        +---------------------+

fetch-execute cycle:

register

each register holds max 64 bits. there are 32 registers: X0,X1,...,X31/XZR

eg. exec in register

// assume values are preloaded into X... registers
// X1: f, X2: g, X3: h, X4: i, X5: j

// f = (g + h) - (i + j)

// equivalent ARM:
//  dest lhs rhs
ADD X6, X2, X3  ;; X2, X3 not changed
ADD X7, X4, X5  ;; X6, X7 hold temp value
SUB X1, X6, X7  ;; X1 has the final value

// then put value in X1 to f

types of arm instructions:

R-format
- ADD X1, X2, X3 adds content of X2 to contents of X3; store result in X1
D-format
- LDUR X1, [X2, #20] load data from address X2+20 into X1
- STUR X1, [X2, #30] store data from X1 into address X2+30
I-format
- ADDI X1, X2, #100 adds immediate value 100 to contents of X2; store result in X1
B-format
- B #28 go to 28 instructions later
CB-format
- CBZ X1, #8 if X1 is 0, then go to 8 instructions later, else next instruction

there are instruction memory and data memory.

each instruction requires 32 bits (4 bytes)

eg. program in memory

//  PC      instruction
    000:    ADD X1,X2,X3
    004:    SUB X1,X3,X5
    008:    ADDI X2,X12,#16

first instruction line takes 32 bits to specify; start byte is 0; when i request byte 0, i also get 4 bytes in total (0-3)

branch

jump to x instructions later. add constant*4 to PC

100: B #3  // go to 100+(12)=112
104: ...
108: ...
112: ...

conditional branch `CBZ`

compare contents of register X1 to zero, if equal, add constant*4 and add to PC
if not, the instruction following branch is executed
constant can be negative
CBNZ similar but if X1 is zero
the branch target address ie PC += PC + offset*4

eg.

// multiply 5 six times
100: ADD X1, XZR, XZR   // XZR(X31) is 0. clear the content of X1
104: ADDI X2, XZR, #6   // store 6 to X2
108: ADDI X1, X1, #5    // X1 += 5
112: SUBI X2, X2, #1    // X2 -= 1
116: CBZ X2, #-2        // if X2 is 0, go to 116+(-8)=108, otherwise go to 120
120: ...

memory access

32 registers are not enough to store data
they are special arm instructions to access 2^64 byte RAM
M is data memory
100: LDUR X1, [X2,#100] read from M[100+X2] store in X1
100: STUR X1, [X2,#100] write X1 value to M[100+X2]
all addresses are multiple of 8

eg.

// f = (g + h) - (i + j)

// assume X10 contains the start address in mem of the data value f, and
// g, h, i, j are consecutive. each variable is 64-bit integer
LDUR X2, [X10, #8]  // g into X2
LDUR X3, [X10, #16] // h into X3
LDUR X4, [X10, #24] // i into X4
LDUR X5, [X10, #32] // j into X5

ADD X6, X2, X3
ADD X7, X4, X5
SUB X1, X6, X7

STUR X1, [X10, #0]  // X1 into f

eg. read data located in mem at address 240, take data and write it to register 7

LDUR X7, [X31, #240]

Week 2. May 18

logic blocks

combinations: without memory
sequential: with memory
inputs and outputs are 1/0 (high/low voltage, true/false)

review electricity: V = IR

digitizing

discrete signal

value
+
|  1
+-------+         +----------+
|       |         |          |
|       |         |          |
|       |   0     |          |
|       +---------+          +-----------+
|
+----------------------------------------+
                                   time

only high (1) or low (2)
DC: direct current flows in one direction through a circuit
in reality, the transformation from hi to lo could lead to intermediate values, which need to be designed out

 A    +--------+
+---> |        |   F
 B    |        +-------->
+---> |        |
 C    |        |
+---> +--------+

call these digital circuits, with inputs/outputs that can take only 1 of two possible values 0, 1
we analyze the 'work' or 'function' of a circuit based on the output it is generating. the output depend on inputs
First: circuit analysis via truth tables
second: look at how exactly a binary 0 and 1 are represented

specifying input/output behaviors

truth tables

OR, AND, NOT, NOR, NAND, XOR:

A	B	A+B	AB	$\overline{A}$	$\overline{A+B}$	$\overline{AB}$	A^B
0	0	0	0	1	1	1	0
0	1	1	0		0	1	1
1	0	1	0	0	0	1	1
1	1	1	1		0	0	0

eg. conclude a formula from a truth table ((A,B,C)->F)
look at F's that have 1, use the product of A,B,C,~A,~B,~C to assemble 1

A	B	C	F	$\overline{A}\cdot\overline{B}C$	$A\overline{B}C$	$ABC$	$\overline{A}\cdot\overline{B}C+A\overline{B}C+ABC$
0	0	0	0				0
0	0	1	1	1			1
0	1	0	0				0
0	1	1	0				0
1	0	0	0				0
1	0	1	1		1		1
1	1	0	0				0
1	1	1	1			1	1

(empty cells are 0)
we can say the formula that represents blackbox (A,B,C)->F is $\overline{A}\cdot\overline{B}C+A\overline{B}C+ABC$ .

don't cares in truth tables

those represented as X instead of 0/1
when used in output, indicates that we don't care the output for that input
when used in input, indicates outputs are valid for all inputs created by replacing X by either 0 or 1

eg.

A	B	C	F
...	...	...	...
1	0	1	1
1	1	0	1
1	1	1	1

simplified:

A	B	C	F
...	...	...	...
1	0	1	1
1	1	X	1

eg.

A	B	C	F	$\overline{A}\cdot\overline{B}C$	$AC$	$\overline{A}\cdot\overline{B}C+AC$
0	0	0	0			0
0	0	1	1	1		1
0	1	X	0			0
1	X	0	0			0
1	X	1	1		1	1

laws of boolean algebra


$\overline{\overline{X}}=X$
$X+0=X$	$X\cdot1=X$	identity
$X+1=1$	$X\cdot 0=0$	zero/one
$X+X=X$	$XX=X$	absorption
$X+\overline{X}=1$	$X\overline{X}=0$	inverse
$X+Y=Y+X$	$XY=YX$	commutative
$X+(Y+Z)=(X+Y)+Z$	$X(YZ)=(XY)Z$	associative
$X(Y+Z)=XY+XZ$	$X+YZ=(X+Y)(X+Z)$	distributive
$\overline{X+Y}=\overline{X}\cdot\overline{Y}$	$\overline{XY}=\overline{X}+\overline{Y}$	DM

eg. simplify a formula

$\begin{aligned} F &=\bar{X} \bar{Y} Z+X \bar{Y} Z+X Y \bar{Z}+X Y Z \\ &=\bar{Y} Z(\bar{X}+X)+X Y(\bar{Z}+Z) \\ &=\bar{Y} Z(1)+X Y(1) \\ &=\bar{Y}Z+XY \end{aligned}$

q2: C

gates

2-input AND, 2-input OR, 1-input Invertoer
each input is one bit wide
NOT operator often drawn as "bubble" on input or output

eg.
nice clean circuit: straight lines

we have $\mathrm{F}=\mathrm{X} \overline{\mathrm{Y}}\cdot \overline{\mathrm{Z}}+\mathrm{XY} \overline{\mathrm{Z}}+\overline{\mathrm{X}} \mathrm{YZ}$

eg. deriving truth table from circuit

first label all intermediate values (A,B,C)
A is OR gate, with two inputs (X,Y) negated.
B is NAND gate, with two inputs (X,A)
C is NAND gate, with two inputs (Y,A)
F is OR gate, with two inputs (B,C) negated

xor

X	Y	A	B	C	F
1	1	0	1	1	0
1	0	1	0	1	1
0	1	1	1	0	1
0	0	1	1	1	0

transistors

implementing gates with transistors

transistor: an electrically-controlled switch
NMOS transistor (n-transistor):
by control the voltage of gate, source, drain, we can control the current flow

eg.

assume a voltage 5V at drain
current wants to flow from Drain to Source (ground in this eg)
Gates acts as a barrier to this current
by increasing the voltage at gate, the barrier is removed and current flows
can model transistors as resistance
- high Vgate: low resistance -> current flows
- low Vgate: high resistance -> no current
rule: if there is direct path way from power to ground, then F is low

NMOS as Not gate

input	resistance Q1	F
0	high	1
1	low	0

arrow out from gate toward source
VDD is power source, G is gate, S is source (ground), F is output
A is low -> gate low voltage -> high resistance Q1 -> F is connected to VDD -> F is high
- current is blocked to ground, so all of it goes to F
A is high -> gate high voltage -> low resistance Q1 -> F is connected to ground -> F is low
- current is not blocked, so it can go to ground, a little (0.8V) current goes to F
in real world, there is a resistor between power and drain to prevent burning out

PMOS

input	resistance Q1	F
0	low	0
1	high	1

reverse of NMOS
arrow towards gate out from VDD

A=1 is 5V, A=0 is 0.8V.

NMOS is not perfect, it transmits strong 0, but weak 1 (like 4V)
PMOS transmits strong 1, but weak 0

CMOS (complementary MOS transistor)

A	Q1	Q2	F
0	low	high	1
1	high	low	0

CMOS uses both NMOS and PMOS (circle means PMOS), NMOS grounded, PMOS at power
Qs are inner resistances of each transistor
no bad flow of current (short circuit)
no weak transmissions
when A is 0, Q2 blocks current to the ground so it goes to F
when A is 1, Q2 is "ON", the internal current will be enough to feed ground

eg. what will the signal be at B?
the only way B is 0 is both Q1,Q2 are low, which is not possible in the configuration. so B is always 1.
in other cases B and F are in parallel, so voltages are equal

eg. what will the signal be at C?
C is always 0 because it is always grounded.

CMOS NAND

A	B	Q1	Q2	Q3	Q4	Z
0	0	low	low	high	high	1
0	1	low	high	low	high	1
1	0	high	low	low	high	1
1	1	high	high	low	low	0
		h	h	h	h	float	bad state
		l	l	l	l	short circuit	bad state

NAND uses 4 transistors
Q3 and Q4 are in series, if one of them has high Q, then power->ground is blocked
if Q3 and Q4 are both low, they are "ON", generating enough current to feed ground

eg. which resistances must be low to have a short circuit?
one of Q1,Q2, both of Q3,Q4

eg. which resistances must be high to have a float
both of Q1,Q2, one of Q3,Q4

2-input AND

add one NOT gate
uses 2 more transistors

3-input NAND

Week 3. May 25

decoders

translate the n-bit input into a signal that corresponds to the binary value of the n-bit input
$n$ inputs map to $2^n$ outputs possible
only one input is high for each input combination

eg. 3-to-8 (or 3-bit) decoder

dec	A2	A1	A0	D7	D6	D5	D4	D3	D2	D1	D0
0	0	0	0	0	0	0	0	0	0	0	1
1	0	0	1	0	0	0	0	0	0	1	0
2	0	1	0	0	0	0	0	0	1	0	0
3	0	1	1	0	0	0	0	1	0	0	0
4	1	0	0	0	0	0	1	0	0	0	0
5	1	0	1	0	0	1	0	0	0	0	0
6	1	1	0	0	1	0	0	0	0	0	0
7	1	1	1	1	0	0	0	0	0	0	0

we can draw the decoder using symbol

implementation using only AND gates

used three switches to respond to 8 outputs

memory expansion

multiplexer

also called a selector
choose one input from $2^n$ inputs and reflect it on the output
input: $2^n$ lines (D0,...,D2n-1), $n$ select lines (Sn-1,...,S0)
output: one single output based on one source of input data

q1 2

eg. 4-1 multiplexer

there are 4 lines to select from D0,D1,D2,D3
there are 2 select lines S0,S1
if S0,S1 are both 0, then 00 means zero, first line is selected
note S0 is always the lower order bit

implementation:

if S0,S1 are both 0, then only the first gate has both 1, all other three are short circuit. the result of Y is decided by D0.

eg. realize truth table using one mux

         X---X
       +X     X
       |X     X
       |      +             X  Y  Z |F
 Z +---+D0    |             0  0  0 |0
       |      |             0  0  1 |1
       |      |             0  1  0 |0
 0 +---+D1    |             0  1  1 |0
       |      +------+F     1  0  1 |0
       |      |             1  0  0 |1
~Z +---+D2    |             1  1  0 |1
       |      |             1  1  1 |1
       |      |
 1 +---+D3    |
       |      X        (when X=Y=0, F is decided
       +X     X         by D0, we see Z behaves
        X ---X          the same as F...)
         +   +          when X=0,Y=1,...
         |   |
         +   +
        S1   S0
         X   Y

array

slash notation is used to indicate lines carrying multiple bits, implying parallel constructions
externally looks like 1 mux that takes 2 inputs and select 1
note only one bit is needed to selected between A and B in this eg
output bits C0,...,C63 combine to result in C

q1 d

clocks and sequential circuits

two types: synchronous (has a clock memory changes only at discrete points of time)
asynchronous: no clock potentially faster and less power-hungary
clocks: controlled input signal into the system
clock is high = 1, clock is low = 0
from one rising edge the next rising edge is a cycle

   rise       fall        1
      +-------+       +-------+
      |       |       |       |   0
+-----+       +-------+       +------+

      <--------------->
         clock cycle

SR latch

simplest block of memory
a block to store a binary 0 or 1
Set means set Q to 1
Reset means set Q to 0
is not clocked

unclocked memory bit: 1 bit recycles through gates

latch store S=0, R=0. for any previous value (Q, ~Q), the outputs (Q, ~Q) don't change

latch set S=1, R=0. output becomes (Q, ~Q) = (1, 0)

latch reset S=0, R=1. output becomes (Q, ~Q) = (0, 1)

once set or reset completes, the system will return to steady state S=0, R=0. so new value stored.

latch oscillate S=1, R=1. undesirable behavior. should be avoided.

when S=R=1, the output becomes (Q, ~Q) = (0, 0)
but if then change to S=R=0, the output becomes (1, 1) and then to (0, 0), ...

D-latch

add a clock pulse
use one single input source D
controls the inputs to the S and R such that they can never both be 1

C	D	next state of Q
0	x	no change
1	0	Q=0 (reset)
1	1	Q=1 (set)

only can set/reset when clock is 1, otherwise Q stays same (red, 1st and 2nd vertical line)
drawback: any change in D during high clock pulse will be reflected in Q (green), no longer uniform (discrete point in time)
22 transistors needed in a d-latch (2+6+6+4+4)

D Flip-Flop

state only to be affected at discrete points in time
master (left) slave (right) design
QE related to original clock: update occurs on falling edge of clock pulse only

d flip flops used for each bit of each register in the register file
64-bit register needs 64 d-flip flops
useful for small memory eg RF ^

Week 4. June 1

register file

stuff is computed during a clock cycle, after it finishes stuff is written on the falling edge of a clock cycle

max number of registers to read from is 2 (add x1,x2,x3) (two 5bit numbers to specify because we have 31 registers)
number of registers to write to is 1 (one 5bit number to specify)

   5bit +---------------------------------+
+-------> read                  read data |   64bit
        | register #1           output #1 +--------->
   5    |                                 |
+-------> read                            |
        | register #2                     |
   5    |                                 |
+-------> write                           |
        | register #                      |
   64   |                       read data |   64
+-------> write                 output #2 +--------->
        | data                            |
        |            write                |
        +--------------^------------------+
                       |control bit
                       |
                       +

control bit is 1 if one wants to write stuff (add, ldur), it is 0 if not (b, cbz).

write operation

data goes to every register (D) even not all register gets modified
the write (control) bit and the output of the decoder decides whether register is enabled, which one is enabled
control bit (C) is only true for one register being chosen
clock is external, not in diagram!

note the number of register does not depend on architecture.

eg. 64bit architecture, 16 register in total. then

each reg still stores 64 bits
64 bits of write data are provided
1 bit is for the write bit.

the only difference is the size of decoder (input/output).

read operation

we read two registers in parallel.

throw out all the contents of registers, and fed them into two muxes
the select line (5bits), which are the bits to specify regs, chooses one data to output
(note that the two big muxes are actually series of 32x1 1bit muxes)

we never do both read and write at the same time

finite state machine (Moore)

eg. traffic lights

         +  N  +
         |     |
         |     |
         +-----+
 NS light+-----+
+--------+     +-------+
                ||
W               ||     S
                ||
+--------+     +-------+
         |     |SW light
         |     |
         |     |
         +  S  +

NSlight, EWlight are states of the two traffic lights
state is one of red, green
EWcar means there are cars detected in east-west direction
- then EW light is green (EWgreen)
NScar means there are cars in north-south direction
- then NS light is green (NSgreen)

one bit is used to indicate green/red.

names of states outside ovals (NSgreen, EWgreen)
output in given state inside oval
transition arc labelled with boolean formula of inputs (bar indicates not)
must stay in one state for at least one clock cycle

eg. extend the traffic light to handle yellow light

g means green, r means red, y means yellow
TS indicates whether 28 secs have passed (as input) (green must stay over 28 secs)
- so 28 is multiple of one clock cycle
TR bit means to reset the above counter (as output)

state transition table:

current state	inputs	next state
S2 S1 S0	NScar EWcar TS	S2' S1' S0'
000	XX0	000
000	XX1	001
001	X0X	001
001	X1X	010
010	XXX	100
011	XXX	XXX	don't care
100	XX0	100
100	XX1	101
101	0XX	101
101	1XX	110
110	XXX	000
111	XXX	XXX	don't care

X bit means don't care
011 and 111 never happen

use the min terms (output bit to be 1), we can conclude formula for each bit

$S_0'=\overline{S_1}\cdot\overline{S_0}\cdot TS+\overline{S_2}\cdot\overline{S_1}\cdot S_0\cdot \overline{EWcar}+S_2\cdot\overline{S_1}\cdot S_0\cdot\overline{NScar}$
$S_1'=\overline{S_2}\cdot\overline{S_1}\cdot S_0\cdot EWcar+S_2\cdot\overline{S_1}\cdot S_0\cdot NScar$
$S_2'=\overline{S_2}\cdot S_1\cdot\overline{S_0}+S_2\cdot\overline{S_1}$

output table:

only care about current state and its corresponding output
(output is not related to input)
if bit is present, it is 1

S2 S1 S0	NSg NSy NSr EWg EWy EWr TR
000	1000010	NSg,EWr
001	1000010	NSg,EWr
010	0100011	NSy,EWr,TR
011	XXXXXXX
100	0011000	NSr,EWg
101	0011000	NSr,EWg
110	0010101	NSr,EWy,TR
111	XXXXXXX

$NSg=\overline{S_2}\cdot\overline{S_1}$
$EWg=S_2\cdot\overline{S_1}$
$NSy=\overline{S_2}\cdot S_1\cdot\overline{S_0}$
$EWy=S_2\cdot S_1\cdot\overline{S_0}$
$NSr=S_2$
$EWr=\overline{S_2}$
$TR=S_1\cdot\overline{S_0}$

eg. can draw a circuit for one state and one output of them

(triangle means clocked)

Week 5. June 8

armv8 word:

1 byte = 8 bits; 4 bytes = 1 word; 8 bytes = 1 dword
bits numbered 63,62,...,0
MSB is 63, LSB is 0

ASCII

...

7 bits for character code, 1 leftmost bit for checking
leading 1 -> even number of 1's in the rest 7 bits eg 1 0011011
leading 0 -> odd

unsigned and signed binary numbers

see cs241 note

eg. with 4 bits, 1110 is -6. what's this number with 8 bits?
copy old MSB to the new positions, answer is 1111 1110.

logic operations


	1101
arithmetic shift right	1110
logical shift right	0110
shift left	1010
circular shift right	1110	use overflown bit for void space
circular shift left	1011
bitwise and, or, xor, nor

full adder

basic building block of addition circuit

A	B	CarryIn	CarryOut	Sum
0	0	0	0	0
0	0	1	0	1
0	1	0	0	1
0	1	1	1	0
1	0	0	0	1
1	0	1	1	0
1	1	0	1	0
1	1	1	1	1

$S=A \text{ xor } B\text{ xor }C_\text{in}$
$C_\text{out}=AB+AC_{\text{in}}+BC_\text{in}$

4-bit ripple-carry adder

connect carries
S3S2S1S0 is the output
drawback: slow, "carry-lookahead" idea can be used

arithmetic logic unit

1-bit ALU

extends functionality of full adder
input is 1 bit
can compute AND, OR and add

operation is the input to the mux (0,1,2,3)

ALU operation	function
Ainv Binv Op1 Op0
0000	AND
0001	OR
0010	add
0110	sub
1100	NOR
0011	Pass	result=B

note Binv is connected to the first CarryIn

eg. to compute subtraction a-b=a+(-b), we invert b, add to a, then add 1.

eg. to compute A NOR B, we invert both a and b, and perform A AND B. (since $\overline{A+B}=\overline{A}\cdot\overline{B}$ )

eg. it can also do NAND.

64-bit ALU

we can chain 64 of these to support 64 bit arithmetic:
given two n bit input, the result is n bit

eg. overflow happens only if A, B have same sign and the result has different sign, iff the final sign bit's CarryIn and CarryOut are different.
if two operands have different sign, the final CarryOut is ignored if there is.

we use the following notation to represent ALU:

multiplication

input: two n bit numbers, output: 2n bit number
multiplicand * multiplier = result

procedure for doing 64bit multiplication -> result is 128 bit:

                +-----+
                |start|
                +--+--+
                   |
                   | <--------------------+
                   v                      |
Multiplier0=1 +----+-----+  Multiplier0=0 |
     +------+ |1.test    |-----+          |
     |        |Multiplier0     |          |
     |        +----------+     |          |
     v                         |          |
+----+-------+                 |          |
|1a.add      |                 |          |
|multiplicand|                 |          |
|to product; |                 |          |
|place result|                 |          |
|to Product  +---+         <---+          |
|register    |   |         |              |
+------------+   |         |              |
                 v         v              |
            +----+---------+-----+        |
            |2.shift multiplicand|        |
            |register left 1 bit |        |
            +---------+----------+        |
                      |                   |
            +---------v-----------+       |
            |2.shift multiplier   |       |
            |register right 1 bit |       |
            +--------+------------+       |
                     |                    |
               +-----v-----+              |
               |  64 reps? +--------------+
               |           |   No
               +-----------+
                     |Yes
                     v
                  +--+-+
                  |done|
                  +----+

the hardware:

floating point number

to represent a fractional number

uses idea of scientific notation: fraction, mantissa, exponent
normalized: single digit to left of decimal point
use 2 as base
eg: $1.01_2\times 2^4$ (normalized binary in scientific notation)
in normalized binary, leading digit of significant is always 1

eg. convert a binary point number to decimal

10.10101   -> 1*2^1 + 0*2^0 + 1*2^-1 + 0*2^-2 1*2^-3 + +0*2^-4 + 1*2^-5
           -> 2.65625

Floating Point registers

separate stack of FP
each FP register is 32 bits for single precision. double precision uses 64 bits or 2 consecutive FP registers

IEEE 754

arm uses the IEEE 754 floating point standard format

31   30            22                           0
+---+-------------+-----------------------------+
| S |  exponent   |       significand           |
+---+-------------+-----------------------------+
1bit    8bits                23bits

normalized numbers: always assume there exists a leading 1 in the number (save one bit). the only exception is when the number is 0
biased exponent: x-127 (x is the value in the exponent field)
sign: 1 means negative, 0 means positive

interpreted value from the hardware is

$(-1)^\text{Sign}\times(1+\text{significand})\times 2^\text{exponent - bias}$

where bias=127 for single precision.

exponent is stored in biased notation: so 8 bit exponent field is a positive value. it is easier to
special cases: exponents of 0000 0000 and 1111 1111 are reserved
range of exponents: -126 (0000 0001) to +127 (1111 1110)

eg. i want an exponent of -50: 2^-50. what is my biased 8 bit representation?
77-127=-50 so it is 77.

eg. i want an exponent of -50: 2^50. what is my biased 8 bit representation?
177-127=50 so it is 177.

eg. convert 0.625 decimal to binary
multiply fraction by 2 repeatedly:


0.625*2=1.25	keep 1 as first digit	0.1
0.25*2=0.5	keep 0 as next digit	0.10
0.5*2=1.0	keep 1 as next digit	0.101

rest digit is 0 so it is done. the result is precisely 0.101.

eg. convert 2.625 to binary
2 is 10 as binary,
.625 is .101 as binary.
so 2.625 is 10.101 * 2^0 as binary.
need to normalize, so it should be 1.0101 * 2^1.
SIGN BIT: 0 (positive)
exponent: x-bias should give 1+127=128
significant bits: 0101
hence answer is 0 10000000 01010000000000000000000.

eg. convert 0.1 to floating point number


0.1*2=0.2	keep 0 as first digit	0.0
0.2*2=0.4	keep 0 as next digit	0.00
0.4*2=0.8	keep 0 as next digit	0.000
0.8*2=1.6	keep 1 as next digit	0.0001
0.6*2=1.2	keep 1 as next digit	0.00011
0.2*2=0.4	repeat

repeating pattern is $0.0\overline{0011}$ which is infinite. have to truncate.

eg. what is binary reading from 0 01111111 10101000000000000000000
exponent bit is 01111111 -> 127, so real exponent is 127-127=0. the binary number is 1.10101 * 2^0.

eg. overflow is detected using exponent. 1.010101010101010101010101010101010101010101 * 2^-129 has overflow, but that is determined by "-129" not the precision bits (which are truncated anyways).

addition

decimal example: 9.54*10^2 + 6.83*10^1
match exponents: 9.54*10^2 + 0.683*10^2
add significand with sign: 10.223*10^2
normalize: 1.0223*10^3
check for exponent overflow
round: 1.02*10^3
may have to normalize again

multiplication

decimal example: 9.54*10^2 * 6.83*10^1
add exponents: 2+1=3
multiply significands: 9.54*6.83=65.1582
result: 65.1582*10^3
normalize: 6.51582*10^4
check overflow
round: 6.52*10^4
may have to normalize again
set sign
same for binary number

Week 6. June 15

memory hierarchy

processor works with memory that is closest to it
- register file, instruction & data mem (cache)
- instructions and data stored in local caches close to the processor
main levels: caches, main memory (RAM), hard drive (Disk)

+---------------------------------------+   +------+
| +------+ +-----------------+  +-----+ |   |      |
| |      | |                 |  |     | |   | DRAM |
| | SRAM | | REGISTER FILE   |  |SRAM | |   |      |
| |      | |                 |  |     | |   |      | DISK
| |      | |                 |  |     | |   |      |
| |      | |                 |  |     | |   |      |
| |      | |                 |  |     | |   |      |
| +------+ +-----------------+  +-----+ |   |      |
+---------------------------------------+   +------+

static random access memory (SRAM)

uses D latches
register file worked for smaller memory and would not scale up (decoders and mux too big). we use three-state buffers to fix mux problems and two-level encoding to fix decoder problem
updates to memory occur in one clock cycle

            21bits
    address +----> +-------------+
                   |             |
                   |             |
chip select +----> |             |
                   |   SRAM      | 16
    output  +----> |   2M x 16   +------>Dout[15-0]
                   |             |
write enable+----> |             |
                   |             |
            16bits |             |
   Din[15-0]+----> +-------------+

address is 21 bits and data is 16 bits in this eg.

three-state buffer

we have multiple of a bit cell
all data from memory is transmitted back to the process on the same path (transmission bus), so must have the ability of allow an address but block all others

has three outputs 0, 1, floating (connected to neither power or ground)
uses a PMOS and a NMOS

C	X	F
1	1	1	both low resistance, transmit as-is
1	0	1	both low resistance, transmit as-is
0	x	-	both high resistance, both transistors off

eg. 3-state buffer is depicted by triangle. use them to select a data

behaves like mux, but has 4 outputs, 4 select bits
allows multiple inputs to be tied to one source, but one and only one input is allowed to continue as output, or short-circuit.

structure of SRAM

in this eg, word size is 2 bits (2 columns), there are 4 addresses (word lines)
the "enable" bit is connected to 3-state buffer as c

better SRAM uses min 6 transistors / bit, register file 46 transistors / bit.

dynamic RAM (DRAM)

SRAM still uses a lot of transistors
use a capacitor to store a charge to represent 1, no charge to represent 0
use 1 capacitor to store data. use 1 transistor to connect to it r/w bit
must refresh charge. refresh is handled a row a time (external controller). fraction of timed to refresh is max 4% or 1-2% of active clock cycles. not heavy penalty.
SDRAM: synchronized DRAM (with external clock)

by using two level encoding, we split the address, half go to the decoder to select word, half go to the mux to select a particular bit out of the word.

important:

SRAM: expensive but fast. used in caches
DRAM: cheaper but slower. always need to refresh the capacitors. a read from a capacitor will dissipate the charge. for read, need to read the charge and write it back
DRAM: main memory typically implements DRAM, SDRAM, SDDRAM etc

single cycle data path

high level view

at beginning of a cycle, PC split: one current PC goes to right, one is the computed PC+4 (top); the computations of PC+4 and the execution of current instruction happen in parallel; when a cycle finishes, it will decide whether the next instruction is PC+4, or branch target address (that is computed using CBZ).
- actual PC is not updated until the end
- the top right Adder is for computing target address
- PC is stored in a register (64 bits) with d flip flop
for like ADD, two registers X2,X3 go to ALU, result goes back to register file as data, the specified register X1 is written.
ALU has 4 select bit. two are op, two are A,B-invert
i am allowed to do read/write at the same time in register file

instruction formats

an instruction is specified by 32 bits

R-format:   ADD X1, X2, X3      =>   ADD Rd, Rn, Rm
    opcode (31-21)
    Rm (20-16)
    shamt (15-10)
    Rn (9-5)
    Rd (4-0)

I-format:   ADDI X1, X2, #4     =>   ADDI Rd, Rn, immediate
    opcode (31-22)
    ALU immediate (21-10) (unsigned 12 bit)
    Rn (9-5)
    Rd (4-0)

D-format:   LDUR X1, [X2, #200] => LDUR Rd, [Rn, immediate]
    opcode (31-21)
    DT address (20-12) (signed 9 bit)
    op (11-10) (not used)
    Rn (9-5)
    Rd (4-0)

B-format:   B #3000
    opcode (31-26)
    BR address (25-0) (signed 26 bit)

CB-format:  CBZ X1, #3000 => CBZ Rt, immediate
    opcode (31-24)
    COND BR address (23-5) (signed 19 bit)
    Rt (4-0)

all R-formats specify the instruction bits exactly the same. the only difference is the type of operation performed by ALU - a separate controller ALU control unit determines what operation
opcode contains a portion called funct bits, which only exist for R-formats

opcodes

instruction	opcode	format
B	0001 01	B
ADD	1000 1011 000	R
ADDI	1001 0001 00	I
CBZ	1011 0100	CB
CBNZ	1011 0101	CB
SUB	1100 1011 000	R
SUBI	1101 0001 00	I
STUR	1111 1000 000	D
LDUR	1111 1000 010	D

full data path

walking through data path

Control bits:

type	Reg2Loc	ALUSrc	MemToReg	RegWrite	MemRead	MemWrite	Branch	ALUop1	ALUop0
R-format	0	0	0	1	0	0	0	1	0
LDUR	x	1	1	1	1	0	0	0	0
STUR	1	1	x	0	0	1	0	0	0
CBZ	1	0	x	0	0	0	1	0	1

eg. ADD X1, X2, X3 is R-format

suppose current PC is 100. first these happen in parallel:

PC+4 is computed at the top adder
instruction[31-21] goes to control mux
- Reg2Loc (0) is chosen
- ALUOp is two bits 10 (below)
- note: Branch is 0
- note: ALUSrc is 0
- note: MemtoReg is 0
- RegWrite is 1
instruction[9-5] (X2) goes to Read register 1
- Read data 1 outputs
instruction[20-16] (X3) goes to Mux, 0 is chosen, so it goes to Read register 2
- Read data 2 outputs
instruction[4-0] (X1) goes to Write register; also goes to Mux but is not chosen
instruction[31-21] also goes to ALU control

to the right:

Read data 1 goes to ALU
Read data 2 goes to Mux, 0 is chosen because of ALUSrc, then goes to ALU
ALU performs operation according to ALU control.
- ALU result outputs

to the right:

ALU result goes to mem and Mux, since MemtoReg is 0, it will not read data from mem. 0 in Mux is chosen, it goes to register's Write data

at the top:

PC+4 goes to Mux. since Branch bit in Control is 0, mux selects 0, so PC+4 goes back to PC. PC becomes 104.

ALUop has four choices: 00,01,10,11. if it is 10, that means instruction is R format. we will use function bit from instruction[31-21] to determine what to do on ALU. otherwise function bits are useless, we forward the ALUop to ALU (ie 00 -> add, 11 -> sub, 01 -> passb). in this case ADD is R-format, so funct bits are used.

eg. ADDI X1, X2, #4 is I-format
PC operations are same as above. first

instruction[31-21] goes to control mux (even though actual opcode is less than 11 bits)
- note: ALUop is 00
- ALUSrc is 1
instruction[9-5] goes to Read register (X2)
all instruction[31-0] goes to Sign-extend
- Sign-extend interprets opcode and if it is ADDI or SUBI, then instruction[21-10] are unsigned, 12 bit immediate value i care about. Sign-extend generates a 64 bit number by padding 0's even if top bit may be 1. (note if other instructions where immediates are signed, it will pad 1's correctly for negative values, such as LDUR)

to the right:

Read Data 1 goes to ALU
Sign-extend goes to Mux, since ALUSrc is 1, it goes to ALU
as ALUop is 00, the ALU control does not care "funct bits", since 00 -> add, ALU performs ALU addition
writing back similar

eg. LDUR X1, [X2, #400] is D-format

Week 7. June 22

eg. CBZ X1, #100 is CB-format

at the top:

instruction[23-5] goes to sign extend and convert the signed bits to 64 bit signed number by correctly padding 2's complement.
- it is then shifted by 2 (multiplied by 4)
current PC goes to top right adder
- so PC + (address*4) is computed

at same time:

instruction[4-0] goes to Mux
- since Reg2Loc is 1, the Mux selects 1. so it goes to Read register 2, then Read data 2 outputs

at the right:

since ALUSrc is 0, Read data 2 goes to Mux and selects 0, so it goes to ALU.
since this is not R-format, and ALUOp is 01, ALU control forwards pass-b.
- original X1 is output by ALU
- there is a Zero bit corresponding to ALU's result. if result (X1) is 0, then the top AND gate is on. hence the top right Mux is 1. otherwise the Mux choose 0, the default PC+4.
- so PC + (address*4) is chosen and written back

single-cycle control

overview:

ALU control only tells ALU what to do (outputs inputs to ALU)

operation	ALUop	opcode	ALU control input	action
LDUR	00		0010	add
STUR	00		0010	add
CBZ	01		0011	passb
ADDI	00		0010	add
SUBI	11		0110	sub
ADD	10	100010 11000	0010	add
SUB	10	110010 11000	0110	sub
AND	10	100010 10000	0000	and
ORR	10	101010 10000	0001	or

Control (left) hardwares opcodes to control bits!
all combinational logic: no memory, no state bits, only AND, OR gates
there are funct bits within opcode, which is only used when opcode is R-format

timing

eg.

mem units: 200ps, ALU: 200ps, reg file: 100ps read + 100 ps write
other units negligible
then R-format takes 200 (instr mem) + 100 (read reg) + 200 (alu) + 100 (write reg) = 600ps
LDUR: 200 (instr mem) + 100 (read addr from reg) + 200 (alu (compute addr offset)) + 200 (read mem) + 100 (write to reg) = 800ps (slowest)
STUR: 200 + 100 + 200 + 200 = 700ps (don't need to write to reg)
CBZ: 200 (instr mem) + 100 (read reg) + 200 (alu) = 500ps (we say top right adder is negligible)
so clock period must be increased to 800ps.

eg. now assume two top PC adders also take 200ps. timings stay the same because top two adders run in parallel with bottom.

eg. now assume Control mux takes 10ps. then R-formats takes extra 10ps (reads both reg, have to wait for mux); LDUR does not (it only reads one reg); STUR takes extra 10ps; CBZ takes extra 10ps (its reg goes to Mux)

modifying data path

normally design complete dada path for all instructions together
one approach:
- determine new data path needed for new command
- check if any components in current data path can be used
- wire in components of new data path into existing
- add new control signals, adjust old signals

eg. B #400

31-26 is opcode, 25-0 is BR address. also compute PC + 4*offset.

type	Reg2Loc	ALUSrc	MemToReg	RegWrite	MemRead	MemWrite	Branch	ALUop1	ALUop0	Uncond branch
B	x	x	x	0	0	0	x	x	x	1

Uncond branch is 0 for other instructions.

needs to take 400ps in above settings.

memread, memwrite, regwrite must not be don't care

Week 8. June 29

multicycle datapath

take this datapath and break it up into 5 steps. use several clock cycles to execute one instruction
steps:
1. IF: instruction fetch, update PC+4 all times (may update again for branch instructions)
2. ID: instruction decode, read from register file
3. EX: execute, use ALUs
4. MEM: r/w memory, or maybe update PC
5. WB: may write data back to register file

each step/clock cycle is 200ps; every instruction must take 5 steps; total exectution time for every instruction is 1000ps (real muticycle control is smarter)
register banks (intermediate registers) store information computed in one stage that is needed by later stages
simple finite state machine controller will cycle through 5 steps in order
- at a time one cycle is on; at the end of a cc, data is written to intermediate registers; at the beginning of a cc, data is retrieved from intermediate registers
similarity to single cycle: one instruction has exclusive use of the datapath

pipelined datapath

intermediate registers are now pipeline registers: IFID, IDEX, EXMEM, MEMWB
- they store different control bits, regWrite, etc for different ongoing instructions
every stage of the pipeline contains a different instruction
one instruction begins every clock cycle
- ideal: one completed instruction every clock cycle
every pipeline register is updated every clock cycle
branch decision is in MEM stage

eg. at the end of every clock cycle, PC is updated to PC+4 instantly

pipelined control

generate all control bits, and instructions carry them in subsequent stages
(no need for control signals for PC and new pipeline registers)
in each stage, some control bits are used, and are then no longer needed for future stages
for Reg2Loc bit, it is inferred directly from part of the instruction before generating Control bits, so registers know what to read from in ID stage, control bits are generated in parallel in ID stage

pipeline hazards

hazard: event that blocks normal flow of instructions through pipeline

structural hazard: with single instruction/data memory, instruction fetch cannot overlap with load/store
- solution: separate instruction and data mem again
data harzard: result of one instruction is needed by next instruction
control hazard: conditional branch instruction may change sequence of instructions executed

data hazard

eg.

backward arrow means a value needs a future value - cannot go back in time
STUR is ok - value used after writing
ADD is ok: solution: allow reg file to be written to in 1st half, and read from in 2nd half
AND has data hazard: solution: delay 2 cycles
ORR has data hazard: solution: delay a cycle

new code:

000 SUB X2,X1,X3
004 NOP
008 NOP
012 AND X12,X2,X5
016 ORR X13,X6,X2
020 ADD X14,X2,X2
024 STUR X15,[X2,#100]

NOP has its own line number, uses one clock cycle

solution2: in SUB, X2 is computed in ALU in CC3, in AND, X2 is an input of ALU in CC4. we can forward the computed value directly.

forwarding datapath

ALUSrc bit and mux omitted
Forwarding unit takes in three register numbers to determine if the farwarding is needed
- 2 outputs: ForwardA, ForwardB to control input of ALU
  - 0: no forwarding
  - 1: forwarding from WB stage
  - 2: forwarding from MEM stage
only take data from where it is stable: pipeline registers, not middle of a clock cycle

eg. have two R-formats. if destination register of one instruction is the first source of the next instruction, then the condition is

EX/MEM.RegisterRd = ID/EX.RegisterRn1
EX/MEM.RegWrite = 1

where

ID/EX.RegisterRn1 means the name of the register indicated in the Rn field of instruction currently in ID/EX whose data come out of port 1 of the register file
ID/EX.RegisterRm2 means the name of the register indicated in the Rm field of instruction whose data come out of port 2 of the register file (output of Reg2Loc)

eg.

ADD X2, X6, X5
AND X12, X2, X5

eg. load-use hazard: this does not work if the first instruction is LDUR X2, [...] since the value is not ready. we have to delay it (by inserting a NOP between LDUR and AND), and do the forwarding next stage

LDUR X2, [X5, #100]     // NOP should be followed
AND X12, X2, X5

eg. consider code

1. LDUR X1,[X2,#10]
2. LDUR X3,[X5,#10]     // load-use hazard
3. ADD X3,X5,X3
4. STUR X3,[X5,#10]
5. LDUR X1,[X6,#10]     // load-use hazard
6. ADD X3,X1,X4
7. STUR X4,[X3,#10]
8. ADDI X6,X4,X4

assuming forwarding, then we need to insert a NOP after lines 2, 5
assuming without forwarding, we need to insert 2 NOPs after lines 2, 3, 5, 6

Week 9. July 6

conditions of forwarding

EX hazard 1a (forwarding from MEM to EX)
- if EX/MEM.RegWrite and EX/MEM.RegisterRd != 31 and EX/MEM.RegisterRd == ID/EX.RegisterRn1, then forwardA = 10
- 1b similar, but with Rm2, ForwardB
MEM hazard 2a (forwarding from WB to EX)
- if MEM/WB.RegWrite and MEM/WB.RegisterRd != 31 and not (EX/MEM.RegWrite and EX/MEM.Rd != 31 and EX/MEM.RegisterRd == ID/EX.RegisterRn1) and MEM/WB.RegisterRd == ID/EX.RegisterRn1, then ForwardA = 01
- - 1b similar, but with Rm2, ForwardB

forwarding helps everything except LDUR.

load-use stalling

the Hazard detection unit detects load-use hazard and inserts stalls in hardware
delay the instruction behind (ADD) for one cycle
- turn one copy of the instruction behind into a NOP by zeroing the control bits for the next clock cycle
- also block IF/ID register to not allow new instructions to be fetched and prevent PC from changing
so the instruction behind does nothing for one cycle, then continue going

conditions

instruction in ID/EX register must be load (can check MemRead signal)
if ID/EX.MemRead and (ID/EX.RegisterRd == IF/ID.RegisterRn1 or ID/EX.RegisterRd == IF/ID.RegisterRm2)

code rearrangement with data forwarding

do not swap lines of code with data dependencies
do not swap into or out of loops

eg. insert line 5 between 2 and 3 in the last eg to avoid NOP.

timing

assign one clock cycle per instruction, plus the pipelining start time time (4 cycles)

eg. in the last eg, it takes 8 (each) + 2 (two load-use stalling) + 4 (start) = 14 cycles

control hazard (CBZ)

eg.

before the branching decision is made in the branch in mem stage, three instructions (in IF, ID, EX stages) already began, we might want to cancel them if the branch condition is met.

if should take branch, flush the three instructions behind
zero out all control bits for 48 and 52 so they become NOP
44 does not have control bits yet, we may turn it into a harmless NOP instruction (changing opcode)
3 clock cycles wasted
if not take the branch, do nothing

branch in ID stage

branch decision is pulled to ID stage instead of MEM stage
- because we already know the content of X1 in CBZ X1,9 right after it goes out of register file
only flush one instruction in IF stage, thus wasting one clock cycle

eg. what is the running time for

100 ADDI X1,XZR,#20
104 ADDI X2,XZR,#0
108 LDUR X3,[X4,#0]  // loop
112 ADD X2,X2,X3     // hazard <- insert stalling
116 ADDI X4,X4,#8
120 SUBI X1,X1,#1
124 CBNZ X1,#-4      // every time this jumps, add flushing
128 SUB X5,X5,X2
132 LDUR X3,[X6,#100]
136 ADDI X29,XZR,#100

assuming forwarding, stalling, and branching in MEM

2 (first 2 lines)
+ 19(5+1+3) (first 19 iterations: 5 instr, 1 nop, 3 flush)
+ 5+1 (last iteration, 5 instr, 1 nop, no flush)
+ 3 (last 3 lines)
+ 4 (start pipeline) = 186

if there is no flushing, we have to insert 3 NOPs after line 124.

assuming branching in ID

there is branch data hazard in line 120 and 124 since CBNZ wants X1 in ID stage, but X1 now is being computed by SUBI in EX stage. we have to insert one branch data stalling NOP in between so that SUBI can finish computation and store X1 result in MEM/WB register, then forwarded back to CBNZ in ID.
2 (first 2 lines)
+ 19(5+1+1+1) (first 19 iterations: 5 instr, 1 stalling, 1 branch stalling, 1 flush)
+ 5+1+1 (last iteration, 5 instr, 1 stalling, 1 branch stalling, no flush)
+ 3 + 4 = 168

if there is no flushing, we have to insert 1 NOP after line 124.
if we swap line 112 and 120, both data hazard in 112 and branch data hazard in 124 are also resolved -> time: 168-20(2)=128

a branch data hazard can exist in any pipelined datapath when forwarding is not present.

branch prediction

in the IF stage

recognize the address of a CBZ instruction
store BTA in a prediction table which exists in IF stage
guess whether the branch will be taken

PC checks table every clock cycle, if it is CBZ, may choose to load the BTA right away
- or check the prediction bits
- eg: if branch was taken the last time this instruction is executed, take branch again
- still check if X1 is 0 in ID stage
- if branch was predicted incorrectly, flush it out
hardware needs to store
- address of CBZ instruction
- destination of CBZ instruction if taken

eg. strategies:

100 ADDI X4,X31,#6
    ...
120 SUBI X1,X1,#1
124 CBNZ X4,#-5
128 ADD X1,X2,X3

predict "not taken": 5 incorrect (always flush 128), 1 correct (go to 128)
predict "taken": 5 correct (go to 100), 1 incorrect (flush 100)
- hard to implement: need to know BTA ahead of time
"1-bit prediction": 1 incorrect, 4 correct, 1 incorrect
- we update prediction if get wrong once
- first time handling CBZ, record its address and BTA. if the branch is taken, set prediction bit to 1 for this record
- next times handling CBZ, locate this CBZ using the address and BTA, if prediction bit is 1, predict taken, otherwise predict not taken
- if prediction is wrong, update prediction bit
- 0 -> 1 -> 1 -> 1 -> 1 -> 1 -> 0

2-bit prediction

use 2 prediction bits
only if prediction is wrong twice will the prediction change

eg. for the above eg

first time encountering this code: 00 -> 01 -> 10 -> 11 -> 11 -> 11 -> 10 (2 incorrect, 3 correct, 1 incorrect)
second time: 10 -> 11 -> 11 -> 11 -> 11 -> 11 -> 10 (5 correct, 1 incorrect)

performance

assume 22% loads, 11% stores, 49% R-format, 16% branches, 2% jumps (uncond)
assume half of all loads followed by use (load-use hazard)
assume quarter of all branch predictions are incorrect (flushed)
average number of cycles per instruction: 0.22*1.5 + 0.11*1 + 0.49*1 + 0.16*1.25 + 0.02*2 = 1.17
one cycle uses 200ps
B: also in ID stage, but automatic flush of next instruction

Week 10. July 13

memory hierarchies

principles of locality:

temporal locality: if one data location is referenced once, then it will tend to be referenced again soon
- loops: fetch same instructions over again
spatial locality: if one data location is referenced, that data locations with nearby addresses will also tend to be referenced soon

speed	CPU	size	cost	current technology
fastest	memory	smallest	highest	SRAM
	memory			DRAM
slowest	memory	biggest	lowest	magnetic disk

main memory implemented from DRAM
SRAM for levels closer to the processor (caches)

cache associativity

ignore L2, L3; imagine we only have L1 cache (instruction memory/cache, data memory) and main memory.

direct mapped cache

eg. one way associative: items map to one and only one location

when CPU needs memory access at address abcde, cde called lower order 3 bits, ab called tag bits.

8x1 configuration
memory has 2^5 places, cache has 2^3 occupiable places
when request for the first time, there is a miss, the cache unit at index cde is populated with data at mem[abcde] copied from memory, and has a tag bit ab. the valid bit is updated to 1.
when requesting later, cde is indexed in cache, if V is 1, check the tag bit, if it is ab, get the data there.
if tag field does not match, kick out the original cache

V: valid bit: cache is meaningful or empty cache

eg. 2-way set associative cache: 2 rows

memory address being abcde, for cache, index bit is de, have two tag fields, each 3 bits (abc)

4x2 configuration
memory has 2^5 places, cache has 2^2 places
each cache row can store two data, by using two tag fields.
when miss, index cache using de, then find empty slots in this row and store abc as tag bits and copy mem data.

generally increase number of hits

eg. 4-way set associative cache: 2x4 configuration

eg. fully associative cache: 1x8 configuration, only one unique addr, 8 tag fields, 5 tag bits

eg. implementation of 4-way associative cache

note all 4 columns are compared against tag bits in parallel.

eg. as associativity (column) increase, lookup time remains the same but hardware increases, number of possible address in cache increases.

replacement scheme

what to replace in cache when there is no more room
LRU: one location needs to be overwritten, replace least recently used

eg.

larger block size

in prior example we assumed block sizes of one word, and not differentiate between instruction cache and data cache
now:
when requesting, not only one word is fetched, but blocks of words are copied
data cache:
- word size in data cache is 64 bits / 1 dword
instruction cache:
- instruction size in instruction cache is 32 bits / 1 word
- however, the size of data brought in from RAM is going to be multiples of 8 bytes
- block of 1 = dword / 64 bits / 2 instructions

eg.

block size: 4; always aligned; 1-way associative 8x4 configuration
if I try retrieve 104, i determine which block it belongs to (belongs to 96)
run time: 4 + 104 (miss) = 108cc
if I LDUR 368 (tag: 001, indx: 011, block: 10), i have miss. i will rewrite all 4 bytes starting from mem[352] (tag: 001, indx: 011, block: 00)
D: dirty bit:
- write back policy:
- if I STUR, I modify cache; now cache and mem have different data, turn D to 1
- if I LDUR same address, write cache content back to mem, then load mem back to cache (one extra 104cc), turn D to 0
if block size is 8, we need 3 block bit (take the lowest order indx bit and shift it to block bit; if indx bit needs more, take from tag bit)

eg.

run time: 104*3 (misses) + 104*1 (writeback) + 6(instr), assuming no hazard

average run time per instruction: 1 + 104*miss rate

Week 11. July 20

virtual memory

entire programs are stored on disk as pages (virtual memory block)
- page fault: virtual memory miss
virtual memory allows to page in and page out data as needed
- organizing relationship between RAM (physical memory) and disk
- run 2 programs on one processor
- every program thinks it begins at virtual address 0, virtual page 0
- page table in dedicated region in RAM maps from virtual page to physical page (from virtual address to physical address)
- every program has its own page table
- virtual page size = physical page size
LDUR X1, [X2,#0] is virtual address

page table

address broken up to 2 parts: virtual page number, offset
each page of data is a particular size
VPN -> PPN
offset into the page (12 bits) remain the same for both VP and PP
- indicates size of page: 4kb

when program 1 loaded:

step1: ptr stores the location of the page table.
step2: page table translates virtual page number (VPN) to physical page number (PPN)
- if that page is not loaded to memory yet (page fault), then go copy the page from disk
- valid bit tells this
and virtual address to physical address (PA) using a offset relative to page
- the offset needs not be multile of 4 or 8 -> go up by 1
- the offset is copied over (same for both virtual and physical)
- the full address is page number + offset

remark. all pages are always on disk.

page replacement policy:

optimal: replace page used farthest into the future
- good approximation: LRU (least recently used)
VM must use write-back (copyback)
- use dirty bit to denote whether page changed
if page tables are large, can use multi level

translation look aside buffer (TLB)

translation between virtual and physical addresses needs to look up page table in RAM
instead, use cache for address translations only
CPU gets a virtual address:
- first check TLB, if not present, check page table
- whenever there is miss, bring in data; also update page table and TLB

if Ref bit is on, that means the address translation exists in both page table and TLB. if we kick out the page from RAM (thus from page table), we release both page table and TLB references (set valid to 0, if dirty is 1, write back). we also remove cache entry.

if TLB is full, only kicking out an entry from TLB does not trigger memory write-backs, since the page still exists in page table
we can clear all reference bits to 0 if TLB is full

A best scenario: we hold virtual address, find physical address in TLB and data in cache. so we never leave CPU.
B TLB hit and cache miss
C TLB miss and cache hit
D worst: TLB miss and cache miss
TLB and cache are not interdepedent
- however cache and RAM access are dependenet
- TLB and RAM access are dependent
- page table and RAM access are dependent

remark. there are two levels of dirty bits: page level and cache level. for STUR, we update cache, page, then to disk.

swapping between two processes

need to save state of running program
- PC, registers, page table register all saved
- clear: flush cache and TLB
next program starts:
- PC: loaded back. PTR and state of registers restored
- initially cache and TLB will miss

dec	A2	A1	A0	D7	D6	D5	D4	D3	D2	D1	D0
0	0	0	0	0	0	0	0	0	0	0	1
1	0	0	1	0	0	0	0	0	0	1	0
2	0	1	0	0	0	0	0	0	1	0	0
3	0	1	1	0	0	0	0	1	0	0	0
4	1	0	0	0	0	0	1	0	0	0	0
5	1	0	1	0	0	1	0	0	0	0	0
6	1	1	0	0	1	0	0	0	0	0	0
7	1	1	1	1	0	0	0	0	0	0	0

dec	A2	A1	A0	D7	D6	D5	D4	D3	D2	D1	D0
0	0	0	0	0	0	0	0	0	0	0	1
1	0	0	1	0	0	0	0	0	0	1	0
2	0	1	0	0	0	0	0	0	1	0	0
3	0	1	1	0	0	0	0	1	0	0	0
4	1	0	0	0	0	0	1	0	0	0	0
5	1	0	1	0	0	1	0	0	0	0	0
6	1	1	0	0	1	0	0	0	0	0	0
7	1	1	1	1	0	0	0	0	0	0	0

Week 1. May 11

ARM overview

register

branch

conditional branch CBZ

memory access

Week 2. May 18

logic blocks

digitizing

discrete signal

specifying input/output behaviors

truth tables

don't cares in truth tables

laws of boolean algebra

gates

transistors

implementing gates with transistors

NMOS as Not gate

PMOS

CMOS (complementary MOS transistor)

CMOS NAND

2-input AND

3-input NAND

Week 3. May 25

decoders

memory expansion

multiplexer

array

clocks and sequential circuits

SR latch

D-latch

D Flip-Flop

Week 4. June 1

register file

write operation

read operation

finite state machine (Moore)

Week 5. June 8

ASCII

unsigned and signed binary numbers

logic operations

full adder

4-bit ripple-carry adder

arithmetic logic unit

1-bit ALU

64-bit ALU

multiplication

floating point number

Floating Point registers

IEEE 754

addition

multiplication

Week 6. June 15

memory hierarchy

static random access memory (SRAM)

three-state buffer

dynamic RAM (DRAM)

single cycle data path

high level view

instruction formats

opcodes

full data path

walking through data path

Week 7. June 22

single-cycle control

timing

modifying data path

Week 8. June 29

multicycle datapath

pipelined datapath

pipeline hazards

data hazard

forwarding datapath

Week 9. July 6

load-use stalling

code rearrangement with data forwarding

timing

control hazard (CBZ)

branch in ID stage

branch prediction

conditional branch `CBZ`

dec	A2	A1	A0	D7	D6	D5	D4	D3	D2	D1	D0
0	0	0	0	0	0	0	0	0	0	0	1
1	0	0	1	0	0	0	0	0	0	1	0
2	0	1	0	0	0	0	0	0	1	0	0
3	0	1	1	0	0	0	0	1	0	0	0
4	1	0	0	0	0	0	1	0	0	0	0
5	1	0	1	0	0	1	0	0	0	0	0
6	1	1	0	0	1	0	0	0	0	0	0
7	1	1	1	1	0	0	0	0	0	0	0