Instructor: Blake Van Berlo

Office hours: MW 11:30-12:30, DC 2130

Lectures: MW 10:00-11:20, MC 4040

Midterm: Oct. 17

Week 1. Sept 7

| | like humans | rationally |
|---|---|---|
| thinking | thinks like humans | thinks rationally |
| acting | acts like humans | acts rationally |

thinking humanly (cognitive modelling approach)

acting humanly (turing test approach)

rationality: an abstract 'ideal' of intelligence. a system is rational if it does the 'right thing' given what it knows.

thinking rationally (laws of thought approach)

acting rationally (rational agent approach)

why model behaviour rather than thought:

why use rationality rather than humans as the standard:

Week 2. Sept 12

defn. a search problem has

eg. 8-puzzle problem

5 3 _          1 2 3
8 7 6    =>    4 5 6
2 4 1          7 8 0

choosing formulations:

we often do not generate and store the graph explicitly; instead we build a search tree incrementally as we explore the search graph.

algo. (searching for solution)

  1. construct search tree as we explore paths incrementally from start state
  2. maintain a frontier of paths from start node
  3. frontier contains all paths for expansion
  4. expanding path: remove it from frontier, generate all neighbours of last node, and add paths ending with each neighbour to the frontier
search(G, start, test):
    frontier = {⟨start⟩}                    // paths consisting of just the start node
    while frontier is not empty:
        pop path n[0]...n[k] from frontier  // how we pop is the main difference between algorithms
        if test(n[k]):
            return n[0]...n[k]
        for neighbour n of n[k]:
            add n[0]...n[k]n to frontier

img

DFS

let $b$ be the branching factor, $m$ the max depth of the search tree, and $d$ the depth of the shallowest goal state, then

good when:

bad when:

BFS

frontier is a FIFO queue

useful when:

bad when:

img

defn. a search heuristic $h(n)$ is an estimate of the cost of the cheapest path from node $n$ to a goal node.

good heuristics:

properties

eg.

   S ---> A ---> B ---> C ---> ...
 1 |  1/2   1/4    1/8     
   v
   N

the algorithm never completes on this graph: the arc costs along the infinite chain form a convergent series, so paths along the chain always look cheaper than the arc to N.

properties

eg. suppose the cost of an arc is its length and $h$ is the euclidean distance to the goal. then the algorithm gets stuck in this graph:

img

eg. the optimal path is SBCG, but the algorithm found SAG:

img

properties

how to define an admissible heuristic:

  1. define a relaxed problem by simplifying or removing constraints
  2. solve the relaxed problem without search
  3. the cost of the optimal solution to the relaxed problem is an admissible heuristic for the original problem

desirable heuristic properties:

defn. given heuristics $h_1,h_2$, $h_2$ dominates $h_1$ if

theorem. if $h_2$ dominates $h_1$, then A* using $h_2$ never expands more nodes than A* using $h_1$.

some heuristics for 8-puzzle:

eg. which heuristic for the 8-puzzle is better? Manhattan distance is better (it dominates the misplaced-tile heuristic): it adds at least 1 for every out-of-place tile and often more, whereas the misplaced-tile heuristic adds exactly 1 per such tile.

cycle pruning

stop expanding path once we are following a cycle

...
for neighbour n of n[k]:
    if n not in n[0]...n[k]:
        add n[0]...n[k]n to frontier
...

multiple-path pruning

if we found a path to a node, we can discard other paths to same node.

search(G, start, test):
    frontier = {⟨start⟩}
    explored = {}
    while frontier is not empty:
        pop path n[0]...n[k] from frontier
        if n[k] in explored:
            continue
        add n[k] to explored
        if test(n[k]):
            return n[0]...n[k]
        for neighbour n of n[k]:
            add n[0]...n[k]n to frontier

eg.

consistent heuristic:

Week 3. Sept 19

constraint satisfaction

difference from heuristic search problem?

defn. in CSP, each state contains:

a solution is an assignment of values to all variables satisfying all constraints.

eg. state definition of 4-queens

eg. do you need to specify column constraint? no since it is already defined implicitly by separate variables for each column.

two ways of defining constraints:

we need to reformulate problem into incremental problem.

backtrack(assignment, CSP):
    if assignment is complete:
        return assignment
    var = unassigned variable
    for every value of var.domain:
        if adding {var=value} satisfies all constraints:
            add {var=value} to assignment
            result = backtrack(assignment, CSP)
            if result is not FAIL:
                return result
        remove {var=value} from assignment if it was added
    return FAIL

eg. incremental CSP formulation for 4-queen:

img

arc-consistency

notation. if $X,Y$ are two variables, write $c(X,Y)$ for a binary constraint between them.

img

defn. an arc $\langle X,c(X,Y)\rangle$ is arc-consistent iff for every value $v\in D_X$, there is a value $w\in D_Y$ such that $(v,w)$ satisfies the constraint $c(X,Y)$.

eg. consider the constraint $X<Y$. let $D_X=D_Y=\{1,2\}$. is the arc $\langle X,X<Y\rangle$ consistent?
no, since for $x=2$ we cannot find a value in $D_Y$ that satisfies the constraint.

algo. (AC-3 arc consistency)

// reduce domains to permit only valid answers
AC3(arcs):
    S = all arcs
    while S is not empty:
        select and remove ⟨X, c(X, Y)⟩ from S
        remove every value in X.domain that does not have a value in Y.domain \
            that satisfies the constraint c(X, Y)
        if X.domain was reduced:
            if X.domain is empty:  // no solution
                return false
            for every Z != Y:
                add ⟨Z, c'(Z, X)⟩ to S  // or c'(X, Z)
    return true

img

eg. why do we need to add arcs back to S after reducing a variable's domain?
reducing a variable's domain may cause a previously consistent arc to become inconsistent

properties:

algo. (combining backtracking & arc consistency)

  1. perform backtracking search
  2. after each value assignment, do arc consistency
  3. if domain is empty, return no solution (backtracks)
  4. if unique solution is found, return solution
  5. otherwise, continue search on the unassigned variables

difference:

complete-state formulation: start with a complete assignment of values to variables, then take steps to improve solution iteratively

defn. a local search problem consists of:

eg. local search problem for 4-queens

version 2 of the neighbour relation has disconnected components (eg. starting from 3211 we can never reach the optimal solution). however, even if the graph is connected, we still might not find the global optimum.

greedy descent (hill climbing)

analogy: descending into a canyon in a thick fog with amnesia.

properties:

img

eg. consider 3210 with relation 2: it is neither a local optimum nor a global optimum.

eg. consider 2301 with relation 1: it is a (flat) local optimum but not a global optimum.

escape flat local maxima:

choosing neighbour relation:

random moves:

eg. random restart is good for a, random walk is good for b
img

simulated annealing

let $A$ be the current state and $A'$ a worse neighbour, and let $T$ be the current temperature. then we move to $A'$ with probability $e^{-\frac{\Delta C}{T}}$ where $\Delta C=\mathrm{cost}(A')-\mathrm{cost}(A)$.

algo.

simulated_annealing():
    current = initial_state
    T = large positive value
    while T > 0:
        next = random_choice(current.neighbours)
        dC = cost(next) - cost(current)
        if dC < 0:
            current = next
        else:
            current = next with prob e^(-dC/T)
        decrease(T)
    return current

in practice, geometric cooling is most widely used (multiply by .99 each step)

population-based

instead of remembering one single state, we remember multiple states.

beam search

eg.

eg. how is beam search different from k random restarts in parallel? in beam search, useful information can be passed among the k parallel search threads.

problem with beam search: it suffers from a lack of diversity among the k states => they can quickly become concentrated in a small region of the state space.

stochastic beam search:

genetic algo:

Week 4. Sept 26

uncertainty

why uncertainty?

probability is formal measure of uncertainty.

two camps:

  1. Frequentists' view
  2. Bayesian's view (primary)

some axioms:

  1. $0\leq P(A)\leq1$
  2. $P(\mathrm{true})=1,P(\mathrm{false})=0$
  3. inclusion-exclusion principle: $P(A\lor B)=P(A)+P(B)-P(A\land B)$

defn.

defn. $P(X)$ is the prior or unconditional probability, ie the likelihood of X in the absence of any other information, based on background info.

defn. $P(X|Y)$ is the posterior or conditional probability, ie based on Y as evidence.

theorem. (sum rule) $P(A)=P(A\land B)+P(A\land\lnot B)$.

theorem. (product rule) $P(A|B)=\frac{P(A\land B)}{P(B)}$.

theorem. (chain rule) $P(X_n\land...\land X_1)=\prod_{i=1}^nP(X_i|X_{i-1}\land...\land X_1)$.

eg. given

then $P(AW\overline{G})=P(A)\cdot P(W|A)\cdot P(\overline{G}|AW)=.1\cdot.9\cdot.7=.063$.

theorem. (Bayes' rule) $P(X|Y)=\frac{P(Y|X)P(X)}{P(Y)}$.

universal approach

steps:

  1. convert conditional probability to joint probabilities
  2. change each joint probability to involve all variables
  3. calculate a joint probability using chain rule

eg. compute $P(A|C)$ given

  1. step 1: $P(A|C)=\frac{P(AC)}{P(C)}$
  2. step 2:
  3. step 3:
  4. finally $P(A|C)=...=.614$

defn. $X,Y$ are independent if any of the following is true:

defn. $X,Y$ are conditionally independent given $Z$ if any of the following is true:

eg. suppose we have $A,B,C$. how many probabilities do we need to specify the joint distribution?
$P(A),P(B|A),P(C|AB)\implies1+2+4=7$ probabilities.

img

in general, we need $2^n-1$ probabilities to specify the joint distribution of n boolean RVs.

eg. suppose we have $A,B,C$ and $A,B$ are conditionally independent given $C$. how many probabilities do we need to specify the joint distribution?
5.

img

bayesian network

defn. a bayesian network is a DAG where

it is a compact version of joint distribution.

there are two ways to understand bayesian networks:

we can compute each joint probability using

$$P(X_n...X_1)=\prod_{i=1}^nP(X_i|\mathrm{Parents}(X_i))$$

eg. what is the probability of $A\overline{B}\,\overline{E}WG\overline{R}$?
img

$$\begin{aligned} P(\overline{BE}A\overline{R}GW)&=P(\overline{B})P(\overline{E})P(A|\overline{BE})P(\overline{R}|\overline{E})P(G|A)P(W|A)\\ &=(1-0.0001)(1-0.0003)(0.01)(1-0.0002)(0.4)(0.8)\\ &=3.2\cdot10^{-3} \end{aligned}$$

eg. are B and W independent?

B -> A -> W

no. if B is true, then A is more likely true, so W is more likely true.

are B, W conditionally independent given A? yes.

eg. are W and G independent?

A -> W
  -> G

no. if W is more likely true, then A is more likely true, so G is more likely true.

are W, G conditionally independent given A? yes.

eg. are E, B independent?

E -> A
B ->

yes.

are E, B conditionally independent given A? no. suppose A is true. if E is true, then it is less likely that A is caused by B. if B is true, then it is less likely A is caused by E.

Week 5. Oct 3

defn. observed variable $E$ d-separates $X$ and $Y$ iff $E$ blocks every undirected path between $X$ and $Y$.

claim. if $E$ d-separates $X$ and $Y$, then $X$ and $Y$ are conditionally independent given $E$.

what do we mean by block?

case 1. if N is observed, then it blocks path between X, Y

X ---- A ---> N ---> B ---- Y
  any                   any
undirected          undirected
  path                 path

case 2. if N is observed, then it blocks path between X, Y

X ---- A <--- N ---> B ---- Y

case 3. if N and its descendants are NOT observed, then they block path between X, Y

X ---- A ---> N <--- B ---- Y
              |
              v
             ...

otherwise we cannot guarantee an independence relationship.

eg.
img

constructing bayesian networks

for a joint probability distribution, there are many correct bayesian networks

defn. given a bayesian network $A$, a bayesian network $B$ is correct iff:

we prefer a network that requires fewer probabilities.

notes:

an edge only represents correlation, not causality.

algo. (constructing correct bayesian network)

  1. order the variables $\{X_1,...,X_n\}$
  2. for each variable $X_i$,
    1. choose the node's parents such that $P(X_i|\mathrm{Parents}(X_i))=P(X_i|X_{i-1}\land...\land X_1)$
    2. ie choose the smallest set of parents from $\{X_1,...,X_{i-1}\}$ such that given $\mathrm{Parents}(X_i)$, $X_i$ is independent of all nodes in $\{X_1,...,X_{i-1}\}-\mathrm{Parents}(X_i)$
    3. create a link from each parent of $X_i$ to $X_i$
    4. write down the conditional probability table $P(X_i|\mathrm{Parents}(X_i))$

eg. consider network

B ---> A ---> W

construct correct bayesian network by adding variables W, A, B.

W --> A --> B

in the original network, B and W have no edges so there is no requirement for B and W in new graph. we could draw arrow from W to B.

eg. consider network

A ---> W
 \
  v
   G

construct correct bayes net by adding variables in the order: W, G, A.

W ---> G
 \    /
  v  v
    A

Step 1: add W to the network

Step 2: add G to the network

What are the parent nodes of G? The network had one node before adding G. We have two options: Either G has no parent, or W is G’s parent.

If W is not G’s parent, then the new network requires G and W to be unconditionally independent (since we haven’t added A to the network yet). Can the new network require this?

Let’s look at the original network. In the original network, there is no edge between W and G. By d-separation, W and G are independent given A. However, if we haven’t observed A, W and G are not unconditionally independent. Since the original network does not require W and G to be unconditionally independent, the new network also cannot require W and G to be unconditionally independent. Therefore, W must be G’s parent.

Step 3: add A to the network. What are the parent nodes of A? W and G were added before adding A. We have four options: no parent, W is the only parent, G is the only parent, and W and G are both parents of A.

In the original network, W and A are directly connected. The original network does not require W and A to be independent. Therefore, the new network cannot require W and A to be independent. W must be A’s parent.

Similarly, in the original network, G and A are directly connected. The original network does not require G and A to be independent. Therefore, the new network cannot require G and A to be independent. G must be A’s parent as well.

eg. consider network

E ---> A
      ^
     /
    B

construct correct bayes net by adding variables in the order: A, B, E.

A ---> B
 \    /
  v  v
   E

Step 1: add A to the network.

Step 2: add B to the network. What are the parent nodes of B? The network had one node before adding B. We have two options: Either B has no parent, or A is B’s parent.

In the original network, A and B are directly connected. The original network does not require A and B to be independent. Thus, the new network cannot require A and B to be independent. A must be B’s parent.

Step 3: add E to the network.

What are the parent nodes of E? A and B were added before E. We have four options: no parent, A is the only parent, B is the only parent, and A and B are both parents of E.

In the original network, A and E are directly connected. The original network does not require A and E to be independent. Therefore, the new network cannot require A and E to be independent. A must be E’s parent.

Should B be E’s parent or not? If B is not E’s parent, then the new network requires E to be independent of B given A. Can the new network require this? Let’s look at the original network. In the original network, B and E are not independent given A. Since the original network does not require B and E to be independent given A, then the new network also cannot require B and E to be independent given A. Therefore, B must also be E’s parent.

eg.
img

original network has 12 probabilities, new net has 33 probabilities.

the original network is easier to specify since it is built from causal relationships.

how to add the fewest edges? cause precedes effect: add the cause variables first, then the effects.

variable elimination

eg. compute the probabilistic query $P(B|w\land g)=\langle P(b|w\land g),P(\lnot b|w\land g)\rangle$, where B is the query variable, W, G are evidence variables, and E, A, R are hidden variables. lowercase letters denote values.

$$\begin{aligned} P(B | w \wedge g) &=\frac{P(B \wedge w \wedge g)}{P(w \wedge g)} \\ &=\frac{P(B \wedge w \wedge g)}{P(b \wedge w \wedge g)+P(\neg b \wedge w \wedge g)} \\ & \propto P(B \wedge w \wedge g) \\ & \propto \sum_e \sum_a \sum_r P(B \wedge e \wedge a \wedge w \wedge g \wedge r) \\ & \propto \sum_e \sum_a \sum_r P(B) P(e) P(a | B \wedge e) P(w | a) P(g | a) P(r | e)\\ & \propto P(B)\sum_e P(e)\sum_a P(a|B\land e)P(w|a)P(g|a)\sum_r P(r|e)\\ & \propto P(B)\sum_e P(e)\sum_a P(a|B\land e)P(w|a)P(g|a) \end{aligned}$$

normalize at the end. this reduces the computation from 57 operations to 14.

algo. (inference by enumeration)

enumerate(vars, bn, evidence):
    if vars is empty:
        return 1.0
    Y = vars[0]
    if Y has value y in evidence:
        return P(y|parents(Y)) * enumerate(vars[1:], bn, evidence)
    else:
        return ∑_y P(y|parents(Y)) * enumerate(vars[1:], bn, evidence ∪ {Y=y})

it is challenging to perform probabilistic inference so we use approximation.

defn. a factor is a function from some random variables to a number.

a factor can represent a joint distribution, a conditional distribution, or neither.

defn. to restrict a factor means to assign an observed value to an evidence variable.

defn. to sum out a variable $X_1$ with domain $\{v_1,...,v_k\}$ from factor $f(X_1,...,X_j)$, we produce a factor defined by:

$$f_{\sum_{X_1}}(X_2,...,X_j)=f(X_1=v_1,...,X_j)+...+f(X_1=v_k,...,X_j)$$

defn. to multiply factors $f_1(X,Y)$ and $f_2(Y,Z)$ where $Y$ is the variable in common, the product is defined by (the domain is the cartesian product):

$$(f_1\times f_2)(X,Y,Z)=f_1(X,Y) \times f_2(Y,Z)$$

defn. to normalize a factor, divide each value by the sum of all values.

eg. given factor $f_1(X,Y,Z)$:

X Y Z value
t t t .1
t t f .9
t f t .2
t f f .8
f t t .4
f t f .6
f f t .3
f f f .7

what is $f_2(Y,Z)=f_1(X=x,Y,Z)$?

Y Z value
t t .1
t f .9
f t .2
f f .8

what is $f_3(Y)=f_2(Y,Z=\lnot z)$?

Y value
t .9
f .8

what is $f_4()=f_3(Y=\lnot y)$? $f_4()=.8$

what is $f_{\sum_{X}}(Y,Z)$?

Y Z value
t t .1+.4=.5
t f .9+.6=1.5
f t .2+.3=.5
f f .8+.7=1.5

what is $(f_3\times f_{\sum_X})(Y,Z)$?

Y Z value
t t .9*.5
t f .9*1.5
f t .8*.5
f f .8*1.5

algo. (variable elimination) to compute $P(X_q|X_{o1}=v_1\land...\land X_{oj}=v_j)$,

  1. construct a factor for each conditional probability distribution
  2. restrict observed variables to their observed values
  3. eliminate each hidden variable $X_{hj}$
    1. multiply all factors that contain $X_{hj}$ to get $g_j$
    2. sum out the variable $X_{hj}$ from the factor $g_j$
  4. multiply the remaining factors
  5. normalize the resulting factor

eg. find $P(B|\lnot a)$
img

properties of VEA:

eg. which of R and A do we eliminate first?
img

eliminate R first. otherwise A's factor is big.

choosing elimination ordering:

claim. (irrelevant variables) every variable that is not an ancestor of a query variable or evidence variable is irrelevant to query.

Week 7. Oct 17

hidden markov models

we let the current state depend on a fixed number of previous states.

defn. (1st order markov assumption) the first-order markov process has transition model:

$$P(S_t|S_{t-1}S_{t-2}...S_0)=P(S_t|S_{t-1})$$

ie current state only depends on last state.

defn. (sensor markov assumption) each state is sufficient to generate its observation:

$$P(O_t|S_tS_{t-1}...S_0O_{t-1}...O_0)=P(O_t|S_t)$$

stationary process:

eg. outside it may or may not be raining, and people may or may not bring umbrellas into the room, so there is a dependence between rain and umbrella.

then we have network
img

common tasks in inference

defn. (filtering) the posterior distribution over the most recent state given all evidence to date: $P(S_t|O_{0:t})$

algo.

proof. the base case comes from bayes rule; the recursion is:

$$\begin{aligned} P(S_k | o_{0: k})&=P(S_k | o_k \wedge o_{0:(k-1)}) \\ &=\alpha P(o_k | S_k \wedge o_{0:(k-1)}) P(S_k | o_{0:(k-1)})&\text{(bayes rule)}\\ &=\alpha P(o_k | S_k) P(S_k | o_{0:(k-1)})&\text{(markov assumption)} \\ &=\alpha P(o_k | S_k) \sum_{s_{k-1}} P(S_k \wedge s_{k-1} | o_{0:(k-1)})&\text{(sum rule)} \\ &=\alpha P(o_k | S_k) \sum_{s_{k-1}} P(S_k | s_{k-1} \wedge o_{0:(k-1)}) P(s_{k-1} | o_{0:(k-1)})&\text{(product rule)} \\ &=\alpha P(o_k | S_k) \sum_{s_{k-1}} P(S_k | s_{k-1}) P(s_{k-1} | o_{0:(k-1)})&\text{(see network)} \end{aligned}$$

\square

eg. filtering in the umbrella example we have

$$\begin{aligned} P(S_0|o_0)&=\alpha P(o_0|S_0)P(S_0)\\ &=\alpha\langle0.9,0.2\rangle*\langle0.5,0.5\rangle\\ &=\alpha\langle0.45,0.1\rangle\\ &=\langle0.818,0.182\rangle \end{aligned}$$

then

$$\begin{aligned} P(S_1|o_{0:1})&=\alpha P(o_1|S_1)\sum_{s_0}P(S_1|s_0)P(s_0|o_0)\\ &=\alpha P(o_1 | S_1) *(P(S_1 | s_0) * P(s_0 | o_0)+P(S_1 | \neg s_0) * P(\neg s_0 | o_0))\\ &=\alpha\langle P(o_1 | s_1), P(o_1 | \neg s_1)\rangle\\ &\quad*(\langle P(s_1 | s_0), P(\neg s_1 | s_0)\rangle * P(s_0 | o_0)\\ &\quad+\langle P(s_1 | \neg s_0), P(\neg s_1 | \neg s_0)\rangle * P(\neg s_0 | o_0))\\ &=\alpha\langle 0.9,0.2\rangle(\langle 0.7,0.3\rangle * 0.818+\langle 0.3,0.7\rangle * 0.182) \\ &=\alpha\langle 0.9,0.2\rangle(\langle 0.5726,0.2454\rangle+\langle 0.0546,0.1274\rangle) \\ &=\alpha\langle 0.9,0.2\rangle *\langle 0.6272,0.3728\rangle \\ &=\alpha\langle 0.56448,0.07456\rangle\\ &=\langle 0.883,0.117\rangle \end{aligned}$$

defn. (prediction) the posterior distribution over a future state given all evidence to date: $P(S_k|O_{0:t})$ for $k>t$.

use the recursion: $P(S_{k+1}|o_{0:t})=\sum_{s_k}P(S_{k+1}|s_k)P(s_k|o_{0:t})$.

defn. (smoothing) the posterior distribution over a past state given all evidence to date: $P(S_k|O_{0:t})$ for $0\leq k<t$.

algo. given observations from day 0 to t-1, to compute $P(S_k|o_{0:(t-1)})$ where $0\leq k\leq t-1$, we use

$$\begin{aligned} P(S_k|o_{0:(t-1)})&=\alpha P(S_k|o_{0:k})P(o_{(k+1):(t-1)}|S_k)\\ &=:\alpha f_{0:k}b_{(k+1):(t-1)} \end{aligned}$$

we calculate $b_{(k+1):(t-1)}$ using a backward recursion:

proof. for the formula:

$$\begin{aligned} P(S_k | o_{0:(t-1)}) &=P(S_k | o_{(k+1):(t-1)} \wedge o_{0: k}) \\ &=\alpha P(S_k | o_{0: k}) P(o_{(k+1):(t-1)} | S_k \wedge o_{0: k})&\text{(bayes rule)} \\ &=\alpha P(S_k | o_{0: k}) P(o_{(k+1):(t-1)} | S_k)&\text{(markov assumption)} \end{aligned}$$

for the backward recursion:

$$\begin{aligned} P(o_{(k+1):(t-1)} | S_k) &=\sum_{s_{k+1}} P(o_{(k+1):(t-1)} \wedge s_{k+1} | S_k) &\text{(sum rule)}\\ &=\sum_{s_{k+1}} P(o_{(k+1):(t-1)} | s_{k+1} \wedge S_k) * P(s_{k+1} | S_k)&\text{(product rule)} \\ &=\sum_{s_{k+1}} P(o_{(k+1):(t-1)} | s_{k+1}) * P(s_{k+1} | S_k) &\text{(markov assumption)}\\ &=\sum_{s_{k+1}} P(o_{k+1} \wedge o_{(k+2):(t-1)} | s_{k+1}) * P(s_{k+1} | S_k) \\ &=\sum_{s_{k+1}} P(o_{k+1} | s_{k+1}) * P(o_{(k+2):(t-1)} | s_{k+1}) * P(s_{k+1} | S_k)&\text{(markov assumption)} \end{aligned}$$

\square

eg. compute $P(S_1|o_{0:2})$.
with k=1, t=3, we need to compute $\alpha f_{0:1}b_{2:2}$.

$$\begin{aligned} b_{2:2}&=P(o_{2:2} | S_1) \\ &=\sum_{s_2} P(o_2 | s_2) * b_{3:2} * P(s_2 | S_1) \\ &=\sum_{s_2} P(o_2 | s_2) * P(o_{3:2} | s_2) * P(s_2 | S_1) \\ &=P(o_2 | s_2) * P(o_{3:2} | s_2) *\langle P(s_2 | s_1), P(s_2 | \neg s_1)\rangle \\ &\quad+P(o_2 | \neg s_2) * P(o_{3:2} | \neg s_2) *\langle P(\neg s_2 | s_1), P(\neg s_2 | \neg s_1)\rangle \\ &=0.9 * 1 *\langle 0.7,0.3\rangle+0.2 * 1 *\langle 0.3,0.7\rangle \\ &=\langle 0.63,0.27\rangle+\langle 0.06,0.14\rangle \\ &=\langle 0.69,0.41\rangle \end{aligned}$$

eg. reuse intermediate values to compute $P(S_0|o_{0:3})$ to $P(S_3|o_{0:3})$ in a row.
img

defn. (most likely explanation) find the sequence of states that most likely generated all the evidence to date, ie $\argmax_{S_{0:t}}P(S_{0:t}|O_{0:t})$

Week 8. Oct 24

learning

the learning architecture

types of learning problems:

two types of supervised learning problems:

supervised learning

learning as search problem:

eg. fitting points

img

all curves can be justified as the correct one from some perspective.

no free lunch theorem: to learn something useful, we have to make some assumptions - have an inductive bias

generalization:

bias-variance tradeoff:

img

img

cross validation

steps:

  1. break the training data into k equally sized partitions
  2. train the learning algorithm on k-1 partitions
  3. test on the remaining partition
  4. do this k times
  5. calculate the average error to assess the model

after the cross validation

decision tree

how to build decision tree?

which decision to create?

eg. will Bertie play tennis?

training set

| Day | Outlook | Temp | Humidity | Wind | Tennis? |
|-----|----------|------|----------|--------|---------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

test set

| Day | Outlook | Temp | Humidity | Wind | Tennis? |
|-----|----------|------|----------|--------|---------|
| 1 | Sunny | Mild | High | Strong | No |
| 2 | Rain | Hot | Normal | Strong | No |
| 3 | Rain | Cool | High | Strong | No |
| 4 | Overcast | Hot | High | Strong | Yes |
| 5 | Overcast | Cool | Normal | Weak | Yes |
| 6 | Rain | Hot | High | Weak | Yes |
| 7 | Overcast | Mild | Normal | Weak | Yes |
| 8 | Overcast | Cool | High | Weak | Yes |
| 9 | Rain | Cool | High | Weak | Yes |
| 10 | Rain | Mild | Normal | Strong | No |
| 11 | Overcast | Mild | High | Weak | Yes |
| 12 | Sunny | Mild | Normal | Weak | Yes |
| 13 | Sunny | Cool | High | Strong | No |
| 14 | Sunny | Cool | High | Weak | No |

use the following order of testing features:

  1. test outlook
  2. for outlook = sunny, test temp
  3. for outlook = rain, test wind
  4. for all other cases, test humidity then wind

img

$$\begin{aligned} &(Outlook=Sunny\land Temp=Cool)\\ &\lor(Outlook=Sunny\land Temp=Mild\land Humidity=Normal)\\ &\lor... \end{aligned}$$

stopping conditions

eg. no more features.
img

eg. no examples.
img

algo. (learning decision tree)

learner(examples, features):
    if all examples are same class:
        return class label
    if no features left:
        return majority decision
    if no examples left:
        return parent's majority decision
    choose feature f
    for each value v of f:
        build an edge with label v
        build subtree using examples where value of f is v

determining orders

each decision tree encodes a propositional formula. if we have n features, then each function corresponds to a truth table with $2^n$ rows, so there are $2^{2^n}$ possible truth tables.

we want to use greedy search => myopic decision at each step.

eg. which tree to use?
img
use 1st tree.

defn. given a distribution $P(c_1),...,P(c_k)$ over $k$ outcomes $c_1,...,c_k$, the entropy is

$$I(P(c_1),...,P(c_k))=-\sum_{i=1}^kP(c_i)\log_2(P(c_i))$$

it is a measure of uncertainty.

eg. consider a distribution over two outcomes $\langle p,1-p\rangle$ where $0\leq p\leq 1$. the max entropy is 1 (at p=1/2); the min entropy is 0 (at p=0 or 1).

when deciding on a feature, compute the entropy before testing the feature:

$$H_i=I\left(\frac{p}{p+n},\frac{n}{p+n}\right)$$

where $p$ is the # of positive cases and $n$ is the # of negative cases. the expected entropy after testing the feature is

$$H_f=\sum_{i=1}^k\frac{p_i+n_i}{p+n}I\left(\frac{p_i}{p_i+n_i},\frac{n_i}{p_i+n_i}\right)$$

where $p_i,n_i$ are the # of positive/negative cases for feature value $v_i$. then the information gain (entropy reduction) is $H_i-H_f$.

we should test the feature that has more information gain.

real-valued features

method to choose split points

  1. sort instances according to real-valued feature
  2. possible split points are values that are midway between two different and adjacent values
  3. suppose feature value changes from X to Y, should we consider (X+Y)/2 as possible split point?
    1. let $L_X$ be all labels for examples where the feature takes value $X$
    2. let $L_Y$ be all labels for examples where the feature takes value $Y$
    3. if there exist labels $a\in L_X,b\in L_Y$ such that $a\neq b$, then (X+Y)/2 is a possible split point
  4. determine expected info gain for each possible split point and choose one with largest gain

eg. is midway between 21.7 and 22.2 possible split point if we have

1. 21.7 No
2. 22.2 No
3. 22.2 Yes

yes. the 1st and 3rd labels are different.

overfitting

problem: growing a full tree is likely to lead to overfitting.

strategies:

Week 9. Oct 31

neural network

when the relationships between input and output are complex, we can build a model that mimics the human brain.

simple model of neuron:

img

desirable properties of activation function

common activation functions:

defn. (step function)

$$g(x)=\begin{cases} 1, &x>0\\ 0, &x\leq0 \end{cases}$$

defn. (sigmoid function)

$$\sigma(x)=\frac{1}{1+e^{-kx}}$$

img

defn. (rectified linear unit / ReLU)

$$g(x)=\max(0,x)$$

defn. (leaky ReLU)

$$g(x)=\max(0,x)+k\min(0,x)$$

img

perceptrons

kinds of networks:

perceptrons:

eg. using step function as activation function, what does this represent?
img

$$o_1=g\left(\sum_{i}w_ix_i\right)=g(1\cdot0.5-x_1-x_2)=g(0.5-x_1-x_2)$$

it is NOR gate.

limitations of perceptrons:

img

eg. show it is not possible to use perceptron to learn XOR.
suppose we can represent XOR using a perceptron and the activation function is step function. then we have

$$\begin{aligned} w_{21}\cdot1+w_{11}\cdot0+w_{01}>0&\implies w_{21}>-w_{01}\\ w_{21}\cdot0+ w_{11}\cdot1+w_{01}>0&\implies w_{11}>-w_{01}\\ w_{21}\cdot1+w_{11}\cdot1+w_{01}\leq0&\implies w_{21}+w_{11}\leq-w_{01}\\ w_{21}\cdot0+ w_{11}\cdot0+w_{01}\leq0&\implies w_{01}\leq0 \end{aligned}$$

adding the first two inequalities we get $w_{21}+w_{11}>-2w_{01}$. from the 4th inequality we get $-2w_{01}\geq-w_{01}$.

then we have $w_{21}+w_{11}>-2w_{01}\geq-w_{01}\geq w_{21}+w_{11}$, a contradiction.

we can use a 2-layer perceptron to learn the XOR function:
img

it is an AND of an OR and a NAND: $(x_1\lor x_2)\land\lnot(x_1\land x_2)$.

gradient descent

a local search algo to find minimum of a function.

steps:

  1. initialize weights randomly
  2. change each weight in proportion to the negative derivative of the error wrt the weight, ie $W:=W-\eta\frac{\partial E}{\partial W}$, where η is the learning rate (a small sketch of this update appears below)
  3. terminate when

if the gradient is large, the curve is steep and we are likely far from the minimum (but a large step risks overshooting). if the gradient is small, the curve is flat and we are likely close to the minimum.

updating weights based on data points:

eg. consider network
img

let $\hat{y}=z^{(2)}$ be the output of this network.

backpropagation:

forward pass:

backward pass:

in general, to compute $\delta_j^{(\ell)}=\frac{\partial E}{\partial a_j^{(\ell)}}$ for unit j at layer ℓ, we have

$$\delta_j^{(\ell)}=\begin{cases} \frac{\partial E}{\partial z_j^{(\ell)}}\cdot g'(a_j^{(\ell)}), &j\text{ is an output unit}\\ \left(\sum_k\delta_k^{(\ell+1)}W_{jk}^{(\ell+1)}\right)\cdot g'(a_j^{(\ell)}),&j\text{ is a hidden unit} \end{cases}$$

matrix notation:

comparison

when to use neural network:

when not to use neural network:

comparison with decision tree:

| | neural network | decision tree |
|---|---|---|
| data types | images, audio, text | tabular data |
| size of data | lots; easily overfit | little |
| form of target function | can model any function | nested if-else |
| architecture | # layers, # neurons per layer, activation func, initial weights, learning rate; all are important | some params to prevent overfitting |
| interpret learned function | blackbox; difficult to interpret | easily interpretable |
| time available to train | slow | fast |

Week 10. Nov 7

clustering

unsupervised learning tasks:

clustering: common unsupervised representation learning task.

two types of clustering:

k-means

eg. k=3, x=[x1,x2]
img

algo. (k-means)

  1. assign each example $x$ to a random cluster: $Y\in\N^m$
  2. randomly initialize $k$ centroids $C\in M_{k\times n}(\R)$
  3. while not converged:
    1. for each cluster $c$:
      1. calculate the centroid by averaging the feature values of the examples currently assigned to cluster $c$: $C_c\leftarrow \frac{1}{n_c}\sum_{j=1}^{n_c}X_{cj}$
    2. for each example $x$:
      1. assign x to the cluster whose centroid is closest: $Y_i\leftarrow \argmin_c d(C_c,X_i)$

finding best solution:

the elbow method:

  1. run k-means with multiple values of $k\in\{1,2,...,k_{\max}\}$
  2. plot average distance across all examples and assigned clusters
  3. select k where there is drastic reduction in error

img

it is manual so can be ambiguous.

silhouette analysis

  1. run k-means with multiple values of $k\in\{1,2,...,k_{\max}\}$
  2. calculate the average silhouette score $s(x)$ for each k across the data set, where

    $$s(x)=\begin{cases} \frac{b(x)-a(x)}{\max(a(x),b(x))}&,|C_x|>1\\ 0&,|C_x|=1 \end{cases}$$

  3. choose k that maximizes the score

autoencoder

also a representation learning algo that learns to map examples to low-dimensional representation

it has 2 main components

eg. the linear autoencoder is the simplest form of autoencoder, where $e$ and $d$ are linear functions with a shared weight matrix $W$:

$$\begin{aligned} \hat{z}&=Wx\\ \hat{x}&=W^\intercal\hat z \end{aligned}$$

eg. deep neural network autoencoder

img

generative adversarial networks (GANs)

a generative unsupervised learning algo to generate unseen examples that look like training examples

a GAN is a pair of neural networks:

algorithmic bias

Mathieu Doucet

examples: HR systems for screening job applicants based on ML; credit systems for screening loan applicants and setting interest rates; insurance systems; criminal risk assessment; crime prediction systems for deploying police resources; university admission systems

some questions

statistical/moral

bias-in-bias-out

policymaker problem: how to use the prediction (if it is biased)

solutions:

can the algorithm itself reflect bias?

what are we predicting?

Week 11. Nov 14

explainability

black box model: a model whose predictions are not understood because of its high complexity.

interpretability: how well we can understand how a model gives a prediction

explainability: how well we can understand why a model produces an output

intrinsic explanations: explanations are a consequence of interpretable model design

post-hoc explanations: explanations for models that are already trained

global explanations: explains functionality of entire model

local explanations: explanations for individual predictions or for related examples

model-specific explanations: method only works for specific type of model architecture

model-agnostic explanations: method can produce explanations for any model architecture

LIME (local interpretable model-agnostic explanations)

Marco Túlio Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the predictions of any classifier," CoRR, vol. abs/1602.04938, 2016

categorization: post-hoc, model-agnostic, local

steps:

  1. normalize features
  2. given any model $f$ that takes input $x$ and returns $\hat{y}$, create a new dataset $X'$ of fictitious examples surrounding $x$
  3. train an interpretable model $g$ using $x'\in X'$ as inputs and $f(x')$ as labels
  4. the explanation is obtained by inspecting $g$

advantages:

disadvantages

MACE (model-agnostic counterfactual explanations)

A. H. Karimi, G. Barthe, B. Balle, and I. Valera, “Model-agnostic counterfactual explanations for consequential decisions,” 2020, vol. 108, pp. 895–905

categorization: post-hoc, model-agnostic, local

counterfactual explanation: for an example $x$, a counterfactual explanation is another (usually fictitious) example $x_c$ such that $x\neq x_c$ and $f(x)\neq f(x_c)$.

steps:

  1. create a characteristic formula $\phi_f$ that describes the model's functionality
  2. define a counterfactual formula $\phi_{\text{CF}f}$ for example $x$ that is true for a counterfactual $x_c$ such that $f(x)\neq f(x_c)$ and $d(x,x_c)\leq\delta$
  3. find the closest counterfactual $x_c$ that satisfies $\phi(x,x_c)=\phi_{\text{CF}f}(x_c)\land\phi_d(x,x_c)$

approximation:

not all explanations are useful. we can add more conditions (plausibility criteria) to the logical formula that specify features that cannot be changed.

eg. You have a client who wants to sell their house for $1 500 000, but your model predicts it will sell for $1 430 000. The client asks why your model predicts a price lower than their target, so you apply MACE specifying in the logical formula that the minimum prediction is $1 500 000.

advantages:

disadvantages:

decision network

decision network = bayes net + actions + utilities

eg. The robot must choose its route to pick up the mail. There is a short route and a long route. On the short route, the robot might slip and fall. The robot can put on pads. Pads won’t change the probability of an accident. However, if an accident happens, pads will reduce the damage. Unfortunately, the pads add weight and slow the robot down. The robot would like to pick up the mail as quickly as possible while minimizing the damage caused by an accident.

img

there is no unique utility function. use common sense to choose constraints

eg. when an accident occurs, does the robot prefer the short route or the long route?
the robot must have taken the short route, so there is no utility for $\overline{PS}A$ and $P\overline{S}A$.

eg. when an accident occurs, does robot prefer pads?
the robot would prefer pads as they reduce damage => $U(PSA)>U(\overline{P}SA)$.

how to choose an action:

  1. set evidence variables for current state
  2. for each possible value of decision node
    1. set decision node to that value
    2. calculate posterior probability for parent nodes of the utility node
    3. calculate expected utility for action
  3. return action with highest expected utility

eg. compute expected utility of not wearing pads and choosing long route

$$\begin{aligned} E_U(\overline{P},\overline{S})&=P(\overline{PSA}|\overline{PS})U(\overline{PSA})+P(\overline{PS}A|\overline{PS})U(\overline{PS}A)\\ &=P(\overline{A}|\overline{PS})U(\overline{PSA})+P(A|\overline{PS})U(\overline{PS}A)\\ &=P(\overline{A}|\overline{S})U(\overline{PSA})+P(A|\overline{S})U(\overline{PS}A)\\ &=1\cdot6+0\cdot-\infty\\ &=6 \end{aligned}$$

similarly we compute the other three values.

img

we can instead use VEA to evaluate the network:

  1. prune all nodes that are not ancestors of utility node
  2. eliminate (sum out) all chance nodes
  3. for the single remaining factor, return the max assignment that gives max value

since we are making the S and P decisions at the same time, we can combine them into a single node whose domain is the cross product of the original domains.

img

eg. consider the weather network, where decision node depends on a chance node.

img

since forecast has 3 values and there are 2 possible decisions, there are 2^3=8 possible policies.

eg. consider the policy: take umbrella iff forecast is rainy. what is expected utility?

EU = P(no_rain,leave_it)*u(no_rain,leave_it)
     + P(no_rain,take_it)*u(no_rain,take_it)
     + P(rain,leave_it)*u(rain,leave_it)
     + P(rain,take_it)*u(rain,take_it)
   = P(no_rain)*P(sunny or cloudy|no_rain)*u(no_rain,leave_it)
     + P(no_rain)*P(rainy|no_rain)*u(no_rain,take_it)
     + P(rain)*P(sunny or cloudy|rain)*u(rain,leave_it)
     + P(rain)*P(rainy|rain)*u(rain,take_it)
   = .7*(.7+.2)*100
     + .7*.1*20
     + .3*(.15+.25)*0
     + .3*.6*70
   = 77

VEA steps:

  1. prune all nodes that are not ancestors of utility node
  2. define a factor for every non-decision node
  3. while there are decision nodes remaining
    1. eliminate every random variable that is not a parent of a decision node
    2. find optimal policy for the last decision and eliminate decision variable
  4. return optimal policies
  5. determine expected utility following the optimal policy by eliminating all remaining random variables

img

Week 12. Nov 21

markov decision process

ongoing problems:

img

comparing policies:

defn. the discounted reward/return is $G(S_0)=\sum_{t=0}^\infty\gamma^tR(S_t)$

variations of MDP:

eg. robot escapes the grid

img

the definition of reward function can have impact on the optimal policy

expected utility of policy

let

we can obtain the optimal policy using the expected utility

  1. compute the action-value function: $Q^*(s,a)=\sum_{s'}P(s'|s,a)V^*(s')$
  2. in state s, choose an action that maximizes it: $\pi^*(s)=\argmax_a Q^*(s,a)$

eg. given the values of $V^*$ for $\gamma=1$, what is the optimal action for state $s_{13}$?
img

value iteration

we know

combining them, we get $|S|$ bellman equations in $|S|$ unknowns:

$$V^*(s)=R(s)+\gamma\max_a\sum_{s'\in N(s)}P(s'|s,a)V^*(s')$$

they are nonlinear equations because of 'max'. there is no poly time algo.

algo. (value iteration)

  1. start with arbitrary initial values $V_0(s)$
  2. at the ith iteration, compute $V_{i+1}(s)\leftarrow R(s)+\gamma\max_a\sum_{s'\in N(s)}P(s'|s,a)V_i(s')$
  3. terminate when $\max_s|V_i(s)-V_{i+1}(s)|$ is small enough

the $V_i$'s are guaranteed to converge to the optimal values if we iterate infinitely often.

each state accumulates negative rewards until algo finds a path to the +1 goal node.

we can update $V^*(s)$ for all states s by:

note: don't need to compute V for terminal nodes.

policy iteration

we do not need accurate estimates of the utility function to obtain an optimal policy.

alternates between two steps:

  1. policy evaluation: given a policy $\pi_i$, compute $V^{\pi_i}(s)$
  2. policy improvement: compute a new policy $\pi_{i+1}$ using $V^{\pi_i}$
  3. terminate when there is no change in the policy

algo. (policy iteration)

  1. start with a random policy
  2. evaluate the policy: $V^{\pi_i}(s)=R(s)+\gamma\sum_{s'}P(s'|s,\pi_i(s))V^{\pi_i}(s')$
  3. improve the policy: $\pi_{i+1}(s)=\argmax_a Q^{\pi_i}(s,a)=\argmax_a\sum_{s'}P(s'|s,a)V^{\pi_i}(s')$
  4. terminate when there is no changes in policy

it must terminate since there are finitely many policies and each step yields a better one.

to solve the policy evaluation step we can also approximate by performing a number of simplified value iteration steps:

  1. for each step j, do $V_{j+1}(s)\leftarrow R(s)+\gamma\sum_{s'}P(s'|s,\pi(s))V_j(s')$.

eg. apply policy iteration to the given grid. assume $\gamma=1$. the agent moves in the intended direction with prob 0.8, and turns left or right with prob 0.1 each.

+-----+-----+
|-0.04| +1  |
+-----+-----+
|-0.04| -1  |
+-----+-----+

initial policy is π1(s) = right for all s. then we have

V(s11) = -.04 + 1(.8(1) + .1V(s11) + .1V(s21))
V(s21) = -.04 + 1(.8(-1) + .1V(s11) + .1V(s21))
=> V(s11) = .75, V(s21) = -.85

π2(s11) = argmax{
    .8(1) + .1(.75) + .1(-.85),  // right
    .8(.75) + .1(.75) + .1(1),   // up
    .8(.75) + .1(-.85) + .1(.75),// left
    .8(-.85) + .1(.75) + .1(1)   // down
} = right
π2(s21) = argmax{...} = up

to use iterative approach to solve for V:

let V(s11)_0 = V(s21)_0 = 0

V(s11)_1 = -.04 + 1(.8(1) + .1(0) + .1(0)) = .76
V(s21)_1 = -.04 + 1(.8(-1) + .1(0) + .1(0)) = -.84
V(s11)_2 = -.04 + 1(.8(1) + .1(.76) + .1(-.84))
V(s21)_2 = ...

Week 13. Nov 29

reinforcement learning

consider fully observable, single-agent reinforcement learning. it is a markov decision process:

challenges in reinforcement learning:

passive adaptive dynamic programming

simplified case: passive learning agent

algo. (passive ADP)

  1. follow policy $\pi$ and generate an experience $\langle s,a,s',r'\rangle$
  2. update the reward function: $R(s')=r'$
  3. update the transition probability:
  4. derive $V^\pi(s)$ using the bellman equations: $V(s)=R(s)+\gamma\sum_{s'}P(s'|s,\pi(s))V(s')$
  5. repeat

active adaptive dynamic programming

each step, agent can

3 strategies for exploration & exploitation:

for optimistic utility values, we learn $V^+(s)$ (the optimistic estimate of $V(s)$):

$$V^+(s)\leftarrow R(s)+\gamma\max_a f\left(\sum_{s'}P(s'|s,a)V^+(s'),N(s,a)\right)$$

$$f(u,n)=\begin{cases} R^+,&n<N_e\\ u,&\text{else} \end{cases}$$

where

algo. (active ADP)

  1. initialize $R(s),V^+(s),N(s,a),M(s,a,s')$
  2. repeat until we have visited each $(s,a)$ at least $N_e$ times and $V^+$ converges
    1. determine the best action a for the current state s using $V^+(s)$

      $$a=\argmax_a f\left(\sum_{s'}P(s'|s,a)V^+(s'),N(s,a)\right)$$

      • or a random action (ε-greedy) / softmax
    2. take action $a$ and generate an experience $\langle s,a,s',r'\rangle$
    3. update the reward function $R(s')\leftarrow r'$
    4. update the transition probability:
      • $N(s,a)=N(s,a)+1$
      • $M(s,a,s')=M(s,a,s')+1$
      • $P(s'|s,a)=\frac{M(s,a,s')}{N(s,a)}$
    5. update $V^+(s)\leftarrow R(s)+\gamma\max_a f\left(\sum_{s'}P(s'|s,a)V^+(s'),N(s,a)\right)$

eg. assume $\gamma=.9,N_e=.9,R^+=5,R(s_{11})=R(s_{21})=-.04$,

+-----+-----+
|-0.04| +1  |
+-----+-----+
|-0.04| -1  |
+-----+-----+

observations:
N(s,a) = 5 for all s, a
N(s,a,s') = 3 for intended direction
          = 1 for any other direction with positive transition prob

current estimates:
V+(s11) = -.6573, V+(s21) = .9002

at first, f selects R+,

a = argmax{5,5,5,5}  => choose a random action, say a=down
suppose we experienced <s11,down,s21,-.04>
update R(s21) = -.04
       N(s11,down) = 5+1 = 6
       M(s11,down,s21) = 3+1 = 4
       P(s21|s11,down) = 4/6
       P(s12|s11,down) = 1/6
       P(s11|s11,down) = 1/6
       P(s22|s11,down) = 0/6
       V+(s11) = -.04 + .9 max{5,5,5,5} = 4.46
       V+(s21) = ... = 4.46

Q-learning

we can learn either

assume we observed $\langle s,a,s',r'\rangle$ and assume the transition always occurs ($P(s'|s,a)=1$); then we have the estimate $Q'(s,a)=R(s')+\gamma\max_{a'} Q(s',a')$. we define the temporal difference (TD) to be $Q'(s,a)-Q(s,a)$.

algo. (passive Q-learning)

  1. follow policy $\pi$ and generate an experience $\langle s,a,s',r'\rangle$
  2. update the reward function: $R(s')=r'$
  3. update Q(s,a) using the temporal difference update rule

    $$\begin{aligned} Q(s,a)&\leftarrow Q(s,a)+\alpha\left(R(s')+\gamma\max_{a'} Q(s',a')-Q(s,a)\right)\\ &=(1-\alpha)Q(s,a)+\alpha\left(R(s')+\gamma\max_{a'} Q(s',a')\right) \end{aligned}$$

  4. repeat

where $0<\alpha<1$ is the learning rate. if $\alpha$ decreases as N(s,a) increases, the Q values will converge to the optimal values, eg $\alpha(N(s,a))=\frac{10}{9+N(s,a)}$.

algo. (active Q-learning)

  1. initialize $R(s),Q(s,a),N(s,a),M(s,a,s')$
  2. repeat until we have visited each $(s,a)$ at least $N_e$ times and $Q(s,a)$ converges
    1. determine the best action a for the current state s

      $$a=\argmax_a f\left(Q(s,a),N(s,a)\right)$$

      • or a random action (ε-greedy) / softmax
    2. take action $a$ and generate an experience $\langle s,a,s',r'\rangle$
    3. update the reward function $R(s')\leftarrow r'$
    4. update Q(s,a) using the temporal difference update rule (greedy)

    $$Q(s,a)\leftarrow Q(s,a)+\alpha\left(R(s')+\gamma\max_{a'} Q(s',a')-Q(s,a)\right)$$

eg. how should we initialize $Q(s_T,a)$ for a terminal state $s_T$ and any action a? initialize to 0 since there is no action to take => no utility.

remark. we get the policy using $\pi(s)=\argmax_a Q(s,a)$

properties of Q-learning:

ADP vs Q-learning:

state-action-reward-state-action SARSA

in SARSA, the experience is $\langle s,a,r',s',a'\rangle$ (compared to $\langle s,a,s',r'\rangle$ in Q-learning).

algo. (SARSA)

  1. initialize $R(s),Q(s,a)$
  2. repeat until $Q(s,a)$ converges
    1. if starting a new episode, determine $a$ for the initial state $s_0$ using the current policy (determined by the exploration strategy). $s\leftarrow s_0$
    2. take action $a$ and generate an experience $\langle s,a,r',s'\rangle$
    3. update the reward function: $R(s')\leftarrow r'$
    4. determine the action $a'$ for state $s'$ using the current policy
    5. update $Q(s,a)$ using the temporal difference update rule

    $$Q(s,a)\leftarrow Q(s,a)+\alpha\left(R(s')+\gamma Q(s',a')-Q(s,a)\right)$$

    6. update $a\leftarrow a'$, $s\leftarrow s'$

Q-learning vs. SARSA

eg. in a high-stakes online learning problem (eg a self-driving car), would it be better to use Q-learning or SARSA? use SARSA, as it tends to learn less risky policies.