Lecture 6: Model-Free Control

Author: Mauro Comi

Reference: Chapter 6 of Sutton and Barto.

Model-free prediction: estimate the value function of an unknown MDP.

Example: the tabular TD-learning update:

$v_{t+1}(S_t) = v_t(S_t) + \alpha(R_{t+1} + \gamma v_t(S_{t+1}) - v_t(S_t))$
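
As a minimal sketch, the update above can be written as a single function over a dictionary of state values. The function name `td0_update`, the default step size and discount, and the state labels `"A"` and `"B"` are assumptions made for illustration, not part of the lecture.

```python
from collections import defaultdict


def td0_update(v, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD update to v in place and return the TD error."""
    # TD error: R_{t+1} + gamma * v(S_{t+1}) - v(S_t)
    td_error = reward + gamma * v[next_state] - v[state]
    # Move v(S_t) a fraction alpha towards the TD target
    v[state] += alpha * td_error
    return td_error


# Hypothetical transition: from state "A", receive reward 1.0, land in state "B".
v = defaultdict(float)  # all state values start at 0
delta = td0_update(v, "A", 1.0, "B")
```

Only the value of the visited state changes; every other entry in the table is left untouched.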

Model-free control: optimise the value function of an unknown MDP.

🟧 Recap

We have seen a few methods to optimise the value function.

In short:

1. Start with a random policy $\pi$ and value function $V$.
2. Policy evaluation: compute the state values $V$ under policy $\pi$ until convergence.
3. Policy improvement: replace $\pi_{old}$ with the policy $\pi_{new}$ that is greedy with respect to $V$.
4. Repeat policy evaluation and policy improvement until the policy no longer changes.
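
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the lecture's implementation: it assumes the model is known and given as arrays `P[a, s, s']` (transition probabilities) and `R[s, a]` (expected rewards), it starts from an arbitrary all-zeros policy rather than a random one, and the two-state MDP at the bottom is made up for the example.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Policy iteration for a known MDP.

    P[a, s, t] is the probability of moving from s to t under action a.
    R[s, a] is the expected immediate reward for taking a in s.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy (assumption)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: repeat the Bellman expectation backup until V settles
        while True:
            V_new = np.array([
                R[s, policy[s]] + gamma * P[policy[s], s] @ V
                for s in range(n_states)
            ])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V  # no change in the policy: stop
        policy = new_policy


# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
# Staying in state 1 pays reward 1; everything else pays 0.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
R = np.array([[0.0, 0.0], [1.0, 0.0]])
policy, V = policy_iteration(P, R)
```

On this toy problem the loop settles on switching out of state 0 and staying in state 1. Note that both steps rely on `P` and `R`, which is exactly what is unavailable in the model-free setting.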

Model-Free Policy Evaluation

Model-Free Policy Iteration

The greedy policy is simply estimated as: