
Temporal-Difference (TD) Learning

here only TD(0)

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess

TD-Prediction

from [SuBa] 6.1 TD Prediction

General Rule: We move the estimate towards the target:

$$ \text{newEstimate} \leftarrow \text{oldEstimate} + \alpha (\text{Target} - \text{oldEstimate} ) $$
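
For example (made-up numbers): with $\alpha = 0.1$, an old estimate of $0.5$, and a target of $1.0$, the estimate moves one tenth of the way towards the target:

$$ 0.5 + 0.1 \cdot (1.0 - 0.5) = 0.55 $$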

Constant $\alpha$ MC-Update (every visit):

We write $s_t$ as an abbreviation for $S_t = s$ and $s'_{t+1}$ for $S_{t+1}=s'$.

$$ v(s_t) \leftarrow v(s_t) + \alpha \left(G_t - v(s_t)\right) $$

with

  • $G_t$: actual return following time $t$
  • $v(s_t)$: current value estimate of the state at time $t$

TD-Update for $v$:

Update value $v(s_t)$ towards the estimated return $R_{t+1} + \gamma\, v(s'_{t+1})$:

$$ v(s_t) \leftarrow v(s_t) + \alpha \left(R_{t+1} + \gamma\, v(s'_{t+1}) - v(s_t)\right) $$

with

  • $R_{t+1}$: reward received after time step $t$
  • $v(s_t)$: current value estimate of the state at time $t$

  • $R_{t+1} + \gamma\, v(s'_{t+1})$ is called the TD target

  • $\delta_t = R_{t+1} + \gamma\, v(s'_{t+1}) - v(s_t)$ is called the TD error

Bootstrapping: update is based partially on an existing estimate.

TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

In [4]:
import gym
env = gym.make('FrozenLake-v0')
env.action_space
[2016-11-25 10:59:55,816] Making new env: FrozenLake-v0
Out[4]:
Discrete(4)
In [5]:
env.observation_space
Out[5]:
Discrete(16)
In [7]:
env.step(0)
Out[7]:
(0, 0.0, False, {'prob': 0.3333333333333333})
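
To connect the TD(0) update above with the FrozenLake environment just created, here is a minimal sketch of tabular TD(0) policy evaluation for the uniformly random policy. The function name and parameters are illustrative and not part of the notebook:

import numpy as np

def td0_prediction(env, n_episodes=5000, alpha=0.05, gamma=1.0):
    # Tabular TD(0) evaluation of the uniformly random policy.
    v = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()              # fixed random policy
            new_s, reward, done, info = env.step(a)
            # TD(0) update: move v(s) towards the TD target R + gamma * v(s')
            v[s] = v[s] + alpha * (reward + gamma * v[new_s] - v[s])
            s = new_s
    return v

v = td0_prediction(env)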

Sarsa - On-Policy TD Control

SARSA: State-Action Reward State-Action

Read: [SuBa] 6.4 Sarsa: On-Policy TD Control

TD-Update for $q$:

$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma\, q\left(s'_{t+1}, a_{t+1}\right) - q(s_t, a_t) \right) $$

with

  • $R_{t+1}$: reward received after taking action $a_t$ in state $s_t$
  • $a_{t+1}$: next action, selected in $s'_{t+1}$ by the current ($\epsilon$-greedy) policy
  • $q(s_t, a_t)$: current action-value estimate of the state-action pair at time $t$

Bootstrapping: update is based partially on an existing estimate.

TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
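
The SARSA cell below relies on constants and helper functions defined earlier in the notebook but not shown in this excerpt (nb_episodes, STEPS, action_epsilon_greedy, get_action_epsilon_greedy, average_performance). A plausible reconstruction, matching the way they are called below; the exact values and bodies are assumptions:

import numpy as np

nb_episodes = 40000   # assumed number of training episodes
STEPS = 2000          # assumed evaluation interval (episodes per progress point)

def action_epsilon_greedy(q, s, epsilon=0.05):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if np.random.rand() > epsilon:
        return np.argmax(q[s])
    return np.random.randint(4)

def get_action_epsilon_greedy(epsilon):
    # Return a policy function pi(q, s) with the given epsilon baked in.
    return lambda q, s: action_epsilon_greedy(q, s, epsilon=epsilon)

def average_performance(policy_fct, q):
    # Average undiscounted return of policy_fct over a number of test episodes.
    acc_returns = 0.
    n = 500
    for _ in range(n):
        done = False
        s = env.reset()
        while not done:
            a = policy_fct(q, s)
            s, reward, done, info = env.step(a)
            acc_returns += reward
    return acc_returns / n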

In [77]:
# SARSA: On-policy TD control algorithm
def sarsa(alpha = 0.02, 
              gamma = 1., 
              epsilon = 0.05,
              q = None, 
              progress = None, 
              env=env):
    
    if q is None:
        q = np.ones((16, 4))  # optimistic initial q-values: 16 states x 4 actions
    
    for i in range(nb_episodes):
        done = False
        s = env.reset()
        a = action_epsilon_greedy(q, s, epsilon=epsilon)
        while not done:
            new_s, reward, done, info = env.step(a)
            new_a = action_epsilon_greedy(q, new_s, epsilon=epsilon)
            q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, new_a] - q[s, a])
            s = new_s
            a = new_a
        
        # only for plotting the performance, not part of the algorithm 
        if progress is not None and i%STEPS == 0:
            progress[i//STEPS] = average_performance(get_action_epsilon_greedy(epsilon), q=q)
    return q, progress
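
A call along the following lines (not shown in this excerpt) would produce the sarsa_performance array plotted further below; the parameter values are assumptions:

sarsa_performance = np.zeros(nb_episodes // STEPS)
q, sarsa_performance = sarsa(alpha=0.02, gamma=1., epsilon=0.05, progress=sarsa_performance)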
In [159]:
plot_performance(param_performance)

As you can see, the larger $\epsilon$ is, the faster the learning progresses. But the larger $\epsilon$ is, the lower the average reward per episode.

A hack could be to switch to the greedy policy at the end. But keep in mind that we have not estimated the q-values of the greedy policy, so this can be problematic in more elaborate environments.

In [86]:
plt.plot(STEPS*np.arange(nb_episodes//STEPS), sarsa_performance)
plt.xlabel("epochs")
plt.title("Learning progress for SARSA")
plt.ylabel("average reward of an epoch")
Out[86]:
<matplotlib.text.Text at 0x10b015a90>
In [90]:
def greedy_policy(q, s):
    return np.argmax(q[s])
print(average_performance(greedy_policy, q=q)) 
0.838

Q-Learning - Off-Policy TD Control

Read [SuBa] 6.5 Q-Learning: Off-Policy TD Control

Update rule:

$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma \max_{a'} q\left(s'_{t+1}, a'\right) - q(s_t, a_t) \right) $$

Here:

  • estimation policy is greedy
  • behaviour policy is $\epsilon$-greedy
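
The Q-learning cell below reuses globals from earlier cells (q, alpha, gamma, nb_episodes, STEPS, q_performance). A plausible setup; the concrete values are assumptions:

q = np.ones((16, 4))                             # optimistic initial action values
alpha, gamma = 0.02, 1.                          # step size and discount factor
q_performance = np.zeros(nb_episodes // STEPS)   # greedy performance per evaluation point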
In [73]:
# Q-Learning: Off-policy TD control algorithm
for i in range(nb_episodes):
    done = False
    s = env.reset()
    while not done:
        a = action_epsilon_greedy(q, s) # behaviour policy 
        new_s, reward, done, info = env.step(a)
        a_max = np.argmax(q[new_s]) # estimation policy 
        q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, a_max] - q[s, a])
        s = new_s
        
        
    # for plotting the performance    
    if i%STEPS == 0:
        q_performance[i//STEPS] = average_performance(greedy_policy, q)
In [91]:
average_performance(greedy_policy, q)
Out[91]:
0.828

Bias/Variance Trade-Off

  • return $G_t = R_{t+1}+ \gamma R_{t+2} + \dots + \gamma^{T-t-1}R_T$ is an unbiased estimate of $v_\pi(S_t)$
  • true TD target $R_{t+1} + \gamma v_\pi (S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$
  • TD target $R_{t+1} + \gamma V(S_{t+1})$ (using the current estimate $V$) is a biased estimate of $v_\pi(S_t)$
  • TD target is much lower variance than the return:
    • return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments

Advantages and Disadvantages of MC vs. TD (2)

  • MC has high variance, zero bias
    • good convergence properties
    • (even with function approximation)
    • not very sensitive to initial value
    • very simple to understand and use
  • TD has low variance, some bias
    • usually more efficient than MC
    • TD(0) converges to $v_\pi(s)$
    • (but not always with function approximation)
    • more sensitive to initial values

Certainty Equivalence

  • MC converges to solution with minimum mean-squared error
    • best fit to the observed returns $$ \sum_{k=1}^K \sum_{t=1}^{T_k} \left(g_t^k - V(s_t^k)\right)^2 $$
  • TD(0) converges to solution of max likelihood Markov model
    • solution to the MDP $\left< \mathcal S, \mathcal A, \hat{\mathcal P}, \hat{\mathcal R}, \gamma \right>$ that best fits the data (a code sketch follows below): $$ \hat{\mathcal P}^a_{s,s'} = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbb 1(s_t^k, a_t^k, s_{t+1}^k = s, a, s') $$ $$ \hat{\mathcal R}^a_{s} = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbb 1(s_t^k, a_t^k = s, a)\, r_t^k $$
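
A small sketch of how this maximum-likelihood model could be estimated from recorded episodes. The episode format (lists of (s, a, r, s') tuples) and all names are assumptions for illustration:

import numpy as np

def estimate_model(episodes, n_states=16, n_actions=4):
    # Empirical MDP model from episodes given as lists of (s, a, r, s') tuples.
    counts = np.zeros((n_states, n_actions, n_states))   # visit counts N(s, a, s')
    rewards = np.zeros((n_states, n_actions))            # accumulated rewards per (s, a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            rewards[s, a] += r
    n_sa = counts.sum(axis=2)                            # N(s, a)
    visited = n_sa > 0
    P_hat = np.zeros_like(counts)
    P_hat[visited] = counts[visited] / n_sa[visited, None]
    R_hat = np.zeros_like(rewards)
    R_hat[visited] = rewards[visited] / n_sa[visited]
    return P_hat, R_hat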

Advantages and Disadvantages of MC vs. TD (3)

  • TD exploits the Markov property
    • usually more efficient in Markov environments
  • MC does not exploit the Markov property
    • usually more effective in non-Markov environments

Bootstrapping and Sampling

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling:
    • MC samples
    • DP does not sample
    • TD samples

Literature:

  • [SuBa] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. MIT Press.