
Temporal-Difference (TD) Learning

here only TD(0)

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess

TD-Prediction

from [SuBa] 6.1 TD Prediction

General Rule: We move the estimate towards the target:

$$ \text{newEstimate} \leftarrow \text{oldEstimate} + \alpha (\text{Target} - \text{oldEstimate} ) $$
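
For example (made-up numbers): with $\alpha = 0.1$, an old estimate of $0.5$, and a target of $1.0$, the estimate moves one tenth of the way towards the target:

$$ 0.5 + 0.1 \cdot (1.0 - 0.5) = 0.55 $$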

Constant $\alpha$ MC-Update (every visit):

We write $s_t$ as an abbreviation for $S_t = s$ and $s'_{t+1}$ for $S_{t+1}=s'$.

$$ v(s_t) \leftarrow v(s_t) + \alpha \left(G_t - v(s_t)\right) $$

with

  • $G_t$: actual return following time $t$
  • $v(s_t)$: current value estimate of the state at time $t$

TD-Update for $v$:

Update value $v(s_t)$ towards the estimated return $R_{t+1} + \gamma\, v(s'_{t+1})$:

$$ v(s_t) \leftarrow v(s_t) + \alpha \left(R_{t+1} + \gamma\, v(s'_{t+1}) - v(s_t)\right) $$

with

  • $R_{t+1}$: reward received after time step $t$
  • $v(s_t)$: current value estimate of the state at time $t$

  • $R_{t+1} + \gamma\, v(s'_{t+1})$ is called the TD target

  • $\delta_t = R_{t+1} + \gamma\, v(s'_{t+1}) - v(s_t)$ is called the TD error

Bootstrapping: update is based partially on an existing estimate.

TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

In [4]:
import gym
env = gym.make('FrozenLake-v0')
env.action_space
[2016-11-25 10:59:55,816] Making new env: FrozenLake-v0
Out[4]:
Discrete(4)
In [5]:
env.observation_space
Out[5]:
Discrete(16)
In [7]:
env.step(0)
Out[7]:
(0, 0.0, False, {'prob': 0.3333333333333333})
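
To connect the TD(0) update above with the FrozenLake environment just created, here is a minimal sketch of tabular TD(0) policy evaluation for the uniformly random policy. The function name and parameters are illustrative and not part of the notebook:

import numpy as np

def td0_prediction(env, n_episodes=5000, alpha=0.05, gamma=1.0):
    # Tabular TD(0) evaluation of the uniformly random policy.
    v = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()              # fixed random policy
            new_s, reward, done, info = env.step(a)
            # TD(0) update: move v(s) towards the TD target R + gamma * v(s')
            v[s] = v[s] + alpha * (reward + gamma * v[new_s] - v[s])
            s = new_s
    return v

v = td0_prediction(env)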

Sarsa - On-Policy TD Control

SARSA: State-Action Reward State-Action

Read: [SuBa] 6.4 Sarsa: On-Policy TD Control

TD-Update for $q$:

$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma\, q\left(s'_{t+1}, a_{t+1}\right) - q(s_t, a_t) \right) $$

with

  • $R_{t+1}$: reward received after taking action $a_t$ in state $s_t$
  • $a_{t+1}$: next action, selected in $s'_{t+1}$ by the current ($\epsilon$-greedy) policy
  • $q(s_t, a_t)$: current action-value estimate of the state-action pair at time $t$

Bootstrapping: update is based partially on an existing estimate.

TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
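
The SARSA cell below relies on constants and helper functions defined earlier in the notebook but not shown in this excerpt (nb_episodes, STEPS, action_epsilon_greedy, get_action_epsilon_greedy, average_performance). A plausible reconstruction, matching the way they are called below; the exact values and bodies are assumptions:

import numpy as np

nb_episodes = 40000   # assumed number of training episodes
STEPS = 2000          # assumed evaluation interval (episodes per progress point)

def action_epsilon_greedy(q, s, epsilon=0.05):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if np.random.rand() > epsilon:
        return np.argmax(q[s])
    return np.random.randint(4)

def get_action_epsilon_greedy(epsilon):
    # Return a policy function pi(q, s) with the given epsilon baked in.
    return lambda q, s: action_epsilon_greedy(q, s, epsilon=epsilon)

def average_performance(policy_fct, q):
    # Average undiscounted return of policy_fct over a number of test episodes.
    acc_returns = 0.
    n = 500
    for _ in range(n):
        done = False
        s = env.reset()
        while not done:
            a = policy_fct(q, s)
            s, reward, done, info = env.step(a)
            acc_returns += reward
    return acc_returns / n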

In [77]:
# SARSA: On-policy TD control algorithm
def sarsa(alpha = 0.02, 
              gamma = 1., 
              epsilon = 0.05,
              q = None, 
              progress = None, 
              env=env):
    
    if q is None:
        q = np.ones((16, 4))  # optimistic initial q-values: 16 states x 4 actions
    
    for i in range(nb_episodes):
        done = False
        s = env.reset()
        a = action_epsilon_greedy(q, s, epsilon=epsilon)
        while not done:
            new_s, reward, done, info = env.step(a)
            new_a = action_epsilon_greedy(q, new_s, epsilon=epsilon)
            q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, new_a] - q[s, a])
            s = new_s
            a = new_a
        
        # only for plotting the performance, not part of the algorithm 
        if progress is not None and i%STEPS == 0:
            progress[i//STEPS] = average_performance(get_action_epsilon_greedy(epsilon), q=q)
    return q, progress
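
A call along the following lines (not shown in this excerpt) would produce the sarsa_performance array plotted further below; the parameter values are assumptions:

sarsa_performance = np.zeros(nb_episodes // STEPS)
q, sarsa_performance = sarsa(alpha=0.02, gamma=1., epsilon=0.05, progress=sarsa_performance)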
In [159]:
plot_performance(param_performance)

As you can see, the larger $\epsilon$ is, the faster the learning progresses. But the larger $\epsilon$ is, the lower the average reward per episode.

A hack could be to switch to the greedy policy at the end. But keep in mind that we have not estimated the q-values of the greedy policy, so this can be problematic in more elaborate environments.

In [86]:
plt.plot(STEPS*np.arange(nb_episodes//STEPS), sarsa_performance)
plt.xlabel("epochs")
plt.title("Learning progress for SARSA")
plt.ylabel("average reward of an epoch")
Out[86]:
<matplotlib.text.Text at 0x10b015a90>
In [90]:
def greedy_policy(q, s):
    return np.argmax(q[s])
print(average_performance(greedy_policy, q=q)) 
0.838

Q-Learning - Off-Policy TD Control

Read [SuBa] 6.5 Q-Learning: Off-Policy TD Control

Update rule:

$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma \max_{a'} q\left(s'_{t+1}, a'\right) - q(s_t, a_t) \right) $$

Here:

  • estimation policy is greedy
  • behaviour policy is $\epsilon$-greedy
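
The Q-learning cell below reuses globals from earlier cells (q, alpha, gamma, nb_episodes, STEPS, q_performance). A plausible setup; the concrete values are assumptions:

q = np.ones((16, 4))                             # optimistic initial action values
alpha, gamma = 0.02, 1.                          # step size and discount factor
q_performance = np.zeros(nb_episodes // STEPS)   # greedy performance per evaluation point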
In [73]:
# Q-Learning: Off-policy TD control algorithm
for i in range(nb_episodes):
    done = False
    s = env.reset()
    while not done:
        a = action_epsilon_greedy(q, s) # behaviour policy 
        new_s, reward, done, info = env.step(a)
        a_max = np.argmax(q[new_s]) # estimation policy 
        q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, a_max] - q[s, a])
        s = new_s
        
        
    # for plotting the performance    
    if i%STEPS == 0:
        q_performance[i//STEPS] = average_performance(greedy_policy, q)
In [91]:
average_performance(greedy_policy, q)
Out[91]:
0.828

Bias/Variance Trade-Off

  • return $G_t = R_{t+1}+ \gamma R_{t+2} + \dots + \gamma^{T-t-1}R_T$ is an unbiased estimate of $v_\pi(S_t)$
  • true TD target $R_{t+1} + \gamma v_\pi (S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$
  • TD target $R_{t+1} + \gamma V(S_{t+1})$ (using the current estimate $V$) is a biased estimate of $v_\pi(S_t)$
  • TD target is much lower variance than the return:
    • return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments

Advantages and Disadvantages of MC vs. TD (2)

  • MC has high variance, zero bias
    • good convergence properties
    • (even with function approximation)
    • not very sensitive to initial value
    • very simple to understand and use
  • TD has low variance, some bias
    • usually more efficient than MC
    • TD(0) converges to $v_\pi(s)$
    • (but not always with function approximation)
    • more sensitive to initial values

Certainty Equivalence

  • MC converges to solution with minimum mean-squared error
    • best fit to the observed returns $$ \sum_{k=1}^K \sum_{t=1}^{T_k} \left(g_t^k - V(s_t^k)\right)^2 $$
  • TD(0) converges to solution of max likelihood Markov model
    • solution to the MDP $\left< \mathcal S, \mathcal A, \hat{\mathcal P}, \hat{\mathcal R}, \gamma \right>$ that best fits the data (a code sketch follows below): $$ \hat{\mathcal P}^a_{s,s'} = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbb 1(s_t^k, a_t^k, s_{t+1}^k = s, a, s') $$ $$ \hat{\mathcal R}^a_{s} = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbb 1(s_t^k, a_t^k = s, a)\, r_t^k $$
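
A small sketch of how this maximum-likelihood model could be estimated from recorded episodes. The episode format (lists of (s, a, r, s') tuples) and all names are assumptions for illustration:

import numpy as np

def estimate_model(episodes, n_states=16, n_actions=4):
    # Empirical MDP model from episodes given as lists of (s, a, r, s') tuples.
    counts = np.zeros((n_states, n_actions, n_states))   # visit counts N(s, a, s')
    rewards = np.zeros((n_states, n_actions))            # accumulated rewards per (s, a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            rewards[s, a] += r
    n_sa = counts.sum(axis=2)                            # N(s, a)
    visited = n_sa > 0
    P_hat = np.zeros_like(counts)
    P_hat[visited] = counts[visited] / n_sa[visited, None]
    R_hat = np.zeros_like(rewards)
    R_hat[visited] = rewards[visited] / n_sa[visited]
    return P_hat, R_hat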

Advantages and Disadvantages of MC vs. TD (3)

  • TD exploits the Markov property
    • usually more efficient in Markov environments
  • MC does not exploit the Markov property
    • usually more effective in non-Markov environments

Bootstrapping and Sampling

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling:
    • MC samples
    • DP does not sample
    • TD samples

Literature:

  • [SuBa] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. MIT Press.