Black-Box-Policy-Optimization slides

Black Box Policy Gradient

Here we demonstrate how the policy can be learned directly with the cross-entropy method (a kind of evolutionary algorithm), as suggested in the OpenAI Deep Reinforcement Learning Tutorial.

The core idea of the cross-entropy method is similar to evolution strategies [Wie14]:

  • maintain a search distribution,
  • sample from the search distribution,
  • and evaluate the samples to update the search distribution, iteratively.

Here we use as search distribution a multivariate Gaussian with a diagonal covariance matrix. To update the search distribution, the best offspring (the elite) are used to compute the new parameters of the Gaussian (the mean and the variance).
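
To make these three steps concrete, here is a minimal sketch of the cross-entropy loop on a toy two-dimensional objective (toy_fitness, the population size of 100 and the elite size of 20 are arbitrary choices for illustration):

import numpy as np

def toy_fitness(x):
    # higher is better; the optimum is at (1, -2)
    return -np.sum((x - np.array([1., -2.]))**2)

mu, var = np.zeros(2), np.ones(2) * 5.  # search distribution N(mu, diag(var))
for _ in range(30):
    samples = np.random.randn(100, 2) * np.sqrt(var) + mu  # sample offspring
    fitness = np.array([toy_fitness(s) for s in samples])  # evaluate each offspring
    elite = samples[np.argsort(fitness)[::-1][:20]]        # keep the 20 best
    mu, var = elite.mean(axis=0), elite.var(axis=0)        # refit the Gaussian
# after the loop, mu should be close to the optimum (1, -2)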

Cart Pole Problem

In [2]:
import numpy as np
import numpy.testing as npt

import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
import gym
env = gym.make('CartPole-v0')
[2017-05-12 09:25:17,431] Making new env: CartPole-v0
In [4]:
env.observation_space
Out[4]:
Box(4,)
In [5]:
observation = env.reset()
#[position of cart, velocity of cart, angle of pole, rotation rate of pole]
observation
Out[5]:
array([-0.0258769 , -0.04715221,  0.01703425, -0.03715459])
In [6]:
env.action_space
Out[6]:
Discrete(2)

The action space is one-dimensional with two discrete actions $a \in \{0, 1\}$.

We use a log-linear model (as in logistic regression) for the probability of action $a=1$ given state $\vec s$, parametrized by $\theta = \{ b, \vec w\}$:

$$ p(a=1 \mid \vec s; \theta) = \frac{1}{1+\exp\left(-(\vec s \cdot \vec w + b)\right)} $$

The probability of action $a=0$ is $p(a=0 \mid \vec s; \theta) = 1-p(a=1 \mid \vec s; \theta)$.

In [7]:
def logistic_function(z):
    return 1/(1 + np.exp(-z))
In [8]:
plt.figure(figsize=(3, 1.5))
z = np.arange(-5, 5, 0.1)
plt.plot(z, logistic_function(z), 'b-')
plt.title("logistic function")
plt.xlabel('z')
plt.ylabel('f(z)')
plt.ylim(-0.1, 1.1)
Out[8]:
(-0.1, 1.1)

Sampling the action for an observation (state) and a fixed $\theta$ from $p(a \mid \vec s; \theta)$:

In [9]:
def get_action(theta, observation):
    # probability of action a=1; theta[0] is the bias b, theta[1:] the weights w
    p = logistic_function(np.dot(theta[1:], observation) + theta[0])
    # sample the action from a Bernoulli distribution with parameter p
    return np.random.binomial(n=1, p=p, size=1)[0]
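
As a quick sanity check (purely illustrative): with any theta of length 5 (bias plus four weights) the function should always return 0 or 1; theta only changes the probability of the two actions.

theta_test = np.random.randn(5)  # arbitrary parameters, just for the check
obs_test = env.reset()
actions = [get_action(theta_test, obs_test) for _ in range(10)]
assert set(actions) <= {0, 1}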

Implementation of Algorithm 1: Cross Entropy Method:

Note: The environment stops automatically after 200 steps.

In [13]:
def optimize_policy(nb_generations = 200, 
                    nb_offsprings = 250,
                    nb_elite = 50, 
                    n=5):
    
    # start with a random mean and random diagonal covariance matrix
    mu = np.random.rand(n) * 10. - 5.
    cov = np.eye(n) * np.random.rand(n) * 10.
    
    for ii in range(nb_generations):
        # Draw random samples for the offsprings from a multivariate normal distribution
        offsprings = np.random.multivariate_normal(mean=mu, cov = cov, size=nb_offsprings)
        # the returns (fitness values) for each offspring
        returns = np.zeros(nb_offsprings)
        for i, theta in enumerate(offsprings):
            done = False
            observation  = env.reset()
            return_ = 0
            # run the agent in the environment to get the corresponding return (fitness value)
            while not done:
                action = get_action(theta, observation)
                observation, reward, done, info = env.step(action)
                return_ += reward
            returns[i] = return_     

        # sort the offspring according to their fitness values
        # and compute the new Gaussian parameters (mu and cov) from the
        # best offspring
        rang = np.argsort(returns)[::-1]
        mu = offsprings[rang[:nb_elite]].mean(axis=0)
        cov = np.eye(n) * offsprings[rang[:nb_elite]].var(axis=0)
    
        #print progress
        #if ii%10 == 0: 
        #    print (ii, returns[rang[:nb_elite]].mean())
            
        if returns[rang[:nb_elite]].mean() >= 200.:
            return mu

    return mu
            
theta = optimize_policy()
In [14]:
theta
Out[14]:
array([ 0.15929439, -0.59793475,  1.06206644,  4.38717968,  3.77750642])
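
To judge the learned parameters more reliably than from a single rendered episode, the return can be averaged over several episodes. The following evaluation snippet is only a sketch (20 episodes is an arbitrary choice; 200 is the maximum return of CartPole-v0):

# mean return of the learned policy over 20 evaluation episodes
eval_returns = []
for _ in range(20):
    observation = env.reset()
    done, return_ = False, 0.
    while not done:
        observation, reward, done, info = env.step(get_action(theta, observation))
        return_ += reward
    eval_returns.append(return_)
print("mean return over 20 episodes:", np.mean(eval_returns))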

Use the following code snippet to watch the learned policy control the cart by rendering the environment:

In [15]:
observation  = env.reset()
for t in range(5000):
    env.render()
    action = get_action(theta, observation)
    #print ("random action:", action)
    observation, reward, done, info = env.step(action)
    # observation =  
    #  [position of cart, velocity of cart, angle of pole, angular velocity of pole]
    #print (observation, reward, done, info)
    if done:
        print("Episode finished after {} timesteps".format(t+1))
        break
        
env.render(close=True)    
Episode finished after 116 timesteps

Open Questions / Exercises

  • Try other kinds of evolutionary algorithms, e.g. evolution strategies [Wie14], and compare the results.
  • Does it help to fit the full covariance matrix instead of only its diagonal elements? (See the sketch below.)
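
For the second question, a possible starting point is to replace the diagonal update inside optimize_policy by an estimate of the full covariance matrix; the following lines are only a sketch and reuse the variable names from above:

# full-covariance update of the elite (reuses the variables of optimize_policy)
elite = offsprings[rang[:nb_elite]]
mu = elite.mean(axis=0)
cov = np.cov(elite, rowvar=False)  # full covariance instead of only the diagonal
cov += 1e-6 * np.eye(n)            # small ridge term to keep cov positive definite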

Literature: