The weights of the different neurons in a fully connected layer must be initialized differently. Otherwise all neurons in a layer receive the same input and the same gradient, so they all behave identically.
Random initialization of the weights is the usual way to achieve this so-called symmetry breaking.
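The effect of an identical initialization can be checked numerically. The following is a minimal sketch, assuming a toy network with one tanh hidden layer, a single linear output unit and a squared-error loss. The layer sizes, the constant `0.5`, the target `y` and the helper name `hidden_weight_gradient` are illustrative choices and not part of the text above.

```python
import numpy as np

def hidden_weight_gradient(W1, w2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 with respect to the hidden weights W1.

    Assumed toy network: a = tanh(W1 x), y_hat = w2 . a
    """
    z = W1.dot(x)                                   # pre-activations of the hidden units
    a = np.tanh(z)                                  # hidden activations
    y_hat = w2.dot(a)                               # linear output unit
    delta_out = y_hat - y                           # error of the output unit
    delta_hidden = (1. - a**2) * (w2 * delta_out)   # back-propagated error
    return np.outer(delta_hidden, x)                # one row per hidden unit

rng = np.random.RandomState(0)
x = rng.normal(size=4)
y = 2.0

# identical initialization: every hidden unit starts with the same weights
W1_same = np.full((3, 4), 0.5)
w2_same = np.full(3, 0.5)
grad_same = hidden_weight_gradient(W1_same, w2_same, x, y)
print(grad_same)    # all rows are identical

# random initialization: the rows differ, the symmetry is broken
W1_rand = rng.uniform(-0.5, 0.5, size=(3, 4))
w2_rand = rng.uniform(-0.5, 0.5, size=3)
grad_rand = hidden_weight_gradient(W1_rand, w2_rand, x, y)
print(grad_rand)
```

With identical weights every hidden unit receives the same update, so the units stay identical after each gradient step.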
To examine the influence of the non-linear activation functions $g$, note that the error responsibilities for the hidden units are given by (see here):
$$ \vec \delta^{(l)} = g'(\vec z^{(l)}) \circ ( (\theta^{(l)})^T \cdot \vec \delta^{(l+1)} ) $$
So if $z^{(l)}$ is strongly negative or strongly positive, $g'(\vec z^{(l)})$ is nearly zero for sigmoid-shaped activation functions. The units are saturated.
With small or nearly zero $\delta$'s the learning (weight changes) will be very slow for gradient descent because of the learning rule:
$$ \theta^{(l)} \leftarrow \theta^{(l)} - \alpha (\vec a^{(l)} \times \vec \delta^{(l+1)}) $$
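The two formulas above can be sketched in NumPy to see how saturation shrinks the weight changes. This is only an illustration under stated assumptions: $\theta^{(l)}$ is taken to have the shape (units in layer $l+1$, units in layer $l$), the outer product is arranged to match that shape, and all numeric values as well as the helper names `backprop_delta` and `weight_update` are hypothetical.

```python
import numpy as np

def logistic(z):
    return 1. / (1. + np.exp(-z))

def logistic_prime(z):
    s = logistic(z)
    return s * (1. - s)

def backprop_delta(theta, delta_next, z):
    """delta^(l) = g'(z^(l)) * (theta^(l)^T . delta^(l+1)) for logistic g."""
    return logistic_prime(z) * theta.T.dot(delta_next)

def weight_update(theta, a, delta_next, alpha):
    """One gradient descent step; the outer product has the shape of theta."""
    return theta - alpha * np.outer(delta_next, a)

rng = np.random.RandomState(0)
theta = rng.uniform(-1., 1., size=(2, 3))   # assumed shape: (units in l+1, units in l)
delta_next = np.array([0.3, -0.2])          # hypothetical errors of layer l+1

z_moderate = np.array([0.5, -0.5, 0.0])     # units in their sensitive range
z_saturated = np.array([12., -12., 15.])    # saturated units

delta_moderate = backprop_delta(theta, delta_next, z_moderate)
delta_saturated = backprop_delta(theta, delta_next, z_saturated)
print("delta for moderate z :", delta_moderate)
print("delta for saturated z:", delta_saturated)

# the deltas of layer l drive the update of the weights feeding into layer l
a_prev = np.array([1., 0., 1., 1.])         # hypothetical activations of layer l-1
theta_prev = np.zeros((3, 4))
step_moderate = weight_update(theta_prev, a_prev, delta_moderate, alpha=0.1)
step_saturated = weight_update(theta_prev, a_prev, delta_saturated, alpha=0.1)
print(np.abs(step_moderate).max(), np.abs(step_saturated).max())
```

For the saturated units the weight change is several orders of magnitude smaller, although the error coming from the layer above is the same.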
To get a feeling for the range in which large derivatives can be expected, let's print the concrete values of $g'(z)$ for the common tanh and logistic activation functions:
import numpy as np
def logistic_function(z):
    return 1. / (1. + np.exp(-z))
# d(logistic(z))/dz = logistic(z) * (1. - logistic(z))
def derivative_logistic_function(z):
    return logistic_function(z) * (1. - logistic_function(z))
# d(tanh(z))/dz = 1 - tanh^2(z)
def derivative_tanh(z):
    return 1. - np.tanh(z)**2
# print both derivatives on a coarse grid of z-values
for z in np.linspace(-20., 20., 20):
    dt = derivative_tanh(z)
    dl = derivative_logistic_function(z)
    print("z:%f\t d(tanh(z))/dz: %e \t d(logistic(z))/dz: %e" % (z, dt, dl))
A graph says more than a bunch of numbers:
%matplotlib inline
import matplotlib.pyplot as plt
z = np.linspace(-20.,20,200)
dt = derivative_tanh(z)
dl = derivative_logistic_function(z)
fig = plt.figure(figsize=(16, 5))
ax1 = fig.add_subplot(121)
ax1.plot(z, dt, label="d(tanh)/dz")
ax1.plot(z, dl, label="d(logistic)/dz")
ax1.set_title("Derivatives of activation functions - linear y-axis")
ax1.set_xlabel("z")
ax1.set_ylabel("dg(z)/dz")
ax1.legend()
ax2 = fig.add_subplot(122)
ax2.plot(z, dt, label="d(tanh)/dz")
ax2.plot(z, dl, label="d(logistic)/dz")
ax2.set_title("Derivatives of activation functions - logarithmic y-axis")
ax2.legend()
ax2.set_xlabel("z")
ax2.set_ylabel("dg(z)/dz (log scale)")
ax2.set_yscale('log')
plt.show()
For random $z$ distributed symmetrically around 0, the mean output of the logistic function is positive ($0.5$). In contrast, the mean output of tanh is zero.
The $z$ values of the next layer are computed by $$ z_i^{(l+1)} = \sum_{j=0}^{n} \theta_{ij}^{(l)} a_j^{(l)} $$
Note that the weights $\theta_{ij}$ can be positive or negative. If they are drawn symmetrically around zero, the expectation value of $z_i$ is $0$ for the logistic function as well.
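Both statements about the means can be checked with a quick simulation. The sketch below assumes standard normal $z$-values and weights drawn uniformly from $[-1, 1]$; these distributions and the sample sizes are arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# z-values drawn symmetrically around zero (a standard normal is an assumption)
z = rng.normal(size=100000)
mean_logistic = np.mean(1. / (1. + np.exp(-z)))
mean_tanh = np.mean(np.tanh(z))
print("mean logistic(z):", mean_logistic)   # close to 0.5
print("mean tanh(z):    ", mean_tanh)       # close to 0

# logistic activations are all positive, but the weights are drawn
# symmetrically around zero, so the weighted sums still average to zero
fan_in = 200
a = 1. / (1. + np.exp(-rng.normal(size=(10000, fan_in))))
theta = rng.uniform(-1., 1., size=(10000, fan_in))
z_next = np.sum(theta * a, axis=1)
mean_z_next = np.mean(z_next)
print("mean of z in the next layer:", mean_z_next)
```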
Saturation of the units should be avoided, but if we initialize the weights randomly with large values, exactly this saturation occurs. To demonstrate this, let's assume that about half of the neurons are active ($a = 1$) and the other half are inactive ($a=0$).
The sum of the weighted inputs is then similar to a one-dimensional random walk, so we expect a Gaussian probability density of the $z$-values with zero mean.
def z_logistic(fan_in, sigma_w):
    # about half of the input units are active (a = 1), the others are inactive (a = 0)
    a = np.random.binomial(1, 0.5, size=fan_in)
    #a = np.random.uniform(low=0, high=1., size=fan_in)
    # initialize the weights randomly from a uniform distribution on [-sigma_w, sigma_w]
    w = np.random.uniform(-sigma_w, sigma_w, size=fan_in)
    return np.sum(a * w)
fan_in = 200
nb = 10000
#plt.hist(zs, bins=int(np.sqrt(nb)))
#plt.show()
def plot_pdensities(zs, g, derivative):
    # normalized histograms of the z-values, the activations g(z) and the derivatives g'(z)
    nb_bins = int(np.sqrt(len(zs)))
    fig = plt.figure(figsize=(16, 10))
    ax1 = fig.add_subplot(221)
    # density=True replaces the 'normed' argument, which was removed from Matplotlib
    ax1.hist(zs, bins=nb_bins, density=True)
    ax1.set_title("Probability density of z-values")
    ax1.set_xlabel("z")
    ax1.set_ylabel("p(z)")
    ax2 = fig.add_subplot(222)
    ax2.hist(g(zs), bins=nb_bins, density=True)
    ax2.set_title("Probability density of activity-values")
    ax2.set_xlabel("a")
    ax2.set_ylabel("p(a)")
    ax3 = fig.add_subplot(223)
    ax3.hist(derivative(zs), bins=nb_bins, density=True)
    ax3.set_title("Probability density of d(g(z))/dz")
    ax3.set_xlabel("dg(z)/dz")
    ax3.set_ylabel("p(dg(z)/dz)")
zs = np.empty(nb)
sigma_w = 1.
for i in range(nb):
    zs[i] = z_logistic(fan_in, sigma_w)
plot_pdensities(zs, logistic_function, derivative_logistic_function)
Because of the large standard deviation $\sigma$ of the Gaussian probability density, many neurons are saturated. The layer below contains many neurons, and we have not adapted the range of the uniform initialization distribution to that number.
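The width of this distribution can be estimated directly, assuming (as in `z_logistic`) independent activations $a_j \in \{0, 1\}$ with probability $0.5$ each and independent weights $w_j \sim U[-\sigma_w, \sigma_w]$. Since $E[w_j] = 0$, the mean of $z$ is zero, and the variances of the independent terms add up:
$$ \mathrm{Var}(z) = fan\_in \cdot E[a_j^2] \cdot E[w_j^2] = fan\_in \cdot \frac{1}{2} \cdot \frac{\sigma_w^2}{3} = \frac{fan\_in \cdot \sigma_w^2}{6} $$
For $fan\_in = 200$ and $\sigma_w = 1$ this gives a standard deviation of $\sqrt{200/6} \approx 5.8$. At $|z| = 5.8$ the derivative of the logistic function is only about $0.003$, compared to its maximum of $0.25$ at $z = 0$. Note that $\sigma_w$ denotes the half-width of the uniform distribution here, not its standard deviation.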
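A common heuristic for weight initialization that prevents saturation is [Glo10]
$$ W_{ij} \sim U \left[ - \frac{1}{\sqrt{fan\_in}}, \frac{1}{\sqrt{fan\_in}} \right] $$
with $fan\_in$ the number of incoming connections of a unit, i.e. the number of neurons in the layer below.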
This results in a much "nicer" distribution: most of the neurons no longer have vanishing derivatives.
sigma_w = 1./np.sqrt(fan_in)
for i in range(nb):
    zs[i] = z_logistic(fan_in, sigma_w)
plot_pdensities(zs, logistic_function, derivative_logistic_function)
The same argument holds for tanh units:
def z_tanh(fan_in, sigma_w):
    #a = np.random.binomial(1, 0.5, size=fan_in) * 2 - 1
    a = np.random.uniform(-1., 1., size=fan_in)
    w = np.random.uniform(-sigma_w, sigma_w, size=fan_in)
    return np.sum(a * w)
sigma_w = 1.
for i in range(nb):
    zs[i] = z_tanh(fan_in, sigma_w)
plot_pdensities(zs, np.tanh, derivative_tanh)
sigma_w = 1./np.sqrt(fan_in)
for i in range(nb):
    zs[i] = z_tanh(fan_in, sigma_w)
plot_pdensities(zs, np.tanh, derivative_tanh)
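The tanh experiment can also be summarized in numbers instead of plots. The following vectorized sketch uses the same model as `z_tanh`; for activations and weights that are both uniform, the variance is $fan\_in \cdot \frac{1}{3} \cdot \frac{\sigma_w^2}{3}$, so the expected standard deviation is $\sqrt{fan\_in} \cdot \sigma_w / 3$. The helper name `sample_z_tanh` and the threshold of $0.1$ for calling a unit saturated are assumptions made for this illustration.

```python
import numpy as np

def sample_z_tanh(fan_in, sigma_w, nb, rng):
    """Draw nb z-values at once; same model as z_tanh, but vectorized."""
    a = rng.uniform(-1., 1., size=(nb, fan_in))
    w = rng.uniform(-sigma_w, sigma_w, size=(nb, fan_in))
    return np.sum(a * w, axis=1)

rng = np.random.RandomState(0)
fan_in, nb = 200, 10000
results = []
for sigma_w in (1., 1. / np.sqrt(fan_in)):
    zs = sample_z_tanh(fan_in, sigma_w, nb, rng)
    # analytic value, from Var(z) = fan_in * sigma_w**2 / 9
    expected_std = np.sqrt(fan_in) * sigma_w / 3.
    # fraction of units whose derivative is below 0.1 (the threshold is an arbitrary choice)
    saturated = np.mean(1. - np.tanh(zs)**2 < 0.1)
    results.append((np.std(zs), expected_std, saturated))
    print("sigma_w=%.4f  std(z)=%.3f  expected=%.3f  saturated fraction=%.3f"
          % (sigma_w, np.std(zs), expected_std, saturated))
```

With $\sigma_w = 1$ most units fall below the threshold, while with the scaled range almost none do.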
"For hidden units the logistic sigmoid activation is unsuited" [Glo10], see also [Le98].
TODO: Why? Judging by the derivatives of the activation functions, the contrary should be the case.
According to [Glo10] for deep networks the $fan\_out$ should be taken into consideration, too.
For tanh: $$ W_{ij} \sim U \left[ - \sqrt{\frac{6}{fan\_in + fan\_out}}, \sqrt{\frac{6}{fan\_in + fan\_out}} \right] $$
For the logistic function: $$ W_{ij} \sim U \left[ - 4 \sqrt{\frac{6}{fan\_in + fan\_out}}, 4 \sqrt{\frac{6}{fan\_in + fan\_out}} \right] $$
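A minimal NumPy-only sketch of this normalized initialization, using the bound $\sqrt{6/(fan\_in + fan\_out)}$ with an extra factor of 4 for logistic units. The function name `glorot_uniform`, the `gain` parameter and the layer sizes are hypothetical names and values chosen for the illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng, gain=1.):
    """Sample a (fan_in, fan_out) weight matrix from U[-r, r]
    with r = gain * sqrt(6 / (fan_in + fan_out)).

    gain=1 corresponds to tanh units, gain=4 to logistic units.
    """
    r = gain * np.sqrt(6. / (fan_in + fan_out))
    return rng.uniform(low=-r, high=r, size=(fan_in, fan_out))

rng = np.random.RandomState(0)
W_tanh = glorot_uniform(300, 100, rng)
W_logistic = glorot_uniform(300, 100, rng, gain=4.)

# a uniform distribution on [-r, r] has variance r**2 / 3,
# which gives 2 / (fan_in + fan_out) for gain=1
print(np.var(W_tanh), 2. / (300 + 100))
print(np.abs(W_logistic).max(), 4. * np.sqrt(6. / (300 + 100)))
```

The variance $2/(fan\_in + fan\_out)$ is a compromise between keeping the scale of the activations in the forward pass and the scale of the gradients in the backward pass.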
import numpy as np
import theano
# random number generator for the weights; the fixed seed is an arbitrary choice
rng = np.random.RandomState(1234)
def init_W_b(W, b, n_in, n_out):
    if W is None:
        # uniform initialization in the range proposed by [Glo10] for tanh units
        W_values = np.asarray(
            rng.uniform(
                low=-np.sqrt(6./(n_in + n_out)),
                high=np.sqrt(6./(n_in + n_out)),
                size=(n_in, n_out)
            ),
            dtype=theano.config.floatX
        )
        W = theano.shared(value=W_values, name='W', borrow=True)
    # init biases to appropriate values depending on the activation functions
    if b is None:
        b_values = np.full((n_out,), 0.01, dtype=theano.config.floatX)
        b = theano.shared(value=b_values, name='b', borrow=True)
    return W, b
Martens [Mar10] proposes another initialization scheme: "limit the number of non-zero incoming connection weights to each unit".
import random
def init_W_b(W, b, n_in, n_out, b_init=0., nb_sparse=15):
    if W is None:
        W_values = np.zeros(shape=(n_in, n_out))
        # at most nb_sparse non-zero incoming weights per unit
        nb_sparse = min(n_in, nb_sparse)
        for i in range(n_out):
            non_zero_indexes = random.sample(range(n_in), nb_sparse)
            W_values[non_zero_indexes, i] = np.random.normal(size=(len(non_zero_indexes),))
        W = theano.shared(value=W_values, name='W', borrow=True)
    # init biases to appropriate values depending on the activation functions
    # 0 or
    # 0.5 for tanh units
    if b is None:
        b_values = np.full((n_out,), b_init, dtype=theano.config.floatX)
        b = theano.shared(value=b_values, name='b', borrow=True)
    return W, b
W, b = init_W_b(None, None, n_in=40, n_out=5)
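To inspect the resulting sparsity pattern without Theano, the same idea can be sketched with NumPy alone. The function name `sparse_init` and the fixed seed are assumptions made for this illustration; the sizes match the call above.

```python
import numpy as np

def sparse_init(n_in, n_out, nb_sparse=15, rng=None):
    """Weight matrix with at most nb_sparse non-zero incoming weights per unit."""
    rng = np.random.RandomState(0) if rng is None else rng
    nb_sparse = min(n_in, nb_sparse)
    W = np.zeros((n_in, n_out))
    for i in range(n_out):
        # choose nb_sparse distinct input indexes for output unit i
        idx = rng.choice(n_in, size=nb_sparse, replace=False)
        W[idx, i] = rng.normal(size=nb_sparse)
    return W

W_sparse = sparse_init(n_in=40, n_out=5)
# count the non-zero incoming weights of each output unit (one column per unit)
nonzeros_per_unit = (W_sparse != 0).sum(axis=0)
print(nonzeros_per_unit)
```

Because the number of non-zero inputs per unit is fixed, the spread of $z$ no longer grows with the size of the layer below, even though the individual weights are drawn with unit variance.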