Momentum and Nesterov Momentum

Classical Momentum

Classical momentum accelerates gradient descent by accumulating a velocity $\Delta \theta$ over past gradients. In each step, the previous velocity is scaled by the momentum parameter, the gradient (scaled by the learning rate) is subtracted from it, and the parameters are moved along the velocity:

$$ \Delta \theta_{t+1} = \mu \Delta \theta_{t} - \alpha \nabla f(\theta_t) $$

$$ \theta_{t+1} = \theta_{t} + \Delta \theta_{t} $$

with

  • the gradient of the function $f$ at the point $\theta_t$: $\nabla f(\theta_t)$
  • the update (velocity) of the parameters at step $t$: $\Delta \theta_t$
  • the learning rate: $\alpha$
  • the momentum parameter: $\mu$

Nesterov Momentum

Nesterov momentum (based on Nesterov's accelerated gradient [Nes83]) differs from classical momentum only in the point at which the gradient is evaluated: instead of the current parameters $\theta_t$, it uses the look-ahead point $\theta_t + \mu \Delta \theta_t$ reached after applying the momentum step [Sut13]:

$$ \Delta \theta_{t+1} = \mu \Delta \theta_{t} - \alpha \nabla f(\theta_t + \mu \Delta \theta_t) $$

$$ \theta_{t+1} = \theta_{t} + \Delta \theta_{t} $$
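
In a NumPy sketch analogous to the classical one above (again purely illustrative), only the point at which the gradient is evaluated changes:

In []:
def nesterov_momentum_step(theta, delta, grad_f, alpha=0.01, mu=0.9):
    # the gradient is evaluated at the look-ahead point theta_t + mu * Delta theta_t
    new_theta = theta + delta
    new_delta = mu * delta - alpha * grad_f(theta + mu * delta)
    return new_theta, new_delta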

Implementation in Theano

In Theano, both variants can be expressed as a list of update pairs for the parameters and their velocity (shared) variables. The function below builds this list for a given cost expression; for Nesterov momentum, the gradient at the look-ahead point is obtained by cloning the cost graph with the parameter replaced by the look-ahead expression:

In []:
import numpy as np
import theano
import theano.tensor as T

def get_updates(cost, params, learning_rate, momentum, nesterov, consider_constant=None):
    # momentum and nesterov momentum
    updates = []
    for param in params:
        # velocity Delta theta, initialized to zero with the same shape as param
        param_update = theano.shared(param.get_value() * np.cast[theano.config.floatX](0.))
        # move the parameter along the current velocity
        updates.append((param, param + param_update))
        if nesterov:
            # nesterov momentum: gradient at the look-ahead point theta + mu * Delta theta,
            # obtained by replacing param with the look-ahead expression in the cost graph
            lookahead_cost = theano.clone(cost, replace={param: param + momentum * param_update})
            grad = T.grad(lookahead_cost, param, consider_constant=consider_constant)
        else:
            # classical momentum: gradient at the current point theta
            grad = T.grad(cost, param, consider_constant=consider_constant)
        # new velocity: mu * Delta theta - alpha * gradient
        updates.append((param_update, momentum * param_update - learning_rate * grad))
    return updates

The returned list of update pairs can be passed directly to theano.function via its updates argument, so that every call of the compiled function performs one momentum step.
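
As a minimal usage sketch (the shared variable theta and the toy quadratic cost are made up for illustration, not taken from the text above):

In []:
# toy cost: minimize ||theta - 3||^2 with Nesterov momentum
theta = theano.shared(np.zeros(5, dtype=theano.config.floatX), name='theta')
cost = T.sum((theta - 3.0) ** 2)

updates = get_updates(cost, [theta], learning_rate=0.1, momentum=0.9, nesterov=True)
train_step = theano.function(inputs=[], outputs=cost, updates=updates)

for i in range(200):
    train_step()
print(theta.get_value())  # every component should now be close to 3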

Literature:

  • [Ben13] Bengio Y., Boulanger-Lewandowski N. and Pascanu R.: Advances in Optimizing Recurrent Networks, International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
  • [Nes83] Nesterov Y.: A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, 1983
  • [Sut13] Sutskever I., Martens J., Dahl G. and Hinton G. E.: On the importance of initialization and momentum in deep learning, 30th International Conference on Machine Learning (ICML), JMLR 2013