As readers may know, the goal of training a deep neural network model is to arrive at the model parameter values that minimize a loss function. The optimization algorithm that forms the basis for most (though not all) training algorithms is the gradient descent algorithm. Gradient descent works by changing each model parameter (a weight or a bias) in the direction opposite to the gradient of the loss function with respect to that parameter.
If $L$ is the loss function, the equation to update a weight parameter $w$ at time $t$ is given by:

$$w_{t+1} = w_t - \eta \, \frac{\partial L}{\partial w_t}$$

where $\eta$ is a hyperparameter called the learning rate.
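The plain gradient descent update can be sketched in a few lines of Python. The quadratic loss below is an illustrative stand-in, not from the original discussion:

```python
def gradient_descent_step(w, grad, lr=0.1):
    """One plain gradient-descent update: w <- w - lr * dL/dw."""
    return w - lr * grad

# Toy example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w = gradient_descent_step(w, grad, lr=0.1)
# w is now very close to the minimizer 3.0
```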
With the above equation for the parameter update as background, we discuss considerations for achieving convergence in training. Some of these considerations have led to the development of variants of the gradient descent algorithm, such as RMSProp and Adam.
Consideration 1: The learning rate is typically changed during training. While a high learning rate may be used in the beginning to take large jumps in parameter values, the learning rate is decayed as training progresses so that an update does not take too big a jump and overshoot a minimum.
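One common way to decay the learning rate is an exponential schedule; the sketch below is illustrative, and the specific constants (`decay_rate`, `decay_steps`) are assumptions, not values from the original:

```python
def exponential_decay(initial_lr, step, decay_rate=0.96, decay_steps=100):
    """Exponentially decay the learning rate as training progresses."""
    return initial_lr * (decay_rate ** (step / decay_steps))

# Early steps use a large learning rate; later steps use a much smaller one.
lr_early = exponential_decay(0.1, step=0)
lr_late = exponential_decay(0.1, step=1000)
```

Other schedules (step decay, cosine annealing, warm-up followed by decay) follow the same pattern of a large early rate shrinking over time.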
Consideration(s) 2: Considerations such as transforming the input vectors so that each vector component has zero mean and unit variance across the whole training set, randomly initializing the weight parameters, scaling a neuron's initial weights in proportion to its fan-in, and using PCA for preprocessing are discussed in Geoffrey Hinton’s lecture.
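The first of those considerations, standardizing each input component across the training set, can be sketched as follows (the data matrix here is a made-up example):

```python
import numpy as np

def standardize(X):
    """Transform each input component (feature column) to zero mean
    and unit variance across the training set."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + 1e-8)  # small constant guards against zero variance

# Toy training set: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_norm = standardize(X)
# Each column of X_norm now has mean ~0 and standard deviation ~1
```

Note that the same `mean` and `std` computed on the training set would be reused to transform validation and test inputs.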
Consideration 3: Different parameters will show different gradients. Parameters with large gradients will take big jumps and train faster, while parameters with small gradients will take small (often negligible) jumps and will barely train. To equalize training across all parameters, a variant of the gradient descent algorithm called RMSProp divides the learning rate by the RMS value of the gradient history. This makes the learning rate adapt to each parameter depending on the magnitude of that parameter’s gradient history. (The earlier RProp algorithm for full-batch gradient descent used just the sign of the gradient rather than its magnitude, so that gradient-magnitude differences between parameters became irrelevant. RMSProp is an adaptation of RProp to mini-batch gradient descent; see Geoffrey Hinton’s slides mentioned above.)
The update equations for parameter $w$ in the RMSProp algorithm are:

$$s_t = \beta \, s_{t-1} + (1 - \beta) \left( \frac{\partial L}{\partial w_t} \right)^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \, \frac{\partial L}{\partial w_t}$$

where $\beta$ is a hyperparameter (takes values less than $1$) governing the RMS calculation, and $\epsilon$ is a small constant to avoid division by zero.
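A minimal sketch of one RMSProp update in Python follows; the hyperparameter values are common defaults, not from the original:

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: the learning rate is divided by the RMS of
    the recent gradient history, adapting it per parameter."""
    s = beta * s + (1.0 - beta) * grad ** 2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)    # per-parameter adaptive step
    return w, s

# One step on a scalar parameter with gradient 2.0.
w, s = 1.0, 0.0
w, s = rmsprop_step(w, grad=2.0, s=s)
```

Because `w`, `grad`, and `s` can equally be NumPy arrays, the same function applies the adaptation element-wise across a whole parameter tensor.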
Consideration 4: The primary rule of the gradient descent algorithm is to update a parameter value using the gradients. While it would be nice to have the gradients guide us consistently towards a minimum, the gradients are often inconsistent: the gradient in one step may suggest updating a parameter in one direction, while the gradient in the next step may suggest updating it in the opposite direction. The consideration here is to follow directions suggested by consistent (possibly small) gradients rather than inconsistent (possibly large) gradients. This led to the notion of momentum, wherein we use the history of gradients in the parameter update rather than just the current gradient. This consideration can also help the update jump out of tiny local minima along the way to a deeper minimum.
The update equations for parameter $w$ when using momentum would be:

$$v_t = \gamma \, v_{t-1} + \frac{\partial L}{\partial w_t}$$

$$w_{t+1} = w_t - \eta \, v_t$$

where $\gamma$ is the momentum hyperparameter (takes values less than $1$).

(Note that $v_t$ is calculated as a geometric average of the gradients. Usually, if we are to geometrically average the gradients $g_t$, we would expect to see the equation $v_t = \gamma \, v_{t-1} + (1 - \gamma) \, g_t$. But we do not see the $(1 - \gamma)$ factor on the second term in the equation above. Nevertheless, $v_t$ is still a geometric average, but one that is scaled by $\frac{1}{1 - \gamma}$.)
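A minimal sketch of one momentum update, using the common formulation in which the velocity accumulates raw gradients without a $(1 - \gamma)$ factor; the hyperparameter values are illustrative defaults:

```python
def momentum_step(w, grad, v, lr=0.01, gamma=0.9):
    """One momentum update: v accumulates a running (scaled geometric)
    average of gradients, and the parameter moves along v rather than
    along the raw current gradient."""
    v = gamma * v + grad   # note: no (1 - gamma) factor on the gradient term
    w = w - lr * v
    return w, v

# One step on a scalar parameter with gradient 2.0 and zero initial velocity.
w, v = 1.0, 0.0
w, v = momentum_step(w, grad=2.0, v=v)
```

When successive gradients agree in sign, `v` grows and the parameter accelerates; when they alternate, the contributions largely cancel, which is exactly the filtering behavior described above.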
Consideration 5: Can we combine the ideas of RMSProp and momentum? Basically, we would like to give all parameters an opportunity to train even though some may have tiny gradients (RMSProp), and we would like to follow consistent, possibly small, gradients (momentum). While Geoffrey Hinton’s slides say that they had not seen success with this combination at the time, the Adam optimizer combines both ideas and includes some more enhancements. Adam often serves as the optimizer of choice for training deep neural networks.
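The combination can be sketched as follows. This is a simplified rendering of the Adam update (the bias-correction terms are one of the enhancements mentioned above, compensating for the zero-initialized averages); the hyperparameter values are the commonly cited defaults, not from the original:

```python
import numpy as np

def adam_step(w, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the momentum-style average of gradients,
    s is the RMSProp-style average of squared gradients, and both are
    bias-corrected before use (t is the step count, starting at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # momentum: gradient average
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # RMSProp: squared-gradient average
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# One step on a scalar parameter with gradient 2.0.
w, m, s = 1.0, 0.0, 0.0
w, m, s = adam_step(w, grad=2.0, m=m, s=s, t=1)
```

The numerator follows consistent gradient history (momentum) while the denominator equalizes step sizes across parameters (RMSProp), which is precisely the combination asked about above.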