A survey of papers in deep learning

Deep learning has proven quite effective at classification problems (image, speech, etc.).

Ever since AlexNet won the ImageNet challenge in 2012, there has been an explosion of research papers written on deep learning. While efforts are being made to develop a sound theory explaining the success of deep learning, a lot of papers center on improving existing neural network architectures with intuition-based “techniques”, verifying the improvements empirically.

In this article, I list some key papers over the years that show how neural network architectures have changed. Note that the list will keep evolving, since the field itself is evolving and, more importantly, my knowledge is evolving.

  • The AlexNet paper.
  • Oxford’s VGG nets, where the idea is to build networks deeper than AlexNet (but using only small 3×3 convolutional kernels) to gain more expressive power.
  • Network in Network, where the idea is to have micro networks slide across feature maps in the same manner as traditional CNNs slide filter kernels.
  • Google’s Inception architecture, which uses the idea from the Network in Network paper to stack micro networks called inception modules on top of each other rather than traditional convolutional layers. To reduce the number of computations, they introduce 1×1 convolutions before the expensive 3×3 and 5×5 convolutional kernels. See here for details on how computations are reduced.
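As a rough back-of-the-envelope illustration (with made-up channel sizes, not the paper's exact numbers), the multiply-accumulate count for a 5×5 convolution drops by roughly an order of magnitude when a 1×1 bottleneck is inserted first:

```python
# Rough multiply-accumulate (MAC) count for a 5x5 convolution on a
# 28x28 feature map, with and without a 1x1 bottleneck.
# Channel sizes below are illustrative, not from the paper.

def conv_macs(h, w, k, c_in, c_out):
    """MACs for a kxk convolution with 'same' padding."""
    return h * w * k * k * c_in * c_out

H = W = 28
direct = conv_macs(H, W, 5, 192, 32)        # 5x5 straight on 192 channels
bottleneck = (conv_macs(H, W, 1, 192, 16)   # 1x1 reduces 192 -> 16 channels
              + conv_macs(H, W, 5, 16, 32)) # 5x5 on the reduced map

print(f"direct:     {direct:,}")      # 120,422,400
print(f"bottleneck: {bottleneck:,}")  # 12,443,648
print(f"ratio:      {direct / bottleneck:.1f}x")
```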
  • Microsoft’s residual networks, which introduce residual connections between layers and increase network depth to ~150 layers.
  • MobileNets implement “reduced” networks that achieve not-much-compromised accuracy at a lower computational cost than larger networks such as the Inception networks, so that they can be deployed in compute-limited settings such as mobile phones. One way to shrink a network without changing its architecture is to store the parameters with fewer bits (8-bit fixed point instead of 32-bit floating point). Instead, MobileNets restructure the traditional convolutional layers into depthwise convolutions (one filter per input channel) whose outputs are then combined across channels by 1×1 convolutions. This reduces both the number of parameters and the number of computations.
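A quick parameter count (with illustrative channel sizes) shows why the depthwise-separable factorization is cheaper than a standard convolution:

```python
# Parameter counts for a standard 3x3 convolution versus a depthwise
# separable one (3x3 depthwise + 1x1 pointwise), ignoring biases.
# Channel sizes are illustrative.

def standard_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one kxk filter per input channel
    pointwise = c_in * c_out   # 1x1 conv combines channels
    return depthwise + pointwise

c_in, c_out = 128, 256
std = standard_params(3, c_in, c_out)    # 294,912
sep = separable_params(3, c_in, c_out)   # 33,920
print(f"standard:  {std:,}")
print(f"separable: {sep:,} ({std / sep:.1f}x fewer)")
```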
  • In general, CNNs have a stack of convolutional layers followed by fully connected layers. Since the convolutional layers slide a kernel filter across the input, they are indifferent to the input image size. The fully connected layers, however, are connected to every coordinate of the feature map, and hence require the feature map to be of a size matching the fully connected layer’s size. Since input images can vary in size, this problem is usually handled by cropping or warping the image. Cropping can lose content, while warping introduces distortions. Spatial pyramid pooling combines a method from traditional computer vision with deep learning by introducing an SPP layer between the stack of convolutional layers and the fully connected layers, eliminating any cropping or warping of the input images. As opposed to a traditional max-pool layer, an SPP layer fixes the output size by scaling the pool window in proportion to the input feature map size.
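A minimal sketch of one pyramid level, assuming simple max pooling with windows scaled (via floor/ceil) to the input feature map size:

```python
import numpy as np

# One spatial-pyramid-pooling level: max-pool an arbitrary HxW feature
# map down to a fixed n x n grid by scaling the pool windows.

def spp_level(feature_map, n):
    """feature_map: (H, W) array -> (n, n) array, for any H, W >= n."""
    h, w = feature_map.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n  # floor / ceil
            c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
            out[i, j] = feature_map[r0:r1, c0:c1].max()
    return out

# Different input sizes yield the same fixed output size:
for shape in [(13, 13), (24, 17)]:
    fm = np.random.rand(*shape)
    print(spp_level(fm, 4).shape)   # (4, 4) either way
```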
  • The object detection problem is different from image classification. In image classification, an image with one object is given, and the object must be assigned to one of the classes. In object detection, an image with multiple objects is given, a bounding box has to be identified for each object, and each object must be classified into one of the classes. The network architectures discussed in the above papers attack image classification; Fast R-CNN tries to solve object detection. A region proposal method (super-pixel based?) is run on the input image to propose bounding boxes for objects of interest. Independent of the region proposals, a stack of convolutional layers is run on the entire input image. After the stack of convolutional layers, an ROI pooling layer (a spatial pyramid pooling layer) outputs a fixed-size feature map. Then a sequence of fully connected layers is run that finally branches into two sibling layers, a softmax layer and a bounding-box regression layer, producing two outputs: the class probabilities for each object and the coordinates of its bounding box.
    • Fast R-CNN improves over the earlier R-CNN paper, where region proposal is first done on the input image and a separate CNN is run on a warped version of each region of interest. The older method is expensive because a separate CNN is run on each region of interest with no weight sharing.
  • Faster R-CNN
  • Voxel nets
  • Recurrent neural networks (RNNs) have been around for decades for analyzing sequences of data, such as a time series where the input at one time instant is correlated with the inputs at nearby time instants. RNNs take an input at each instant in the sequence, pass it through one or more hidden layers and generate an output. Where they differ from feed-forward networks is that there are connections from each hidden layer to itself. The best way to understand the RNN structure, and how RNNs help in analyzing a correlated data sequence, is to see an RNN unfolded.
    • For training RNNs using gradient descent, the gradients are found using a variant of backpropagation called backpropagation through time (BPTT). The idea of the BPTT algorithm, and the problem of vanishing and exploding gradients when training RNNs, can be found in this paper. Essentially, BPTT is the backpropagation algorithm applied to an unfolded RNN, where the gradients of the error function are calculated by backpropagating sequentially through the hidden state variables. Note that, in backpropagating on an unfolded RNN through the hidden states, the same weight matrix is used between each pair of successive hidden states. This causes the vanishing or exploding gradients problem, depending on the maximum eigenvalue of the weight matrix. (But what if the weight matrix, and thereby its maximum eigenvalue, changes during training? Could you have alternating vanishing and exploding gradient problems?)
      • The vanishing/exploding gradients problem is prevalent in deep feed-forward networks as well. As gradients are backpropagated during training, the gradients with respect to a layer’s parameters can shrink or grow exponentially with the layer’s distance from the output layer. This in turn causes the weights to adjust disproportionately between the layers closer to the output and the layers closer to the input, resulting in a sub-optimal outcome of training. (Is the gradient descent algorithm inappropriate for optimizing deep nested functions?) The problem may be more prevalent with the sigmoid activation function because of its flat end behavior, and hence ReLU activations are often used as a technique to circumvent it.
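The eigenvalue effect described above can be illustrated with a toy linear RNN: backpropagating through T steps amounts to multiplying the gradient by the same weight matrix T times, so the gradient norm scales roughly like (spectral radius of W)^T. Matrix sizes and scalings here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_after(W, T):
    """Norm of a gradient pushed back through T steps of a linear RNN."""
    g = np.ones(W.shape[0])
    for _ in range(T):
        g = W.T @ g   # same weight matrix applied at every time step
    return np.linalg.norm(g)

base = rng.standard_normal((8, 8)) / np.sqrt(8)
spec = np.max(np.abs(np.linalg.eigvals(base)))

W_small = base * (0.5 / spec)   # spectral radius 0.5 -> vanishing
W_large = base * (1.5 / spec)   # spectral radius 1.5 -> exploding

print(grad_norm_after(W_small, 50))   # shrinks toward zero
print(grad_norm_after(W_large, 50))   # blows up
```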
    • To circumvent the problem of vanishing and exploding gradients in RNNs, alternative structures were proposed for the hidden layer. Traditionally, the hidden layer had just a \tanh unit. However, LSTMs (Long Short-Term Memory) and their smaller cousins GRUs (Gated Recurrent Units) are now widely used. LSTMs are explicitly designed to keep track of long-term dependencies in sequences. They employ a forget gate, an input gate and an output gate, which respectively control which parts of the existing cell state must be forgotten, which parts must be updated by the current input (and the previous hidden state), and which parts of the new cell state propagate as the hidden state for the next time instant. See here for a good explanation. GRUs employ just two gates: a reset gate and an update gate. GRUs and an encoder-decoder architecture for machine translation are introduced in this paper.
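A minimal numpy sketch of a single LSTM step, following the standard gate equations (weight shapes and values here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params holds a weight matrix and bias per gate."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(params["Wf"] @ z + params["bf"])  # forget gate: what to keep of c_prev
    i = sigmoid(params["Wi"] @ z + params["bi"])  # input gate: what of the candidate to write
    o = sigmoid(params["Wo"] @ z + params["bo"])  # output gate: what of c_new to expose
    g = np.tanh(params["Wg"] @ z + params["bg"])  # candidate cell state
    c_new = f * c_prev + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Input dim 4, hidden dim 3 (illustrative).
rng = np.random.default_rng(1)
params = {f"W{k}": rng.standard_normal((3, 7)) * 0.1 for k in "fiog"}
params.update({f"b{k}": np.zeros(3) for k in "fiog"})
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), params)
print(h.shape, c.shape)   # (3,) (3,)
```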
  • Reinforcement Learning provides a framework for an agent to take action in a given environment to maximize cumulative reward.
    • A policy gradient algorithm called Trust Region Policy Optimization (TRPO) is given in the paper. In the paper, \eta(\pi) denotes the expected reward of policy \pi. Thus, \eta(\pi) = \mathbb{E}_s\left[v_{\pi}\left(s\right)\right], where the expectation is on the initial state s. Also, \rho_{\pi}(s) denotes the discounted visitation frequency for a state s: P\left(s_0 = s\right) + \gamma P\left(s_1 = s\right) + \gamma^2 P\left(s_2 = s\right) + \cdots, where s_t denotes the state at time t and the probabilities take into account the policy \pi and the dynamics of the environment.

      The paper shows for two different policies \pi and \tilde{\pi} that \eta\left(\tilde{\pi}\right) = \eta(\pi) + \sum\limits_s\rho_{\tilde\pi}(s)\sum\limits_a\tilde{\pi}(a|s)A_{\pi}(s,a).

      \eta\left(\tilde{\pi}\right) is approximated by L_{\pi}\left(\tilde\pi\right) = \eta(\pi) + \sum\limits_s\rho_{\pi}(s)\sum\limits_a\tilde{\pi}(a|s)A_{\pi}(s,a). The approximation comes from replacing \rho_{\tilde{\pi}} in \eta\left(\tilde{\pi}\right) by \rho_{\pi}.

      Given a current policy \pi_{curr}, the algorithm derives a new policy \pi_{new} = \arg\!\max\limits_{\pi} L_{\pi_{curr}}\left(\pi\right). This update clearly gives L_{\pi_{curr}}\left(\pi_{new}\right) \ge L_{\pi_{curr}}\left(\pi_{curr}\right). But, does that ensure a similar relation in actual expected rewards, namely \eta\left(\pi_{new}\right) \ge \eta\left(\pi_{curr}\right)?

      The paper gives the following theorem:

      \eta\left(\pi_{new}\right) \ge L_{\pi_{curr}}\left(\pi_{new}\right) - \frac{4\varepsilon\gamma}{(1-\gamma)^2}D_{KL}^{\max}\left(\pi_{curr},\pi_{new}\right), where \varepsilon = \max\limits_{s,a} |A_{\pi}(s,a)| (should \pi here be \pi_{curr}?), and D_{KL}^{\max}\left(\pi_{curr},\pi_{new}\right) = \max\limits_s D_{KL}\left(\pi_{curr}(.|s)\,\|\,\pi_{new}(.|s)\right) with D_{KL}\left(p\,\|\,q\right) being the KL divergence of probability measures p and q.

      Essentially, if the new policy \pi_{new} does not differ too much from the current policy \pi_{curr}, then \eta\left(\pi_{new}\right) is guaranteed to be close to L_{\pi_{curr}}\left(\pi_{new}\right). Thus, by improving on L_{\pi_{curr}}\left(\pi_{curr}\right), we improve on \eta\left(\pi_{curr}\right).

      The algorithm in the paper is designed approximately around the above theory.

      If we use subscripted \theta to indicate parameterized policies, the algorithm does the following in each step: find \theta to maximize L_{\theta_{curr}}\left(\theta\right) subject to constraint D_{KL}^{\rho_{\theta_{curr}}}\left(\theta_{curr},\theta\right) \le \delta. Here, D_{KL}^{\rho_{\theta_{curr}}}\left(\theta_{curr},\theta\right)  = \mathbb{E}_{s\sim\rho_{\theta_{curr}}}\left[D_{KL}\left(\pi_{\theta_{curr}}(.|s)\,\|\,\pi_{\theta}(.|s)\right)\right]. The constraint D_{KL}^{\rho_{\theta_{curr}}}\left(\theta_{curr},\theta\right) \le \delta limits the selection of the new policy in each update step to a trust region of the current policy.

      To apply Monte-Carlo simulation to the above algorithm, we convert the summations to expectations.

      We justify as follows:

      • Maximizing L_{\theta_{curr}}\left(\theta\right) is the same as maximizing \mathbb{E}_{s\sim\rho_{\theta_{curr}}} \sum\limits_a\pi_{\theta}(a|s)A_{\theta_{curr}}(s,a) = \sum\limits_s \rho_{\theta_{curr}}(s)\sum\limits_a\pi_{\theta}(a|s)A_{\theta_{curr}}(s,a).
      • We then replace the internal summation also by expectation and write the objective function as \mathbb{E}_{s\sim\rho_{\theta_{curr}}, a\sim\pi_{\theta_{curr}}} \left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{curr}}(a|s)}A_{\theta_{curr}}(s,a)\right].

      Thus, the constrained optimization problem becomes:

      maximize over \theta
      \mathbb{E}_{s\sim\rho_{\theta_{curr}}, a\sim\pi_{\theta_{curr}}} \left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{curr}}(a|s)}A_{\theta_{curr}}(s,a)\right]


      subject to

      D_{KL}^{\rho_{\theta_{curr}}}\left(\theta_{curr},\theta\right) \le \delta


      Using Monte-Carlo simulation, the algorithm replaces expectations by empirical estimates.
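The importance-sampling rewrite above can be demonstrated on a toy discrete action space (the distributions and advantage values are made up): samples are drawn under the current policy, and each sample is weighted by the probability ratio to estimate the expectation under the new policy.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_curr = np.array([0.5, 0.3, 0.2])   # behavior policy over 3 actions
pi_new  = np.array([0.2, 0.3, 0.5])   # target policy
A       = np.array([1.0, -0.5, 2.0])  # advantage of each action (dummy values)

exact = float(np.sum(pi_new * A))     # what we want: E under pi_new

actions = rng.choice(3, size=200_000, p=pi_curr)   # sampled under pi_curr
weights = pi_new[actions] / pi_curr[actions]       # importance weights
estimate = float(np.mean(weights * A[actions]))

print(exact, estimate)   # the estimate converges to the exact value
```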

    • Another paper proposes the PPO algorithm, which is simpler to implement than TRPO. The PPO paper proposes several variants of objective functions:
      1. Objective function with no constraint or penalty: L^{PG}(\theta) = \mathbb{E}_t\left[\log\pi_{\theta}(a_t|s_t)A_t\right]
      2. TRPO paper based objective function: \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{curr}}(a_t|s_t)}A_t - \beta D_{KL}\left[\pi_{\theta_{curr}}(.|s_t),\pi_{\theta}(.|s_t)\right]\right]
      3. Clipped TRPO objective function: L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{curr}}(a_t|s_t)},\mbox{clip}\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{curr}}(a_t|s_t)},1-\varepsilon,1+\varepsilon\right)\right)A_t\right]
      4. Adaptive KL penalty objective function: L^{KLPEN}(\theta) = \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{curr}}(a_t|s_t)}A_t - \beta D_{KL}\left[\pi_{\theta_{curr}}(.|s_t),\pi_{\theta}(.|s_t)\right]\right], where \beta is adapted periodically.

      Of the above, (3) and (4) aim to overcome the difficulty in correctly choosing the value of the hyper-parameter \beta in (2). The paper claims robust performance with (3).
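A small sketch of the clipped objective (3), computed from dummy probability ratios and advantage estimates:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Mean of min(r * A, clip(r, 1-eps, 1+eps) * A) over time steps."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))

ratio = np.array([0.8, 1.0, 1.5, 0.5])   # pi_theta / pi_theta_curr (dummy)
adv   = np.array([1.0, -1.0, 2.0, -2.0]) # advantage estimates (dummy)

print(clipped_surrogate(ratio, adv))   # 0.15
```

The clip removes the incentive to move the ratio far outside [1-eps, 1+eps], and the outer min makes the objective a pessimistic (lower) bound on the unclipped one.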

      With each of the above, an estimate of the advantage function is required as policy parameters are updated. This is typically done in an actor-critic setting where the advantage function is computed using a critic neural network, while the policy is modeled using an actor neural network. It is possible to share the parameters across the two neural networks.

      To estimate the advantage function, a trajectory (sequence of steps) of a certain length is run each time observing the environment for the state and the reward. The advantage is calculated as:

      A_t = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1}r_{T-1} + \gamma^{T-t}V(s_T),
      where V(s) is estimated using the critic neural network.
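A small sketch of this fixed-horizon advantage estimate, accumulating the discounted return backward through the window and bootstrapping from V(s_T) (the reward and value numbers are made up):

```python
import numpy as np

def advantages(rewards, values, gamma):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T); returns A_0..A_{T-1}."""
    T = len(rewards)
    A = np.empty(T)
    ret = values[T]                      # bootstrap from the final state value
    for t in reversed(range(T)):
        ret = rewards[t] + gamma * ret   # discounted return from step t
        A[t] = ret - values[t]           # subtract the baseline V(s_t)
    return A

r = np.array([1.0, 0.0, 1.0])
V = np.array([0.5, 0.4, 0.6, 0.2])       # includes V(s_T)
print(advantages(r, V, gamma=0.9))       # [1.4558 0.662  0.58  ]
```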

      The PPO paper suggests the use of parallel actors. The algorithm is as follows:

      • Iterate over steps
        • Iterate over N parallel actors
          • Run policy \pi_{curr} for T time steps
          • Compute advantages A_1, A_2, \cdots, A_T
      • Optimize the surrogate objective function L with respect to \theta, over K epochs of minibatch updates
      • \theta_{curr} = \theta
  • The attention mechanism brings a novel idea into the architecture for solving the machine translation problem. Traditionally, an encoder-decoder architecture was used for machine translation, wherein the encoder side employs an RNN (usually) to encode a source-language sentence into a fixed-length intermediate vector and the decoder side employs another RNN (usually) to generate the corresponding target-language sentence. This approach forced the decoder RNN to use the same fixed-length intermediate vector at all time instants. While a target word at time instant t may depend more strongly than a target word at another time instant s on the source word at a particular time instant r, the fixed-length-vector encoder-decoder architecture allowed no such freedom at each time instant of target word generation. The attention mechanism takes the hidden states of the encoder RNN at each time instant and linearly combines them, with the weights of the linear combination derived separately for each target time instant. The weights are derived separately for each target time instant t because they are a function of the hidden state of the target RNN at time instant t-1 (as well as the hidden states of the source RNN).
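A toy numpy sketch of the weighted combination: alignment scores between a decoder state and each encoder hidden state are softmaxed into weights, whose weighted sum forms the per-step context vector (a plain dot product stands in here for the learned alignment network):

```python
import numpy as np

def context_vector(decoder_state, encoder_states):
    """Weights over source positions and the resulting context vector."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 8))   # 5 source positions, hidden dim 8
ctx, w = context_vector(rng.standard_normal(8), enc)
print(ctx.shape, w.sum())   # (8,) 1.0
```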
  • The Transformer improves on previous ideas by dispensing with RNNs and using attention mechanisms throughout. The paper abstracts the attention mechanism with an independent definition, not tying it specifically to an encoder-decoder setup. Attention can be defined using queries, keys and values: it linearly combines the values, where the weight on each value is obtained as a function of a query and a key. In the machine translation task, this idea is used to align a source word (query) with surrounding words (keys). If the queries, keys and values come from the same input stream (such as within the encoder stack or the decoder stack), the attention is called self-attention. The architecture also employs (non-self-)attention between the encoder and decoder stacks, where the queries come from the decoder states and the keys and values come from the encoder states. For the implementation, the paper uses multi-head attention, wherein several attention mechanisms are carried out in parallel on lower-dimensional projections of the queries, keys and values, and the results of these parallel attentions are concatenated.
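A minimal numpy sketch of scaled dot-product attention and a multi-head wrapper (the random projection matrices stand in for learned weights):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ V

def multi_head(x, proj):
    """Self-attention: Q, K, V all come from x; head outputs are concatenated."""
    heads = [attention(x @ Wq, x @ Wk, x @ Wv) for Wq, Wk, Wv in proj]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 16, 4, 4
proj = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
        for _ in range(n_heads)]   # one (Wq, Wk, Wv) triple per head
x = rng.standard_normal((6, d_model))       # 6 tokens
print(multi_head(x, proj).shape)            # (6, 16)
```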
  • ELMO
  • BERT
  • GPT-1