Back-propagation = Dynamic programming + Calculus chain rule

In Neural networks, we have a stack of layers of computations that takes input values and produces an output. Mathematically, this can be viewed as a hierarchy of function compositions where each nesting level of function composition corresponds to a neural network layer.

For example, O = f_n\left(W_n, B_n, f_{n-1}\left(W_{n-1}, B_{n-1}, \cdots f_1\left(W_1, B_1, f_0(W_0, B_0, I)\right) \cdots \right)\right) is a Neural network with n+1 layers. Here, I is the input vector, O is the output vector, and f_k is the kth layer, with layer 0 being the input layer and layer n being the output layer. W_k and B_k are the weights and biases, respectively, of the kth layer. Each layer f_k is a function of its own weights and biases and of the output of the previous layer f_{k-1}. Note that each layer produces a vector output; in particular, the final output O of the network is a vector.
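The composition above can be sketched in a few lines of code. This is a minimal illustration, not an implementation from the text: the tanh activation, the layer dimensions, and the random weights are all arbitrary choices made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, B, x):
    # f_k(W_k, B_k, x): an affine map followed by a nonlinearity (tanh here).
    return np.tanh(W @ x + B)

# Three layers (n = 2), with dimensions 4 -> 5 -> 3 -> 2 chosen arbitrarily.
dims = [4, 5, 3, 2]
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(3)]
Bs = [rng.standard_normal(dims[k + 1]) for k in range(3)]

I = rng.standard_normal(4)   # input vector
x = I
for W, B in zip(Ws, Bs):     # O = f_2(W_2, B_2, f_1(W_1, B_1, f_0(W_0, B_0, I)))
    x = layer(W, B, x)
O = x                        # output vector, shape (2,)
```

Each pass through the loop applies one more level of the function composition, so the final `O` corresponds to the outermost f_n in the formula.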

When a Neural network is designed (the number of layers in the hierarchy of function compositions, the specific functions used at each layer, the dimensions of the output vectors at the different layers), we do not yet know the values of the parameters of the network, namely its weights and biases. We merely know how many weights and biases will be used, and in what fashion. The process of determining the weights and biases from example input/output pairs is called training the network.

Once the Neural network is trained, it can be used to infer the output given an input vector that is not one of the example inputs used in training. This is called inference.

In training Neural networks, the goal is to minimize an objective function C, which is a function of the output vector O. Since training is based on example input/output data that is already available, the input I is treated as a constant, and the only variables of the objective function are the weights and biases at each layer. In inference, by contrast, the weights and biases have already been determined, and we are trying to determine O given a new value of I. Hence, in inference, W_k and B_k are constants, while I is the only variable. Thus the roles, namely those of variables or constants, played by the weights, biases, input and output change between training and inference; nevertheless, we presented the function composition hierarchy O = f_n\left(W_n, B_n, f_{n-1}\left(W_{n-1}, B_{n-1}, \cdots f_1\left(W_1, B_1, f_0(W_0, B_0, I)\right) \cdots \right)\right) with all weights, biases, input and output shown as variables just to capture the network structure.

In training, a common approach to minimizing the objective function C is to apply the Gradient Descent algorithm (or one of its variants, such as Stochastic Gradient Descent). The Gradient Descent algorithm requires the calculation of the gradients (partial derivatives) of C with respect to the variables (parameters), namely the weights and biases.
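The Gradient Descent update itself is simple once the gradient is available. As a sketch, consider a toy one-parameter objective C(w) = (w - 3)^2, whose gradient dC/dw = 2(w - 3) we can write down by hand; the objective, initial value, and learning rate are all invented for illustration.

```python
# Toy objective C(w) = (w - 3)^2, with gradient dC/dw = 2*(w - 3).
def grad_C(w):
    return 2.0 * (w - 3.0)

w = 0.0      # initial parameter value (arbitrary)
lr = 0.1     # learning rate, i.e. step size (arbitrary)
for _ in range(100):
    w -= lr * grad_C(w)   # step opposite to the gradient
# w converges toward the minimizer w* = 3
```

In a real network the single parameter w is replaced by all the weights and biases, and the gradient is supplied by back-propagation rather than a hand-written formula.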

Now, how do we calculate the gradients of C with respect to the parameters? Since C is a function of O, C itself is a hierarchy of function compositions. This leads us to the chain rule from Calculus: given h(x) = f(g(x)), we have \frac{dh}{dx} = \frac{df}{dg}\frac{dg}{dx}. For example, the partial derivative of C with respect to W_k can be determined by multiplying the partial derivative of C with respect to f_k by the partial derivative of f_k with respect to W_k. This idea is exactly what leads to the back-propagation algorithm.
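The chain rule can be verified numerically on a small example. Here the functions f(u) = sin(u) and g(x) = x^2 are chosen arbitrarily; the analytic chain-rule derivative is compared against a centered finite difference.

```python
import math

# h(x) = f(g(x)) with f(u) = sin(u) and g(x) = x**2.
# Chain rule: dh/dx = (df/dg) * (dg/dx) = cos(x**2) * 2*x.
def h(x):
    return math.sin(x * x)

x = 0.7
analytic = math.cos(x * x) * 2 * x

eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
# analytic and numeric agree to high precision
```

The same multiplication of an "outer" derivative by an "inner" derivative is what back-propagation performs at every layer.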

What is Dynamic Programming, and what is its relation to back-propagation? It is hard to see the meaning of either of the words Dynamic or Programming in the way Dynamic Programming is used in today’s computer science algorithms; in fact, the origin of the name has an obscure history. In any case, the essence of Dynamic Programming is that a big problem can be solved by first solving smaller but similar sub-problems and then combining the solutions of those sub-problems. A common scenario where Dynamic Programming is used is one where the problem can be represented as a graph or a tree, and the solution across the entire graph can be obtained by first solving across one or more sub-graphs.

[This idea of breaking a large problem into sub-problems is also used by computer scientists in divide-and-conquer algorithms (sorting, the Discrete Fourier Transform), often implemented using recursion with or without memoization. Dynamic programming is closely related to recursion with memoization, but there are some differences. Dynamic programming solves the smaller sub-problems first, assuming their results will be needed for the bigger sub-problems; it tabulates the results as it goes and uses them as required to solve the bigger sub-problems. Dynamic programming thus goes bottom-up from the smallest sub-problems to the original problem. Since it solves smaller sub-problems without first checking whether a particular sub-problem actually figures in the solution of the original problem, some of its effort goes unused if the results of certain sub-problems are never needed. In contrast, recursion with memoization is a top-down approach that starts by expressing the original problem in terms of its sub-problems. In recursion, as in dynamic programming, the sub-problems do get solved before the bigger problems; however, since only those sub-problems that figure in the expression for the original problem are sought out, no effort spent on the sub-problems goes unused. If recursion solves only the sub-problems that are needed, why use dynamic programming at all in preference to recursion? Recursion incurs additional overheads, so dynamic programming is employed when it is known that almost all sub-problems of the given problem need to be solved.]
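The bottom-up versus top-down contrast is easiest to see on a standard example; the Fibonacci numbers are used here purely as an illustration of the two styles.

```python
from functools import lru_cache

# Bottom-up dynamic programming: solve and tabulate the smallest
# sub-problems first, then build up to the original problem.
def fib_dp(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]

# Top-down recursion with memoization: start from the original problem
# and solve (and cache) only the sub-problems it actually reaches.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

# Both compute the same values, e.g. fib_dp(10) == fib_memo(10) == 55
```

For Fibonacci every sub-problem is needed, so the bottom-up table wastes no effort, which is precisely the situation in which dynamic programming is preferred.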

In back-propagation, the chain rule seems to be all there is. Where does Dynamic Programming even come in? Well, to find the partial derivative of C with respect to f_k (or anything nested within it), you must first find the partial derivative of C with respect to f_{k+1}. Thus, finding the partial derivative of C with respect to f_{k+1} is the sub-problem that must be solved first in order to solve the problem of finding the partial derivative of C with respect to f_k. Back-propagation solves these sub-problems in order from the output layer backward, tabulating each partial derivative so that it is computed exactly once and reused at the next layer down, which is precisely the dynamic programming strategy.
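This backward sweep can be sketched on a chain of scalar layers. As an illustration under assumed choices (layers f_k(w_k, x) = tanh(w_k x) with no biases, objective C = O^2, and arbitrary weights), the variable `upstream` plays the role of the tabulated sub-problem dC/df_k:

```python
import numpy as np

# Chain of scalar layers f_k(w_k, x) = tanh(w_k * x); objective C = O**2.
ws = [0.5, -1.2, 0.8]   # arbitrary weights
I = 0.9                 # arbitrary input

# Forward pass: store every layer's output, needed for the derivatives.
xs = [I]
for w in ws:
    xs.append(np.tanh(w * xs[-1]))
O = xs[-1]

# Backward pass: `upstream` holds dC/d(f_k) -- the already-solved
# sub-problem -- which each step reuses instead of re-deriving.
upstream = 2.0 * O                       # dC/dO
grads = [0.0] * len(ws)
for k in reversed(range(len(ws))):
    pre = ws[k] * xs[k]                  # layer k's pre-activation
    local = 1.0 - np.tanh(pre) ** 2      # d tanh(pre) / d pre
    grads[k] = upstream * local * xs[k]  # dC/dw_k by the chain rule
    upstream = upstream * local * ws[k]  # dC/d(f_{k-1}) for the next step
```

Each iteration solves one layer's gradient using only the cached `upstream` value, so no partial derivative of C is ever recomputed.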
