Justification of LMS cost function in a probabilistic model

In a machine learning training setup, we are given training samples \{(x_i,y_i)\}_{i=1\cdots m} and we attempt to fit a model structure y = h(x) to them. Training consists of minimizing a cost function over the parameter space of the model structure y = h(x). A common choice is the least mean square (LMS) cost function, namely \sum\limits_{i=1}^m \left(y_i - h(x_i)\right)^2. Why is this particular cost function used? Of course, it penalizes the error between the model value h(x_i) and the actual target sample value y_i. But we can well imagine other cost functions, such as \sum\limits_{i=1}^m |y_i - h(x_i)|, that do the same. So, why is LMS so often used as a cost function?
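To make the two candidate cost functions concrete, here is a minimal sketch that evaluates both on a hypothetical model h(x) = 2x and an illustrative toy training set (both are assumptions for illustration only):

```python
# Hypothetical model and toy training samples (illustrative assumptions).
def h(x):
    return 2.0 * x

samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# The LMS cost: sum of squared residuals.
lms_cost = sum((y - h(x)) ** 2 for x, y in samples)

# An alternative cost: sum of absolute residuals.
abs_cost = sum(abs(y - h(x)) for x, y in samples)

print(round(lms_cost, 4), round(abs_cost, 4))
```

Note how squaring weights larger residuals more heavily than the absolute-value cost does; the rest of this post explains why the squared form drops out of a probabilistic argument.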

Note that, before the training process, while the structure of the model, such as linear, quadratic, or some function composition as in a neural network, is known or assumed, the values of the parameters of the model structure are unknown. The goal of training is to determine the values of the model structure parameters that identify the actual model. We write y = h_\theta(x) to indicate the model obtained by parameterizing the model structure y = h(x) by the values \theta.

The use of LMS has a justification in a probabilistic model.

Suppose there is indeed an underlying model y = h_\theta(x) that governs the generation of samples. Because sampling introduces errors, we assume \{(x_i, y_i)\}_{i=1\cdots m} are generated by the conditional random variables Y_i|X_i=x_i, each with Normal distribution \mathcal{N}\left(h_\theta(x_i),\sigma^2\right) – that is, normally distributed with mean h_\theta(x_i) and variance \sigma^2. We assume the error terms y_i - h_\theta(x_i) are independent and identically distributed (IID), so the Y_i are independent given the x_i.

Thus, we have the conditional probability density p(y_i|x_i) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - h_\theta(x_i))^2}{2\sigma^2}}.

Because the random variables are independent given the inputs, the joint conditional density of \vec{Y} = (Y_1, Y_2,\cdots,Y_m) given X_1 = x_1, X_2 = x_2, \cdots, X_m=x_m is obtained by multiplying the individual densities. Thus,

\begin{aligned} p\left(\vec{y}|\vec{x}\right) &= \prod\limits_{i=1}^{m}p(y_i|x_i) \\ &= \prod\limits_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-h_\theta(x_i))^2}{2\sigma^2}}\end{aligned}
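The product above can be evaluated numerically. A minimal sketch, assuming a hypothetical model h_\theta(x) = 2x, noise level \sigma = 0.5, and an illustrative toy training set:

```python
import math

# Density of N(mean, sigma^2) evaluated at y.
def gaussian_density(y, mean, sigma):
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical model h_theta(x) = 2x, sigma = 0.5, toy samples (assumptions).
sigma = 0.5
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Joint density under the independence assumption: a product of per-sample densities.
joint = 1.0
for x, y in samples:
    joint *= gaussian_density(y, 2.0 * x, sigma)

print(joint)
```

This is the quantity that, viewed as a function of the parameters, becomes the likelihood in the next step.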


If the samples were indeed generated by the underlying model with the true parameter values \theta, then the joint probability of the drawn training samples \{(x_i, y_i)\}_{i=1\cdots m}, as calculated above using those true values, will typically be higher than the probability calculated using any other parameter values.

Considering the above probability calculation as a function of \theta, we call it the likelihood function L(\theta). Choosing the parameter values that make the observed data most probable is called the maximum likelihood principle. We use this principle in training, and determine the unknown \theta as the values that maximize the likelihood L(\theta).

For ease of calculation, instead of maximizing the likelihood directly, we maximize its \log; since \log is an increasing function, both are maximized at the same \theta. We call \log L(\theta) the log likelihood l(\theta).

\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod\limits_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-h_\theta(x_i))^2}{2\sigma^2}} \\ &= \sum\limits_{i=1}^m\log\left(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-h_\theta(x_i))^2}{2\sigma^2}}\right) \\ &= \sum\limits_{i=1}^m\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)+\sum\limits_{i=1}^m \log \left(e^{-\frac{(y_i-h_\theta(x_i))^2}{2\sigma^2}}\right) \\ &= -m\log\sqrt{2\pi}\sigma -\frac{1}{2\sigma^2}\sum\limits_{i=1}^{m}\left(y_i-h_\theta(x_i)\right)^2\end{aligned}
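The algebra above can be sanity-checked numerically: summing the logs of the individual Gaussian densities should equal the closed form on the last line. A sketch, again assuming the hypothetical h_\theta(x) = 2x, \sigma = 0.5, and toy samples:

```python
import math

# Hypothetical model h_theta(x) = 2x, sigma = 0.5, toy samples (assumptions).
sigma = 0.5
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
m = len(samples)

residuals_sq = [(y - 2.0 * x) ** 2 for x, y in samples]

# Log likelihood computed term by term: sum of logs of Gaussian densities.
log_likelihood = sum(
    math.log(math.exp(-r / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma))
    for r in residuals_sq
)

# Closed form from the derivation:
# -m*log(sqrt(2*pi)*sigma) - (1/(2*sigma^2)) * sum of squared residuals.
closed_form = -m * math.log(math.sqrt(2 * math.pi) * sigma) - sum(residuals_sq) / (2 * sigma ** 2)

print(abs(log_likelihood - closed_form))
```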


Since the first term -m\log\sqrt{2\pi}\sigma does not depend on \theta, and the sum of squares enters with the negative coefficient -\frac{1}{2\sigma^2}, finding the \theta that maximizes the likelihood L(\theta) is the same as finding the \theta that minimizes the LMS cost function \sum\limits_{i=1}^m\left(y_i-h_\theta(x_i)\right)^2.
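This equivalence can be checked directly: for a hypothetical one-parameter model h_\theta(x) = \theta x (an illustrative assumption), scanning a grid of \theta values shows that the \theta maximizing the log likelihood is exactly the \theta minimizing the LMS cost:

```python
import math

# Hypothetical one-parameter model h_theta(x) = theta * x, sigma = 0.5,
# and toy samples (all assumptions for illustration).
sigma = 0.5
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
m = len(samples)

def lms_cost(theta):
    return sum((y - theta * x) ** 2 for x, y in samples)

def log_likelihood(theta):
    # The closed form derived above, with the LMS cost appearing inside it.
    return -m * math.log(math.sqrt(2 * math.pi) * sigma) - lms_cost(theta) / (2 * sigma ** 2)

thetas = [i / 1000 for i in range(1500, 2500)]  # grid over [1.5, 2.5)
best_by_cost = min(thetas, key=lms_cost)
best_by_likelihood = max(thetas, key=log_likelihood)

print(best_by_cost == best_by_likelihood)
```

Because the log likelihood is a constant minus a positive multiple of the LMS cost, the two criteria agree at every grid resolution, not just this one.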
