In a machine learning training setup, we are given training samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, and we attempt to fit a model structure $h_\theta$ to the training samples. The training mechanism consists of minimizing a cost function over the parameter space of the model structure. A common choice is the least mean square (LMS) cost function, namely

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$

Why is this particular choice of cost function used? Of course, it penalizes the error between the model value $h_\theta(x^{(i)})$ and the actual target sample value $y^{(i)}$. But we can well imagine other cost functions, such as the absolute-error cost $\sum_{i=1}^{n} \lvert h_\theta(x^{(i)}) - y^{(i)} \rvert$, that do the same. So, why is LMS so often used as a cost function?
Note that, before the training process, the structure of the model (such as linear, quadratic, or some function composition as in a neural network) is known or assumed, but the values of the parameters of that structure are unknown. The goal of training is to determine the values of the model structure parameters that identify the actual model. We write $h_\theta$ to indicate the model obtained by parameterizing the model structure $h$ by the values $\theta$.
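As a concrete illustration, the sketch below takes a hypothetical linear model structure $h_\theta(x) = \theta_0 + \theta_1 x$ and evaluates both the LMS cost and the absolute-error cost on a few made-up training samples (the data and parameter values are invented for the example, not taken from any real dataset):

```python
# Hypothetical linear model structure: h_theta(x) = theta[0] + theta[1] * x.
def h(theta, x):
    return theta[0] + theta[1] * x

# Toy training samples (x_i, y_i), invented for illustration.
samples = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

def lms_cost(theta):
    """LMS cost J(theta) = 1/2 * sum of squared errors."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in samples)

def abs_cost(theta):
    """An alternative cost: sum of absolute errors."""
    return sum(abs(h(theta, x) - y) for x, y in samples)

theta = (1.0, 2.0)  # one candidate parameterization of the model structure
print(lms_cost(theta), abs_cost(theta))
```

Both costs measure how far the parameterized model is from the targets; the derivation below explains why the squared version is the natural one under Gaussian noise.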
The use of LMS has a justification in a probabilistic model.
Say, indeed, there is an underlying model $h_{\theta^*}$ that governs the generation of samples. Because there are errors in sampling the sample space, we assume the targets are generated as

$$y^{(i)} = h_{\theta^*}(x^{(i)}) + \epsilon^{(i)},$$

where the noise terms $\epsilon^{(i)}$ are independent and identically distributed (IID) with Normal distribution $\mathcal{N}(0, \sigma^2)$. Equivalently, conditioned on $x^{(i)}$, each $y^{(i)}$ is a random variable with distribution $\mathcal{N}(h_{\theta^*}(x^{(i)}), \sigma^2)$, that is, normally distributed with mean $h_{\theta^*}(x^{(i)})$ and variance $\sigma^2$.

Thus, we have the probability density

$$p(y^{(i)} \mid x^{(i)}; \theta^*) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - h_{\theta^*}(x^{(i)}) \right)^2}{2\sigma^2} \right).$$
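This density is straightforward to evaluate in code. A minimal sketch, again assuming a hypothetical linear model and a made-up noise level $\sigma = 0.5$:

```python
import math

SIGMA = 0.5  # assumed noise standard deviation (chosen for the example)

def h(theta, x):
    # Hypothetical linear model structure for the example.
    return theta[0] + theta[1] * x

def density(y, x, theta, sigma=SIGMA):
    """Gaussian density p(y | x; theta): mean h_theta(x), variance sigma^2."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(
        -((y - h(theta, x)) ** 2) / (2 * sigma ** 2)
    )

theta = (1.0, 2.0)
# The density peaks when y equals the model value h_theta(x) = 3.0 at x = 1.0,
# and decays as y moves away from that mean.
print(density(3.0, 1.0, theta))
print(density(4.0, 1.0, theta))
```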
Because the noise terms are independent, the joint density of the vector $y = (y^{(1)}, \ldots, y^{(n)})$ given the inputs $X = (x^{(1)}, \ldots, x^{(n)})$ is obtained by multiplying the individual densities. Thus,

$$p(y \mid X; \theta^*) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta^*).$$
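The product of individual densities can be sketched directly (same hypothetical linear model and made-up samples and $\sigma$ as in the earlier examples):

```python
import math

SIGMA = 0.5  # assumed noise standard deviation

def h(theta, x):
    # Hypothetical linear model for illustration.
    return theta[0] + theta[1] * x

def density(y, x, theta, sigma=SIGMA):
    """Gaussian density p(y | x; theta)."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(
        -((y - h(theta, x)) ** 2) / (2 * sigma ** 2)
    )

def joint_likelihood(samples, theta):
    """Joint density of the targets: product of the individual densities
    (this is where the independence assumption is used)."""
    prod = 1.0
    for x, y in samples:
        prod *= density(y, x, theta)
    return prod

samples = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
print(joint_likelihood(samples, (1.0, 2.0)))
```

For parameter values that fit the samples well, the product is much larger than for parameter values that fit poorly.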
Intuitively (this can be made precise using the law of large numbers), the training samples actually drawn form a typical, high-probability draw from the underlying model, compared with other collections of samples. Moreover, we expect the probability of the drawn training samples, as calculated above using the underlying model parameter values $\theta^*$, to be higher than the probability calculated using any other parameter values.
Considering the above probability calculation as a function of the parameters $\theta$, we call it the likelihood function

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - h_\theta(x^{(i)}) \right)^2}{2\sigma^2} \right).$$

The second statement in the paragraph above is called the maximum likelihood principle. We use this principle in training, and determine the unknown parameters $\theta$ as the values that maximize the likelihood $L(\theta)$.
For ease of calculation, instead of maximizing the likelihood directly, we maximize the logarithm (an increasing function) of the likelihood. We call $\ell(\theta) = \log L(\theta)$ the log likelihood:

$$\ell(\theta) = \log L(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2.$$
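Numerically, summing the log of each Gaussian density agrees with the closed form $n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_i (y^{(i)} - h_\theta(x^{(i)}))^2$; a sketch under the same made-up linear model, samples, and $\sigma$ as before:

```python
import math

SIGMA = 0.5  # assumed noise standard deviation
samples = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]  # toy data

def h(theta, x):
    return theta[0] + theta[1] * x  # hypothetical linear model

def log_likelihood_direct(theta, sigma=SIGMA):
    """Sum of log densities log p(y_i | x_i; theta)."""
    return sum(
        math.log(1.0 / (math.sqrt(2 * math.pi) * sigma))
        - (y - h(theta, x)) ** 2 / (2 * sigma ** 2)
        for x, y in samples
    )

def log_likelihood_closed_form(theta, sigma=SIGMA):
    """n * log(1/(sqrt(2*pi)*sigma)) - SSE / (2*sigma^2)."""
    n = len(samples)
    sse = sum((y - h(theta, x)) ** 2 for x, y in samples)
    return n * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma)) - sse / (2 * sigma ** 2)

theta = (1.0, 2.0)
print(log_likelihood_direct(theta), log_likelihood_closed_form(theta))
```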
Since the log of the Gaussian likelihood is, up to an additive constant that does not depend on $\theta$, equal to $-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2$, finding the $\theta$ that maximizes the likelihood $L(\theta)$ is the same as finding the $\theta$ that minimizes the LMS cost function

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$