On the principle of learning MSE

1 minute read

We usually use mean squared error (MSE) as the objective function for our machine learning models. What makes it a valid objective?

What machine learning does

Let our model be a function that maps an input $ X $ to an output $ Y $.

In probabilistic terms, the model can be viewed as modeling $ P(Y \mid X ; \theta ) $, where $\theta$ is the set of parameters that we learn.

To put it another way, we are learning the parameter $\theta$ that maximizes the probability of the output $Y$ given the input $X$.

This objective is often referred to as the Maximum Likelihood Estimate (MLE). Formally:

$ \theta_{ML} = \arg\max_{\theta} P(Y \mid X ; \theta) $

Where MSE comes from

So here, we would like to find the $\theta$ that maximizes $ P(Y \mid X ; \theta ) $.

Let’s introduce two assumptions:

  • Assumption 1. Samples are iid.

  • Assumption 2. $ P(y \mid x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y-\hat{y})^2}{2\sigma^2}} $, where $\hat{y}$ is the model’s prediction for input $x$ (see the note right after this list).
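In other words, Assumption 2 says that the target is the model’s prediction plus zero-mean Gaussian noise with a fixed, known variance $\sigma^2$:

$ y = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $

So the likelihood of a sample is largest when the prediction $\hat{y}$ lands close to the observed $y$.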

From those assumptions, our objective becomes,

$ \theta_{ML} = \arg\max_{\theta} P(Y \mid X ; \theta) $

$ = \arg\max_{\theta} \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)} ; \theta) $

$ = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)} ; \theta) $

$ = \arg\max_{\theta} \left( -m\log\sigma - \frac{m}{2}\log 2\pi - \sum_{i=1}^{m} \frac{(\hat{y}^{(i)} - y^{(i)})^2}{2\sigma^2} \right) $

Since the first two terms do not depend on $\theta$, they can be dropped:

$ = \arg\max_{\theta} \left( -\sum_{i=1}^{m} \frac{(\hat{y}^{(i)} - y^{(i)})^2}{2\sigma^2} \right) $

To convert this into a minimization objective, we flip the sign:

$ \theta_{ML} = \arg\min_{\theta} \sum_{i=1}^{m} \frac{(\hat{y}^{(i)} - y^{(i)})^2}{2\sigma^2} $

Because $\sigma$ is a fixed constant, the factor $\frac{1}{2\sigma^2}$ (and, likewise, dividing by $m$) does not change the minimizer, so this is exactly the objective of minimizing the mean squared error.
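As a quick numerical sanity check of the expansion above, here is a minimal sketch (the linear "model", the data, and the fixed $\sigma$ are arbitrary choices made up for illustration) confirming that the Gaussian negative log-likelihood equals $m\log\sigma + \frac{m}{2}\log 2\pi + \sum_{i} \frac{(\hat{y}^{(i)} - y^{(i)})^2}{2\sigma^2}$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy data and an arbitrary "model": predictions are just 2*x (made up for illustration).
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y_hat = 2.0 * x          # model predictions
sigma = 0.5              # fixed, known noise scale (Assumption 2)
m = len(y)

# Negative log-likelihood computed directly from the Gaussian density.
nll_direct = -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

# The same quantity via the expansion used in the derivation.
nll_expanded = m * np.log(sigma) + (m / 2) * np.log(2 * np.pi) \
    + np.sum((y_hat - y) ** 2) / (2 * sigma ** 2)

print(np.isclose(nll_direct, nll_expanded))  # True
```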

Hence, our conclusion is:

Minimizing the MSE loss is equivalent to maximizing the log-likelihood, given i.i.d. samples and a Gaussian output distribution.
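To see the equivalence concretely, here is a minimal sketch (the one-parameter model $\hat{y} = \theta x$, the grid of candidate $\theta$ values, and the data are all hypothetical choices for illustration) showing that the $\theta$ that minimizes MSE is the same $\theta$ that maximizes the Gaussian log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical data generated from y = 3*x + Gaussian noise.
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)

thetas = np.linspace(0.0, 6.0, 601)   # candidate parameters for y_hat = theta * x
sigma = 1.0                           # fixed noise scale (Assumption 2)

mse = np.array([np.mean((theta * x - y) ** 2) for theta in thetas])
log_lik = np.array([np.sum(norm.logpdf(y, loc=theta * x, scale=sigma))
                    for theta in thetas])

# The minimizer of MSE and the maximizer of the log-likelihood coincide.
print(thetas[np.argmin(mse)], thetas[np.argmax(log_lik)])
```

Both print the same value of $\theta$, since the log-likelihood is a constant minus a positive multiple of the MSE.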
