James Gregson's Website

Jun 14, 2021

Least Squares & MLE

Given a linear model \(y_i = x_i^T \beta\), we collect \(N\) observations \(\tilde{y}_i = y_i + \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0,\sigma^2)\). Writing \(e_i = \tilde{y}_i - x_i^T \beta\) for the residual, the likelihood of any one observation \(\tilde{y}_i\) given the current estimate of the parameters \(\beta\) is then:

\begin{equation*} \newcommand{\exp}[1]{{\mbox{exp}\left( #1 \right)}} \newcommand{\log}[1]{{\mbox{log}\left( #1 \right)}} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} P(\tilde{y}_i | \beta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp{-\frac{1}{2}\left(\frac{e_i}{\sigma}\right)^2} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp{-\frac{1}{2}\left(\frac{\tilde{y}_i-x_i^T \beta}{\sigma}\right)^2} \end{equation*}

and the joint likelihood of all of them occurring is:

\begin{equation*} P(\tilde{y} | \beta) = \prod_{i=1}^N P(\tilde{y}_i | \beta ) \end{equation*}
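
To make this concrete, here is a minimal numpy/scipy sketch. The data here are assumed synthetic values (a 1D feature plus intercept, a made-up beta_true, and a known noise level sigma), not anything from a real dataset. It evaluates both the per-observation likelihoods and their product; for large \(N\) the raw product of small densities underflows, which is one practical reason to work with logs below.

```python
import numpy as np
from scipy.stats import norm

# assumed synthetic setup: 1D feature plus intercept, known noise level sigma
rng = np.random.default_rng(0)
N, sigma = 50, 0.5
beta_true = np.array([2.0, -1.0])
X = np.column_stack([np.ones(N), rng.uniform(-1.0, 1.0, N)])  # rows are x_i^T
y_tilde = X @ beta_true + rng.normal(0.0, sigma, N)           # noisy observations

def likelihood(beta):
    # product of per-observation Gaussian densities P(y~_i | beta)
    return np.prod(norm.pdf(y_tilde, loc=X @ beta, scale=sigma))

def log_likelihood(beta):
    # sum of log-densities: numerically far better behaved than the raw product
    return np.sum(norm.logpdf(y_tilde, loc=X @ beta, scale=sigma))

print(likelihood(beta_true), log_likelihood(beta_true))
```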

The parameters most likely to explain the observations, given only the observations and definition of the model, are those that maximize this likelihood. Equivalently, we can find the parameters that minimize the negative log-likelihood since the two have the same optima:

\begin{align*} \beta^* &=& \argmax_\beta P(\tilde{y} | \beta) \\ &=& \argmin_\beta - \log{ P(\tilde{y} | \beta) } \\ &=& \argmin_\beta - N\log{\frac{1}{\sqrt{2\pi\sigma^2}}} - \log{ \prod_{i=1}^N \exp{ -\frac{1}{2}\left(\frac{\tilde{y}_i - x_i^T \beta}{\sigma}\right)^2 } } \end{align*}

The benefit of this is that, for normally distributed residuals, the terms decouple. Ditching the constant factor and collecting the product into the exponent, the above becomes:

\begin{align*} \beta^* &=& \argmin_\beta -\log{ \exp{-\frac{1}{2}\sum_{i=1}^N \left(\frac{\tilde{y}_i - x_i^T \beta}{\sigma}\right)^2} } \\ &=& \argmin_\beta \frac{1}{2\sigma^2} \sum_{i=1}^N \left( \tilde{y}_i - x_i^T \beta \right)^2 \end{align*}

which is exactly the least-squares objective: under Gaussian noise, the maximum-likelihood estimate is the least-squares solution.
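
As a quick numerical check (a sketch reusing the assumed synthetic X, y_tilde, and sigma from above), minimizing this negative log-likelihood with a generic optimizer lands on the same parameters as an ordinary least-squares solve:

```python
from scipy.optimize import minimize

def neg_log_likelihood(beta):
    # constants dropped: (1 / 2 sigma^2) * sum of squared residuals
    r = y_tilde - X @ beta
    return 0.5 / sigma**2 * np.sum(r**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
beta_lsq = np.linalg.lstsq(X, y_tilde, rcond=None)[0]
print(np.allclose(beta_mle, beta_lsq, atol=1e-5))  # same optimum
```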

Least Squares & MAP

We can also place a prior on the parameters \(\beta\), e.g. independent components \(\beta_j \sim \mathcal{N}(0,\sigma_\beta^2)\). The MAP estimate for \(\beta\) is then:

\begin{align*} \beta^* &=& \argmax_\beta P(\tilde{y} | \beta) P(\beta) \\ &=& \argmax_\beta \prod_i P(\tilde{y}_i | \beta) \prod_j P(\beta_j) \\ &=& \argmax_\beta \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\exp{ -\frac{1}{2}\left(\frac{e_i}{\sigma}\right)^2} \prod_j \frac{1}{\sqrt{2\pi\sigma_\beta^2}}\exp{ -\frac{1}{2}\left(\frac{\beta_j}{\sigma_\beta}\right)^2 } \end{align*}

The same negative-log trick converts the products to sums; again ditching constants, this gives:

\begin{align*} \beta^* &=& \argmin_\beta \frac{1}{2\sigma^2} \sum_i e_i^2 + \frac{1}{2\sigma_\beta^2} \sum_j \beta_j^2 \\ &=& \argmin_\beta \frac{1}{2\sigma^2} \sum_i \left( \tilde{y}_i - x_i^T \beta \right)^2 + \frac{1}{2\sigma_\beta^2} \sum_j \beta_j^2 \end{align*}
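
With this Gaussian prior the MAP objective is the ridge-regression (Tikhonov) least-squares problem, and setting its gradient to zero gives a closed form. A sketch, again reusing the assumed synthetic data and treating sigma and an assumed prior scale sigma_beta as known:

```python
sigma_beta = 1.0                # assumed prior standard deviation for beta_j
lam = sigma**2 / sigma_beta**2  # effective ridge weight after scaling by 2 sigma^2

def neg_log_posterior(beta):
    r = y_tilde - X @ beta
    return 0.5 / sigma**2 * np.sum(r**2) + 0.5 / sigma_beta**2 * np.sum(beta**2)

beta_map = minimize(neg_log_posterior, x0=np.zeros(2)).x
# closed form: (X^T X + lam I) beta = X^T y~
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_tilde)
print(np.allclose(beta_map, beta_ridge, atol=1e-5))
```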

Different priors will of course give different forms for the second term. For example, choosing the components of \(\beta\) to follow a Laplace prior with density \(p(\beta_j) = \frac{\lambda}{2} \exp{-\lambda |\beta_j|}\) gives:

\begin{equation*} \beta^* = \argmin_\beta \frac{1}{2\sigma^2} \sum_i \left( \tilde{y}_i - x_i^T \beta \right)^2 + \lambda \sum_j |\beta_j| \end{equation*}
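
This is the L1-penalized (lasso-style) objective. Because the \(|\beta_j|\) terms are not differentiable at zero, the quick sketch below minimizes it with a derivative-free method rather than a closed form; lam_l1 is an assumed value for \(\lambda\), and the synthetic data from the first sketch is reused.

```python
lam_l1 = 0.1  # assumed value for the Laplace rate lambda

def neg_log_posterior_l1(beta):
    r = y_tilde - X @ beta
    return 0.5 / sigma**2 * np.sum(r**2) + lam_l1 * np.sum(np.abs(beta))

# the absolute-value penalty is non-smooth, so avoid gradient-based methods here
beta_l1 = minimize(neg_log_posterior_l1, x0=np.zeros(2), method="Nelder-Mead").x
print(beta_l1)
```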