Likelihood and maximum likelihood estimation


Lecture

2023-09-06

Motivation

We have some parametric statistical model with unknown parameters. We want to evaluate how consistent the data are with different values of the parameters, and to find the values of the parameters that are most consistent with the data.

Today

Today’s lecture contains a few key ideas and a lot of slides.

  • Lots of examples
  • We will move quickly through examples
  • Review the examples yourself and ask questions on Canvas
  • Lab will build on the examples (use them as a reference!)

Likelihood

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Definition

The likelihood is the probability of the data given some parameters: \[ p(y | \theta) \]

Often, we want to study how the likelihood changes for different values of \(\theta\), holding \(y\) fixed. This is just \(p(y | \theta)\) for a range of \(\theta\).

Note

You will sometimes see this referred to as \(\mathcal{L}(\theta)\). That notation emphasizes that we treat the likelihood as a function of \(\theta\), but it can be confusing because it hides the dependence on the data \(y\).

Likelihood example

We can plot \(p(y | \theta)\) for different values of \(\theta\). To do that, we need a function for \(p(y | \theta)\). We will consider \(y \sim \mathcal{N}(\mu, \sigma)\) so \(\theta = \left\{ \mu, \sigma \right\}\).
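
One way to write this in Julia, as a minimal sketch (the function name lik and the argument order are choices, not requirements):

```julia
using Distributions

# likelihood of a single observation y under a Normal(μ, σ) model
function lik(y::Real, μ::Real, σ::Real)
    return pdf(Normal(μ, σ), y)   # p(y | μ, σ)
end
```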

  1. The ::Real annotations are “type annotations” and specify the type of variable that each argument can take. In this case, any Real (float or integer) will work. Read more in the docs.
  2. This specifies the likelihood using the pdf function

Next we plug in some values for \(\mu\) and plot the likelihood for each. This is essentially plotting the likelihood as a function of \(\theta\) for fixed \(y\).
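
A sketch of what this might look like, assuming a single hypothetical observation y = 2 and the lik function defined above:

```julia
using Plots

y = 2.0                              # one hypothetical observation
σ = 1.0                              # hold σ fixed
μ_try = range(-2, 6; length=200)     # candidate values of μ

liks = lik.(y, μ_try, σ)             # broadcast lik over each candidate μ
plot(μ_try, liks; xlabel="μ", ylabel="p(y | μ, σ)", label=false)
```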

  1. The dot in lik. applies the function lik to each element of μ_try (broadcasting). A comprehension like [lik(y, μi, σ) for μi in μ_try] would do the same thing.
  2. Notice the likelihood function is maximized at \(\mu = y\).

Likelihood of multiple data points

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

IID assumption

Independent and identically distributed (i.i.d.) assumption: \[ \begin{align} p(y_1, y_2, \ldots, y_n) &= p(y_1) p(y_2) \times \ldots \times p(y_n)\\ &= \prod_{i=1}^n p(y_i) \end{align} \]

Usually we have more than one data point. Say we measure \(y = y_1, y_2, \ldots, y_n\): \[ p(y | \theta) = \prod_{i=1}^n p(y_i | \theta) \]

Log trick

Recall: \(\log(AB) = \log(A) + \log(B)\) or, more generally, \[ \log \left( \prod_{i=1}^n f_i \right) = \sum_{i=1}^n \log(f_i) \]

Thus, we can work with the “log likelihood”: \[ \log p(y | \theta) = \log \left( \prod_{i=1}^n p(y_i | \theta) \right) = \sum_{i=1}^n \log \left( p(y_i | \theta) \right) \]

Multiplying many small probabilities can underflow to zero; summing their logarithms is numerically stable.

Numerical example: multiple data points

We can extend our previous example to multiple data points. As before, we need a likelihood function.
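
A sketch of one possible implementation (assuming Distributions is loaded; this adds a method to the lik function defined earlier):

```julia
# likelihood of a vector of i.i.d. observations y under a Normal(μ, σ) model
function lik(y::Vector{<:Real}, μ::Real, σ::Real)
    log_liks = logpdf.(Normal(μ, σ), y)   # log p(y_i | μ, σ) for each data point
    return exp(sum(log_liks))             # sum the logs, then exponentiate
end
```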

  1. Vector{<:Real} means a vector of any subtype of Real. Julia uses “multiple dispatch” which means that we can have multiple functions with the same name but that do different things depending on what the type of the arguments is.
  2. logpdf function (from Distributions) is the log of the pdf. Here log_liks will be a vector with the same length as y.
  3. Add up all the log likelihoods, then exponentiate – equivalent to taking the product of the likelihoods.

As before, we can plot the likelihood as a function of \(\mu\) for fixed \(y\) and \(\sigma\).
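
A sketch, using a hypothetical data vector y (the specific values are placeholders):

```julia
y = [1.2, 2.3, 1.8, 2.9, 2.1]              # hypothetical observations
σ = 1.0
μ_try = range(0, 4; length=200)

liks = [lik(y, μi, σ) for μi in μ_try]     # comprehension over candidate μ
plot(μ_try, liks; xlabel="μ", ylabel="p(y | μ, σ)", label=false)
```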

  1. In this case both \(\mu\) and \(y\) are vectors, with different lengths, so using the dot notation lik. won’t work – it doesn’t know which variable to vectorize over.

Poisson example

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Setup

We collect \(y_1, \ldots, y_n\) which are the number of tropical cyclones that make landfall in the continental United States in a given year. We decide to model them as a Poisson distribution with unknown rate \(\lambda\): \[ p(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \]

Total log likelihood

  1. Take the log \[ \log p(y_i | \lambda) = y_i \log(\lambda) - \lambda - \log(y_i!) \]
  2. For multiple data points \[ \log p(y | \lambda) = \log(\lambda) \sum_{i=1}^n y_i - n \lambda - \sum_{i=1}^n \log(y_i!) \]
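
A sketch of the corresponding code (the name poisson_loglik is a choice; assumes Distributions is loaded and y is a vector of counts):

```julia
# total Poisson log likelihood for count data y at rate λ
function poisson_loglik(λ::Real, y::Vector{<:Integer})
    return sum([logpdf(Poisson(λ), yi) for yi in y])   # list comprehension over data points
end
```
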
  1. I am deliberately showing different ways to implement the same thing in different examples. Here we use a list comprehension. We could instead use, for example, vector notation: logpdf.(Poisson(λ), y). Performance differences are usually negligible, focus on readability.

Plot
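
A sketch of the plot, using hypothetical landfall counts:

```julia
y = [3, 5, 2, 4, 6, 3, 1, 4]       # hypothetical annual landfall counts
λ_try = range(0.1, 10; length=200)
plot(λ_try, [poisson_loglik(λ, y) for λ in λ_try];
     xlabel="λ", ylabel="log p(y | λ)", label=false)
```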

Multivariate example

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Setup

Let’s say we don’t know for sure that \(\sigma = 1\). In that case our mathematical model for \(p(y | \mu, \sigma)\) is unchanged, but now both \(\mu\) and \(\sigma\) are unknown parameters: \(\theta = \left\{ \mu, \sigma \right\}\).

Let’s write a function for the log likelihood
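
A minimal sketch (the name normal_loglik and the argument order are choices):

```julia
# log likelihood of data y under a Normal(μ, σ) model, as a function of both parameters
function normal_loglik(μ::Real, σ::Real, y::Vector{<:Real})
    return sum(logpdf.(Normal(μ, σ), y))
end
```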

Plotting

With two parameters, we need to plot a surface
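
A sketch of the plotting code (the grid ranges are illustrative; assumes normal_loglik and the data y from the previous slides):

```julia
μ_try = range(0, 4; length=100)
σ_try = range(0.1, 3; length=100)

# matrix with lik_plot[i, j] = log likelihood at (μ_try[i], σ_try[j])
lik_plot = [normal_loglik(μi, σi, y) for μi in μ_try, σi in σ_try]

# transpose so that μ runs along the x axis and σ along the y axis
plot(μ_try, σ_try, lik_plot'; st=:heatmap, xlabel="μ", ylabel="σ")
```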

  1. This syntax: z = [f(x, y) for xi in x, yi in y] will produce a matrix with z[i, j] = f(x[i], y[j]).
  2. Due to a quirk of syntax, we need to transpose the matrix lik_plot to get the correct orientation. Here, lik_plot' is equivalent to transpose(lik_plot).
  3. st=:heatmap tells Plots to plot a heatmap. We could also try :surface or :contourf.

Notice that there is a very small region for which the likelihood is [relatively] high.

Maximum likelihood estimation

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Logic

Can we find the parameters \(\theta^*\) that maximize the likelihood \(p(y | \theta)\)?

Log likelihood

We can use the log likelihood \(\log p(y | \theta)\) instead of the likelihood \(p(y | \theta)\).

The log likelihood is monotonic with the likelihood, so \[ \arg \max \log p(y | \theta) = \arg \max p(y | \theta) \]

Analytic solution

Solving things analytically takes time up-front, but can be much faster to run because you can avoid the optimization step. Consider the (potentially multivariate) Gaussian example with known covariance matrix \(\Sigma\). We want to maximize the log likelihood \[ \sum_{i=1}^n \log p(y_i | \mu, \Sigma) \]

To maximize, we set its derivative with respect to \(\mu\), which we’ll denote with \(\nabla_\mu\), to zero: \[ \sum_{i=1}^n \nabla_\mu \log p(y_i | \mu, \Sigma) = 0 \]

Substituting in the multivariate Gaussian likelihood we get: \[ \begin{aligned} 0 &= \sum_{i=1}^n \nabla_\mu \log \left[ \frac{1}{\sqrt{(2 \pi)^d |\Sigma|}} \exp \left( -\frac{1}{2} \left( y_i - \mu \right)^{\top} \Sigma^{-1} \left( y_i - \mu \right) \right) \right] \\ &= \sum_{i=1}^n \nabla_\mu \left[ \log \frac{1}{\sqrt{(2 \pi)^d |\Sigma|}} - \frac{1}{2} \left( y_i - \mu \right)^{\top} \Sigma^{-1} \left( y_i - \mu \right) \right] \\ &= \sum_{i=1}^n \Sigma^{-1} \left( y_i - \mu \right) \\ 0 &= \sum_{i=1}^n \left( y_i - \mu \right) \\ \mu &= \frac{1}{n} \sum_{i=1}^n y_i \end{aligned} \]

Note

You are not expected to remember the above equations and I won’t ask you to do this derivation in a time-constrained exam. You should understand the general procedure:

  1. write down log likelihood for all data points
    1. write down likelihood for one data point
    2. write down log likelihood for one data point
    3. sum over all data points
  2. take \(\frac{d}{d\theta}\) and set equal to zero to maximize
  3. solve for \(\theta^*\).

Numerical approach I

We can use the optimize function from the Optim.jl package to find the maximum likelihood estimate. First, we need to define the function to be optimized. optimize will minimize the function, so we need to define the negative log likelihood. We’ll call this the “loss” function.
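
A sketch of such a loss function (assumes Distributions is loaded and y is the observed data vector):

```julia
# negative log likelihood, with all parameters packed into a single vector θ
function loss(θ)
    μ, σ = θ[1], θ[2]
    return -sum(logpdf.(Normal(μ, σ), y))
end
```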

  1. Note that this function takes in a single argument which is a vector of parameters. We’ll call this vector θ but it doesn’t matter what we call it.

Numerical approach II

Now we can run the optimization. Since \(\sigma > 0\) always, we will pass along bounds. We could alternatively do something clever like work with \(\log \sigma\) instead of \(\sigma\).
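
One way this might be set up (a sketch; loss is the function defined on the previous slide, and the bounds and initial guess follow the notes below):

```julia
using Optim

lower = [-Inf, 1e-6]          # σ > 0, so use a "pretty small" positive lower bound
upper = [Inf, Inf]            # no upper bound; Inf is allowed
θ₀ = [1.0, 1.0]               # initial guess: μ = σ = 1

res = optimize(loss, lower, upper, θ₀)   # run the box-constrained optimization
θ_mle = Optim.minimizer(res)             # parameters that minimize the loss
```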

2-element Vector{Float64}:
 1.97000000014951
 0.7590125165166877
  1. The lower bound is actually zero, but we just set it to a “pretty small” number.
  2. The upper bound is infinity; we can pass in Inf.
  3. We need to pass in a guess for the parameters. We’ll just use \(\mu = \sigma = 1\).
  4. This will actually run the optimization
  5. This will extract the parameters that minimize the loss function.

We could convert this to a Distributions object as
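shown in the following sketch (assuming θ_mle holds the fitted parameters from the optimization above):

```julia
mle_dist = Normal(θ_mle[1], θ_mle[2])   # plug the MLE for μ and σ into a Normal distribution
```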

Normal{Float64}(μ=1.97000000014951, σ=0.7590125165166877)

Regression example

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Overview

Let’s consider the generic regression problem where we have paired observations \(\left\{x_i, y_i\right\}_{i=1}^n\). In general, we can write this regression as \[ y_i | \alpha, \beta, \sigma \sim \mathcal{N}(\alpha + x_i \beta, \sigma^2) \] where \(x_i\) and \(\beta\) may be vectors.

We can create some raw data for this example.
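
A sketch of one way to generate such data (the "true" parameter values and sample size are placeholders, not necessarily the ones used for the plots):

```julia
using Distributions, Random
Random.seed!(42)

n = 100
α_true, β_true, σ_true = -3.0, 2.5, 1.5                  # placeholder "true" values
x = rand(Normal(0, 1), n)
y = α_true .+ β_true .* x .+ rand(Normal(0, σ_true), n)   # data from the regression model
```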

Analytic approach

We can use the same approach to derive the maximum likelihood estimate for linear regression:

  1. Write the likelihood for one data point
  2. Write the log likelihood for one data point
  3. Write the log likelihood for all data points
  4. Take \(\frac{d}{d\theta}\) and set equal to zero to maximize

If you want a walkthrough, see Ryan Adams’s lecture notes starting at about equation 11.

Numerical optimization I

As before, we need to write down a (log) likelihood function
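
A sketch, packing the intercept, slope, and standard deviation into a single parameter vector θ (assumes the x and y from the data slide):

```julia
# negative log likelihood for the linear regression model
# θ[1] = intercept α, θ[2] = slope β, θ[3] = standard deviation σ
function reg_loss(θ)
    α, β, σ = θ[1], θ[2], θ[3]
    μ = α .+ β .* x                          # predicted mean for each observation
    return -sum(logpdf.(Normal.(μ, σ), y))
end
```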

Numerical optimization II
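
A sketch of the optimization, mirroring the earlier example (reg_loss is the loss function from the previous slide; the bounds and initial guess are illustrative):

```julia
lower = [-Inf, -Inf, 1e-6]     # σ must be positive, so its lower bound is a small number
upper = [Inf, Inf, Inf]
θ₀ = [0.0, 0.0, 1.0]

res = optimize(reg_loss, lower, upper, θ₀)
round.(Optim.minimizer(res); digits=3)     # fitted [α, β, σ], rounded to three decimals
```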

3-element Vector{Float64}:
 -3.147
  2.392
  1.562
  1. This is the loss function, which is the negative log likelihood. The first value of \(\theta\) is the intercept, the second is the slope, and the third is the standard deviation.
  2. We need to pass in bounds for the parameters. The standard deviation is always positive, so we set the lower bound to a small number.
  3. This will extract the parameters that minimize the loss function and round to show three decimal places.

Parallel: least squares

If we work through the math, we can show that the log likelihood for the linear regression problem is \[ \log p(y | X, \beta, \sigma) = -\frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \left( X \beta - y \right)^\top \left( X \beta - y \right) \]

Linear algebra notation

There is no intercept here! This is a common notation and assumes that the first column of \(X\) is all ones. That is equivalent to writing down an intercept, but lets us use linear algebra notation and keep track of fewer variable names.

From this, we can show that terms drop out and \[ \beta^\text{MLE} = \arg \min_\beta \left( X \beta - y \right)^\top \left( X \beta - y \right) \] which is exactly the least squares problem (minimize squared error): \[ \min_{\beta} \sum_{i=1}^n (y_i - y_i^\text{pred})^2 \]

Key point

“Least squares can be interpreted as assuming Gaussian noise, and particular choices of likelihood can be interpreted directly as (usually exponentiated) loss functions” –Adams

If we then want to estimate \(\sigma\), we can estimate the standard deviation of the residuals.
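
A sketch of this connection in code: build the design matrix with a leading column of ones, solve the least squares problem, then estimate \(\sigma\) from the residuals (assumes the x and y from the earlier slides):

```julia
using Statistics

X = [ones(length(x)) x]        # first column of ones plays the role of the intercept
β_ls = X \ y                   # least squares solution, which equals the MLE for β
σ_hat = std(y - X * β_ls)      # estimate σ as the standard deviation of the residuals
```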

Wrapup

Today

  1. Likelihood

  2. Likelihood of multiple data points

  3. Poisson example

  4. Multivariate example

  5. Maximum likelihood estimation

  6. Regression example

  7. Wrapup

Don’t get it twisted

Many people get this backwards!

  • The likelihood is the probability of the data given the parameters: \(p(y | \theta)\).
  • We often plot the likelihood for many different \(\theta\)
    • \(p(y | \theta)\) for many different \(\theta\)
  • Don’t confuse this with the posterior, which is the probability of the parameters given the data: \(p(\theta | y)\)

Logistics

  • Friday:
    • Lab 03 in class – look for GH Classroom link on Canvas
    • Lab 02 due
  • Next week:
    • Bayesian inference

References