Lecture
2023-09-06
We have some parametric statistical model with unknown parameters. We want to evaluate how consistent the data are with different values of the parameters, and to find the values of the parameters that are most consistent with the data.
Today’s lecture contains a few key ideas and a lot of slides.
## Today

- Likelihood
- Likelihood of multiple data points
- Poisson example
- Multivariate example
- Maximum likelihood estimation
- Regression example
- Wrapup

## Likelihood
The likelihood is the probability of the data given some parameters: \[ p(y | \theta) \]
Often, we want to study how the likelihood changes for different values of \(\theta\), holding \(y\) fixed. This is just \(p(y | \theta)\) for a range of \(\theta\).
Note
You will sometimes see this referred to as \(\mathcal{L}(\theta)\), which is a confusing notation…
We can plot \(p(y | \theta)\) for different values of \(\theta\). To do that, we need a function for \(p(y | \theta)\). We will consider \(y \sim \mathcal{N}(\mu, \sigma)\) so \(\theta = \left\{ \mu, \sigma \right\}\).
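Here is a minimal sketch of what such a function might look like in Julia (the name `lik` follows the notes; the implementation itself is an assumption):

```julia
using Distributions

# Likelihood of a single observation y under a Normal(μ, σ) model
function lik(y::Real, μ::Real, σ::Real)
    return pdf(Normal(μ, σ), y)
end
```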
The `::Real` annotations are called "type annotations" and specify the type of variable that each argument can take. In this case, any `Real` (float or integer) will work. Read more in the docs. The `pdf` function (from `Distributions`) evaluates the probability density function.

Next we plug in some values for \(\mu\) and plot the likelihood for each. This is essentially plotting the likelihood as a function of \(\theta\) for fixed \(y\).
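For example (a sketch; the observed value `y_obs` and the grid `μ_try` are assumptions):

```julia
using Plots

y_obs = 2.0                        # a single, hypothetical observation
μ_try = range(-2, 6; length=101)   # candidate values of μ
liks = lik.(y_obs, μ_try, 1)       # likelihood at each μ, holding σ = 1
plot(μ_try, liks; xlabel="μ", ylabel="p(y | μ, σ = 1)", label=false)
```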
`lik.` means to apply the function `lik` elementwise, here to each element of `μ_try`. A comprehension such as `[lik(y_obs, μi, 1) for μi in μ_try]` would do the same thing.
## Likelihood of multiple data points
Usually we have more than one data point. Say we measure \(y = y_1, y_2, \ldots, y_n\). Under the independent and identically distributed (i.i.d.) assumption, \[ \begin{align} p(y_1, y_2, \ldots, y_n) &= p(y_1) p(y_2) \times \ldots \times p(y_n)\\ &= \prod_{i=1}^n p(y_i), \end{align} \] so \[ p(y | \theta) = \prod_{i=1}^n p(y_i | \theta). \]
Recall: \(\log(AB) = \log(A) + \log(B)\) or, more generally, \[ \log \left( \prod_{i=1}^n f_i \right) = \sum_{i=1}^n \log(f_i) \]
Thus, we can work with the “log likelihood”: \[ \log p(y | \theta) = \log \left( \prod_{i=1}^n p(y_i | \theta) \right) = \sum_{i=1}^n \log \left( p(y_i | \theta) \right) \]
Summing log probabilities is more numerically stable than multiplying many small probabilities, which can underflow to zero.
We can extend our previous example to multiple data points. As before, we need a (log) likelihood function.
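A sketch of what this might look like (the exact code from lecture isn't shown here; names like `log_lik` and `log_liks` follow the notes):

```julia
using Distributions

# Total log likelihood of a vector of observations y under Normal(μ, σ)
function log_lik(y::Vector{<:Real}, μ::Real, σ::Real)
    log_liks = logpdf.(Normal(μ, σ), y)  # one log density per observation
    return sum(log_liks)
end
```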
`Vector{<:Real}` means a vector of any subtype of `Real`. Julia uses "multiple dispatch", which means we can have multiple methods with the same name that do different things depending on the types of the arguments.

The `logpdf` function (from `Distributions`) is the log of the pdf. Here `log_liks` will be a vector with the same length as `y`.

As before, we can plot the likelihood as a function of \(\mu\) for fixed \(y\) and \(\sigma\).
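A sketch, assuming the `log_lik` function above and a data vector `y` (the grid `μ_try` is an assumption):

```julia
using Plots

μ_try = range(-2, 6; length=101)
# Use a comprehension so that only μ varies (see the note below)
log_liks_μ = [log_lik(y, μi, 1) for μi in μ_try]
plot(μ_try, log_liks_μ; xlabel="μ", ylabel="log p(y | μ, σ = 1)", label=false)
```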
Broadcasting with `log_lik.` won't work here – broadcasting doesn't know which argument we want to vectorize over (it would also try to iterate over `y`), so we use a comprehension instead.
## Poisson example
We collect \(y_1, \ldots, y_n\) which are the number of tropical cyclones that make landfall in the continental United States in a given year. We decide to model them as a Poisson distribution with unknown rate \(\lambda\): \[ p(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \]

## Total log likelihood
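A sketch, assuming `y` is the vector of annual landfall counts (the function name is an assumption):

```julia
using Distributions

# Total log likelihood of the counts y under a Poisson(λ) model
function poisson_log_lik(λ::Real, y::Vector{<:Integer})
    return sum(logpdf(Poisson(λ), yi) for yi in y)
end
```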
An equivalent alternative is to broadcast and sum: `sum(logpdf.(Poisson(λ), y))`. Performance differences are usually negligible; focus on readability.
## Multivariate example
Let's say we don't know for sure that \(\sigma = 1\). In that case our mathematical model for \(p(y | \mu, \sigma)\) is unchanged from the single-parameter case.
Let’s write a function for the log likelihood
With two parameters, we need to plot a surface
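A sketch of how this might look (the grid ranges are assumptions; `y` is the data vector):

```julia
using Distributions, Plots

# Total log likelihood, now treating both μ and σ as unknown
log_lik(y::Vector{<:Real}, μ::Real, σ::Real) = sum(logpdf.(Normal(μ, σ), y))

μ_try = range(0, 4; length=101)
σ_try = range(0.1, 3; length=101)

# lik_plot[i, j] = log_lik(y, μ_try[i], σ_try[j])
lik_plot = [log_lik(y, μi, σi) for μi in μ_try, σi in σ_try]

# Transpose so that μ is on the horizontal axis and σ on the vertical axis
plot(μ_try, σ_try, lik_plot'; st=:heatmap, xlabel="μ", ylabel="σ")
```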
`z = [f(xi, yi) for xi in x, yi in y]` produces a matrix with `z[i, j] = f(x[i], y[j])`.

We transpose `lik_plot` to get the correct orientation; `lik_plot'` is equivalent to `transpose(lik_plot)`.

`st=:heatmap` tells Plots to draw a heatmap. We could also try `:surface` or `:contourf`.

Notice that there is a very small region for which the likelihood is (relatively) high.
## Maximum likelihood estimation
Can we find the parameters \(\theta^*\) that maximize the likelihood \(p(y | \theta)\)?
We can use the log likelihood \(\log p(y | \theta)\) instead of the likelihood \(p(y | \theta)\).
The logarithm is a monotonically increasing function, so \[ \arg \max_\theta \log p(y | \theta) = \arg \max_\theta p(y | \theta) \]
Solving things analytically takes time up-front, but can be much faster to run because you avoid the optimization step. Consider the (potentially multivariate) Gaussian example with known covariance matrix \(\Sigma\). We want to maximize the log likelihood \[ \sum_{i=1}^n \log p(y_i | \mu, \Sigma) \]
To maximize, we set its derivative with respect to \(\mu\), which we’ll denote with \(\nabla_\mu\), to zero: \[ \sum_{i=1}^n \nabla_\mu \log p(y_i | \mu, \Sigma) = 0 \]
Substituting in the multivariate Gaussian density (and noting that the normalizing constant does not depend on \(\mu\)) we get: \[ \begin{aligned} 0 & =\sum_{i=1}^n \nabla_\mu \log \left[\frac{1}{\sqrt{(2 \pi)^d|\Sigma|}} \exp \left(-\frac{1}{2}\left(y_i-\mu\right)^{\top} \Sigma^{-1}\left(y_i-\mu\right)\right)\right] \\ & =\sum_{i=1}^n \nabla_\mu\left[\log \left(\frac{1}{\sqrt{(2 \pi)^d|\Sigma|}}\right)-\frac{1}{2}\left(y_i-\mu\right)^{\top} \Sigma^{-1}\left(y_i-\mu\right)\right] \\ & =\sum_{i=1}^n \nabla_\mu\left(-\frac{1}{2}\left(y_i-\mu\right)^{\top} \Sigma^{-1}\left(y_i-\mu\right)\right)\\ &=\sum_{i=1}^n \Sigma^{-1}\left(y_i-\mu\right). \end{aligned} \] Multiplying both sides by \(\Sigma\) gives \[ 0 = \sum_{i=1}^n (y_i - \mu) \quad \Longrightarrow \quad \mu = \frac{1}{n} \sum_{i=1}^n y_i. \]
Note
You are not expected to remember the above equations and I won't ask you to reproduce this derivation in a time-constrained exam. You should understand the general procedure: write down the log likelihood, take its gradient with respect to the parameters, set the gradient to zero, and solve.
We can use the `optimize` function from the `Optim.jl` package to find the maximum likelihood estimate. First, we need to define the function to be optimized. `optimize` will minimize the function, so we need to define the negative log likelihood. We'll call this the "loss" function.
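A sketch, assuming `y` is the data vector from before:

```julia
using Distributions

# Negative log likelihood of the data y under Normal(θ[1], θ[2]);
# optimize minimizes, so we return the negative log likelihood
function loss(θ)
    μ, σ = θ
    return -sum(logpdf.(Normal(μ, σ), y))
end
```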
We call the argument `θ`, but it doesn't matter what we call it.

Now we can run the optimization. Since \(\sigma > 0\) always, we will pass along bounds. We could alternatively do something clever, like working with \(\log \sigma\) instead of \(\sigma\).
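A sketch of the optimization call (the starting point and bound values are assumptions; `Fminbox` performs box-constrained optimization):

```julia
using Optim

lower = [-Inf, 0.0]   # μ is unbounded; σ must be positive
upper = [Inf, Inf]
θ0 = [0.0, 1.0]       # starting guess for [μ, σ]
res = optimize(loss, lower, upper, θ0, Fminbox(LBFGS()))
θ_MLE = Optim.minimizer(res)
```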
2-element Vector{Float64}:
1.97000000014951
0.7590125165166877
We could convert this to a `Distributions` object:
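For example (a sketch, reusing the hypothetical `θ_MLE` from above):

```julia
using Distributions

fitted = Normal(θ_MLE[1], θ_MLE[2])
```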
Normal{Float64}(μ=1.97000000014951, σ=0.7590125165166877)
## Regression example
Let's consider the generic regression problem where we have paired observations \(\left\{x_i, y_i\right\}_{i=1}^n\). In general, we can write this regression as \[ y_i | \alpha, \beta, \sigma \sim \mathcal{N}(\alpha + x_i \beta, \sigma^2) \] where \(x_i\) and \(\beta\) may be vectors.
We can create some synthetic raw data.
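A sketch of how such data might be generated (the sample size and "true" parameter values are assumptions, not necessarily those used in lecture):

```julia
using Distributions, Random

Random.seed!(123)
n = 100
x = rand(Uniform(0, 10), n)
α_true, β_true, σ_true = -3.0, 2.5, 1.5   # hypothetical "true" parameters
y = α_true .+ β_true .* x .+ rand(Normal(0, σ_true), n)
```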
We can use the same approach to derive the maximum likelihood estimate for linear regression:
If you want a walkthrough, see Ryan Adams’s lecture notes starting at about equation 11.
As before, we need to write down a (log) likelihood function
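A sketch, assuming the `x` and `y` from above and writing \(\theta = (\alpha, \beta, \sigma)\):

```julia
using Distributions, Optim

# Log likelihood of the regression model y_i ~ Normal(α + β * x_i, σ)
function reg_log_lik(θ, x, y)
    α, β, σ = θ
    return sum(logpdf.(Normal.(α .+ β .* x, σ), y))
end

loss(θ) = -reg_log_lik(θ, x, y)

lower = [-Inf, -Inf, 0.0]   # σ must be positive
upper = [Inf, Inf, Inf]
θ0 = [0.0, 0.0, 1.0]        # starting guess for [α, β, σ]
res = optimize(loss, lower, upper, θ0, Fminbox(LBFGS()))
θ_MLE = Optim.minimizer(res)
```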
3-element Vector{Float64}:
-3.147
2.392
1.562
If we work through the math, we can show that the log likelihood for the linear regression problem is \[ \log p(y | X, \beta, \sigma) = -\frac{N}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \left( X \beta - y \right)^T \left( X \beta - y \right) \]
Linear algebra notation
There is no intercept here! This is a common notation and assumes that the first column of \(X\) is all ones. That is equivalent to writing down an intercept, but lets us use linear algebra notation and keep track of fewer variable names.
From this, we can show that the terms that do not depend on \(\beta\) drop out and \[ \beta^\text{MLE} = \arg \min_\beta \left( X \beta - y \right)^T \left( X \beta - y \right), \] which is exactly the least squares problem (minimize squared error): \[ \min_{\beta} \sum_{i=1}^n (y_i - y_i^\text{pred})^2, \quad y_i^\text{pred} = (X \beta)_i \]
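As a concrete sketch (assuming the `x` and `y` from above), the least-squares solution can also be computed in closed form:

```julia
# Design matrix with a column of ones for the intercept
X = hcat(ones(length(x)), x)

# The backslash operator solves the least-squares problem,
# equivalent to (X'X) \ (X'y)
β_LS = X \ y
```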
Key point
“Least squares can be interpreted as assuming Gaussian noise, and particular choices of likelihood can be interpreted directly as (usually exponentiated) loss functions” –Adams
If we then want to estimate \(\sigma\), we can estimate the standard deviation of the residuals.
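For example (a sketch, continuing with the hypothetical `X` and `β_LS` from above):

```julia
using Statistics

residuals = y .- X * β_LS
σ_hat = std(residuals)   # the MLE uses 1/n rather than 1/(n - 1); negligible for large n
```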
## Wrapup
The likelihood is the probability of the data given the parameters, \(p(y | \theta)\), not the probability of the parameters given the data. Many people get this backwards!