Lecture

2023-09-06

We have some *parametric statistical model* with unknown parameters. We want to evaluate how consistent the data are with different values of the parameters, and to find the values of the parameters that are most consistent with the data.

Today’s lecture contains a few key ideas and a lot of slides.

- Lots of examples
- We will move quickly through examples
- Review the examples yourself and ask questions on Canvas
- Lab will build on the examples (use them as a reference!)

Today

- Likelihood
- Likelihood of multiple data points
- Poisson example
- Multivariate example
- Maximum likelihood estimation
- Regression example
- Wrapup

## Likelihood

The likelihood is the probability of the **data** given some **parameters**: \[
p(y | \theta)
\]

Often, we want to study how the likelihood changes for different values of \(\theta\), holding \(y\) fixed. This is just \(p(y | \theta)\) for a range of \(\theta\).

**Note**

You will sometimes see this referred to as \(\mathcal{L}(\theta)\), which is a confusing notation…

We can plot \(p(y | \theta)\) for different values of \(\theta\). To do that, we need a function for \(p(y | \theta)\). We will consider \(y \sim \mathcal{N}(\mu, \sigma)\) so \(\theta = \left\{ \mu, \sigma \right\}\).

- The `::T` suffixes are called "type annotations" and specify the type of variable that each argument can take. In this case, any `Real` (float or integer) will work. Read more in the docs.
- This specifies the likelihood using the `pdf` function, as in the sketch below.
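A minimal sketch of such a function (the name `lik` is taken from the later annotations; the argument order is an assumption):

```julia
using Distributions

# Likelihood of a single observation y under a Normal(μ, σ) model.
# The ::Real annotations accept any float or integer argument.
function lik(y::Real, μ::Real, σ::Real)
    return pdf(Normal(μ, σ), y)
end
```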

Next we plug in some values for \(\mu\) and plot the likelihood for each. This is essentially plotting the likelihood as a function of \(\theta\) for fixed \(y\).

- The vector notation `lik.` means to apply the function `lik` to each element of `μ_try`; a comprehension like `[lik(y, μi, σ) for μi in μ_try]` would do the same thing.
- Notice the likelihood function is maximized at \(\mu = y\). A sketch of the plot follows.
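A sketch of the plotting step, with hypothetical values for \(y\) and \(\sigma\):

```julia
using Plots

y = 2.0                          # hypothetical single observation
σ = 1.0                          # hypothetical, assumed known
μ_try = range(-2, 6; length=200) # candidate values of μ

# broadcast lik over each candidate μ, holding y and σ fixed
plot(μ_try, lik.(y, μ_try, σ); xlabel="μ", ylabel="p(y | μ, σ)", label=false)
```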

## Likelihood of multiple data points

Independent and identically distributed (i.i.d.) assumption: \[ \begin{align} p(y_1, y_2, \ldots, y_n) &= p(y_1) p(y_2) \times \ldots \times p(y_n)\\ &= \prod_{i=1}^n p(y_i) \end{align} \]

Usually we have more than one data point. Say we measure \(y = y_1, y_2, \ldots, y_n\): \[ p(y | \theta) = \prod_{i=1}^n p(y_i | \theta) \]

Recall: \(\log(AB) = \log(A) + \log(B)\) or, more generally, \[ \log \left( \prod_{i=1}^n f_i \right) = \sum_{i=1}^n \log(f_i) \]

Thus, we can work with the “log likelihood”: \[ \log p(y | \theta) = \log \left( \prod_{i=1}^n p(y_i | \theta) \right) = \sum_{i=1}^n \log \left( p(y_i | \theta) \right) \]

Adding log probabilities is more numerically stable than multiplying many small probabilities, which can underflow.

We can extend our previous example to multiple data points. As before, we need a likelihood function.

- `Vector{<:Real}` means a vector of any subtype of `Real`. Julia uses "multiple dispatch", which means that we can have multiple functions with the same name that do different things depending on the types of the arguments.
- The `logpdf` function (from `Distributions`) is the log of the pdf. Here `log_liks` will be a vector with the same length as `y`.
- Add up all the log likelihoods, then take the exponent, which is equivalent to the product of the likelihoods. A sketch of this function follows.
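A sketch, leaning on multiple dispatch to reuse the name `lik` from the scalar method above:

```julia
# Likelihood of a vector of i.i.d. observations under Normal(μ, σ).
# Dispatches on Vector{<:Real}, so it coexists with the scalar method.
function lik(y::Vector{<:Real}, μ::Real, σ::Real)
    log_liks = logpdf.(Normal(μ, σ), y)  # one log density per observation
    return exp(sum(log_liks))            # sum of logs, then exponentiate
end
```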

As before, we can plot the likelihood as a function of \(\mu\) for fixed \(y\) and \(\sigma\).

- In this case both \(\mu\) and \(y\) are vectors with different lengths, so the dot notation `lik.` won't work: it doesn't know which variable to vectorize over. A comprehension, sketched below, makes the loop explicit.
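A sketch using a comprehension instead (the data values here are hypothetical):

```julia
y = [1.2, 2.5, 1.8, 2.1]                # hypothetical observations
σ = 1.0
μ_try = range(0, 4; length=200)

liks = [lik(y, μi, σ) for μi in μ_try]  # vary μ, hold y and σ fixed
plot(μ_try, liks; xlabel="μ", ylabel="p(y | μ, σ)", label=false)
```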

## Poisson example

We collect \(y_1, \ldots, y_n\), the number of tropical cyclones that make landfall in the continental United States in a given year. We decide to model them as a Poisson distribution with unknown rate \(\lambda\): \[ p(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \]

## Total log likelihood

- Take the log \[ \log p(y_i | \lambda) = y_i \log(\lambda) - \lambda - \log(y_i!) \]
- For multiple data points \[ \log p(y | \lambda) = \sum_{i=1}^n y_i \log(\lambda) - n \lambda - \sum_{i=1}^n \log(y_i!) \]

- I am deliberately showing different ways to implement the same thing in different examples. Here we use a list comprehension; we could instead use vector notation: `logpdf.(Poisson(λ), y)`. Performance differences are usually negligible; focus on readability. A sketch follows.
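A sketch of the comprehension version (the function name here is an assumption):

```julia
# Total Poisson log likelihood via a list comprehension.
function poisson_loglik(y, λ::Real)
    return sum([logpdf(Poisson(λ), yi) for yi in y])
end

# Equivalent vectorized form: sum(logpdf.(Poisson(λ), y))
```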

## Multivariate example

Let’s say we don’t know for sure that \(\sigma = 1\). In that case our mathematical model for \(p(y | \mu, \sigma)\) is unchanged from the single-variable case.

Let’s write a function for the *log* likelihood
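For example (a sketch; the function name is an assumption):

```julia
# Log likelihood of i.i.d. Normal data as a function of both μ and σ.
function loglik(y::Vector{<:Real}, μ::Real, σ::Real)
    return sum(logpdf.(Normal(μ, σ), y))
end
```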

With two parameters, we need to plot a surface

- The syntax `z = [f(xi, yi) for xi in x, yi in y]` will produce a matrix with `z[i, j] = f(x[i], y[j])`.
- Due to a quirk of syntax, we need to transpose the matrix `lik_plot` to get the correct orientation. Here, `lik_plot'` is equivalent to `transpose(lik_plot)`.
- `st=:heatmap` tells Plots to draw a heatmap. We could also try `:surface` or `:contourf`. A sketch putting these pieces together follows.
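A sketch of the surface plot (the grid ranges are hypothetical; `y` is the data vector from before):

```julia
μ_try = range(0, 4; length=100)
σ_try = range(0.1, 3; length=100)

# lik_plot[i, j] holds the likelihood at (μ_try[i], σ_try[j])
lik_plot = [exp(loglik(y, μi, σi)) for μi in μ_try, σi in σ_try]

# transpose so rows correspond to σ (the vertical axis of the heatmap)
plot(μ_try, σ_try, lik_plot'; st=:heatmap, xlabel="μ", ylabel="σ")
```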

Notice that there is a very small region for which the likelihood is [relatively] high.

## Maximum likelihood estimation

Can we find the parameters \(\theta^*\) that maximize the likelihood \(p(y | \theta)\)?

We can use the log likelihood \(\log p(y | \theta)\) instead of the likelihood \(p(y | \theta)\).

The log likelihood is monotonic with the likelihood, so \[ \arg \max \log p(y | \theta) = \arg \max p(y | \theta) \]

Solving things analytically takes time up-front, but can be much faster to run because you can avoid the optimization step. Consider the (potentially multivariate) Gaussian example with known covariance matrix \(\Sigma\). We want to *maximize* the log likelihood \[
\sum_{i=1}^n \log p(y_i | \mu, \Sigma)
\]

To maximize, we set its derivative with respect to \(\mu\), which we’ll denote with \(\nabla_\mu\), to zero: \[ \sum_{i=1}^n \nabla_\mu \log p(y_i | \mu, \Sigma) = 0 \]

Substituting in the multivariate Gaussian likelihood we get: \[ \begin{aligned} 0 &= \sum_{i=1}^n \nabla_\mu \log \left[ \frac{1}{\sqrt{(2 \pi)^d |\Sigma|}} \exp \left( -\frac{1}{2} \left( y_i - \mu \right)^\top \Sigma^{-1} \left( y_i - \mu \right) \right) \right] \\ &= \sum_{i=1}^n \nabla_\mu \left[ \log \left( \frac{1}{\sqrt{(2 \pi)^d |\Sigma|}} \right) + \log \left( \exp \left( -\frac{1}{2} \left( y_i - \mu \right)^\top \Sigma^{-1} \left( y_i - \mu \right) \right) \right) \right] \\ &= \sum_{i=1}^n \nabla_\mu \left( -\frac{1}{2} \left( y_i - \mu \right)^\top \Sigma^{-1} \left( y_i - \mu \right) \right) \\ &= \sum_{i=1}^n \Sigma^{-1} \left( y_i - \mu \right) \end{aligned} \] Multiplying through by \(\Sigma\) (which is invertible) gives \(0 = \sum_{i=1}^n (y_i - \mu)\), and solving yields \[ \mu = \frac{1}{n} \sum_{i=1}^n y_i \]

**Note**

You are not expected to remember the above equations, and I won’t ask you to do this derivation in a time-constrained exam. You should understand the general procedure:

- write down the likelihood for one data point
- take the log to get the log likelihood for one data point
- sum over all data points to get the log likelihood for the full dataset
- take \(\frac{d}{d\theta}\) and set equal to zero to maximize
- solve for \(\theta^*\)

We can use the `optimize` function from the `Optim.jl` package to find the maximum likelihood estimate. First, we need to define the function to be optimized. `optimize` will minimize the function, so we need to define the *negative* log likelihood. We’ll call this the “loss” function.

- Note that this function takes in **a single argument**, which is a vector of parameters. We’ll call this vector `θ`, but it doesn’t matter what we call it. A sketch follows.
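A sketch of such a loss function (assuming the data vector `y` is already defined):

```julia
# Negative log likelihood; θ = [μ, σ] packed into a single vector.
function loss(θ)
    μ, σ = θ
    return -sum(logpdf.(Normal(μ, σ), y))  # negative because optimize() minimizes
end
```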

Now we can run the optimization. Since \(\sigma > 0\) always, we will pass along bounds. We could alternatively do something clever like work with \(\log \sigma\) instead of \(\sigma\).
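A sketch of the call (the choice of `Fminbox(LBFGS())` as the bounded optimizer is an assumption; the bounds and starting values follow the annotations below):

```julia
using Optim

lower = [-Inf, 1e-6]   # σ's true lower bound is 0; use a small positive number
upper = [Inf, Inf]
θ₀ = [1.0, 1.0]        # initial guess: μ = σ = 1

res = optimize(loss, lower, upper, θ₀, Fminbox(LBFGS()))  # run the optimization
θ̂ = Optim.minimizer(res)                                  # extract [μ̂, σ̂]
```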

```
2-element Vector{Float64}:
1.97000000014951
0.7590125165166877
```

- The lower bound on \(\sigma\) is actually zero, but we just set it to a “pretty small” number.
- The upper bound is infinity; we can pass in `Inf`.
- We need to pass in a guess for the parameters. We’ll just use \(\mu = \sigma = 1\).
- `optimize` will actually run the optimization.
- `Optim.minimizer` will extract the parameters that minimize the loss function.

We could convert this to a `Distributions` object:
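One way, as a sketch, is splatting the fitted parameter vector into the constructor:

```julia
fitted = Normal(θ̂...)  # θ̂ = [μ̂, σ̂] from the optimization above
```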

`Normal{Float64}(μ=1.97000000014951, σ=0.7590125165166877)`

## Regression example

Let’s consider the generic regression problem where we have paired observations \(\left\{x_i, y_i\right\}_{i=1}^n\). In general, we can write this regression as \[ y_i | \alpha, \beta, \sigma \sim \mathcal{N}(\alpha + x_i \beta, \sigma^2) \] where \(x_i\) and \(\beta\) may be vectors.

We can create some raw data.
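A sketch of how such data might be generated (all of the “true” parameter values here are hypothetical):

```julia
using Random, Distributions
Random.seed!(1)                            # for reproducibility

n = 50
x = rand(Uniform(0, 10), n)                # hypothetical predictor values
α_true, β_true, σ_true = -3.0, 2.4, 1.5    # hypothetical "true" parameters
y = α_true .+ β_true .* x .+ rand(Normal(0, σ_true), n)
```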

We can use the same approach to derive the maximum likelihood estimate for linear regression:

- Write the likelihood for one data point
- Write the log likelihood for one data point
- Write the log likelihood for all data points
- Take \(\frac{d}{d\theta}\) and set equal to zero to maximize

If you want a walkthrough, see Ryan Adams’s lecture notes starting at about equation 11.

As before, we need to write down a (log) likelihood function
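A sketch (the parameter vector packs intercept, slope, and standard deviation, matching the annotations below):

```julia
# Negative log likelihood for the regression model; θ = [α, β, σ].
function loss(θ)
    α, β, σ = θ
    μ̂ = α .+ β .* x                         # model mean at each xᵢ
    return -sum(logpdf.(Normal.(μ̂, σ), y))
end
```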

```
3-element Vector{Float64}:
-3.147
2.392
1.562
```

- This is the loss function, which is the negative log likelihood. The first value of \(\theta\) is the intercept, the second is the slope, and the third is the standard deviation.
- We need to pass in bounds for the parameters. The standard deviation is always positive, so we set its lower bound to a small number.
- We then extract the parameters that minimize the loss function and round to show three decimal places. A sketch of these steps follows.
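A sketch of the corresponding optimization call (starting values are assumptions):

```julia
lower = [-Inf, -Inf, 1e-6]   # only σ needs a (small positive) lower bound
upper = [Inf, Inf, Inf]
θ₀ = [0.0, 1.0, 1.0]

res = optimize(loss, lower, upper, θ₀, Fminbox(LBFGS()))
round.(Optim.minimizer(res); digits=3)   # [α̂, β̂, σ̂] to three decimals
```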

If we work through the math, we can show that the log likelihood for the linear regression problem is \[ \log p(y | X, \beta, \sigma) = -\frac{N}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \left( X \beta - y \right)^\top \left( X \beta - y \right) \]

**Linear algebra notation**

There is no intercept here! This is a common notation and assumes that the first column of \(X\) is all ones. That is equivalent to writing down an intercept, but lets us use linear algebra notation and keep track of fewer variable names.

From this, we can show that terms drop out and \[ \beta^\text{MLE} = \arg \min_\beta \left( X \beta - y \right)^\top \left( X \beta - y \right) \] which is exactly the least squares problem (minimize squared error): \[ \min_{\beta} \sum_{i=1}^n (y_i - y_i^\text{pred})^2 \]

**Key point**

“Least squares can be interpreted as assuming Gaussian noise, and particular choices of likelihood can be interpreted directly as (usually exponentiated) loss functions” –Adams

If we then want to estimate \(\sigma\), we can estimate the standard deviation of the residuals.
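For instance, assuming fitted values `α̂` and `β̂` are in hand from the optimization, a sketch:

```julia
using Statistics

residuals = y .- (α̂ .+ β̂ .* x)
σ̂ = std(residuals)   # estimate σ from the spread of the residuals
```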

## Wrapup

Many people get this backwards!

- The likelihood is the probability of the data given the parameters: \(p(y | \theta)\).
- We often plot the likelihood for many different \(\theta\): that is, \(p(y | \theta)\) as \(\theta\) varies with \(y\) held fixed.
- Don’t confuse this with the posterior, which is the probability of the parameters given the data: \(p(\theta | y)\).

- Friday:
  - Lab 03 in class: look for the GH Classroom link on Canvas
  - Lab 02 due
- Next week:
  - Bayesian inference