Generalized Linear Models


Lecture

2023-10-04

Practice problem: regression

You are given pairs of data \((x_i, y_i)\) where \(x_i\) is the number of vehicles on the road and \(y_i\) is the Air Quality Index (AQI). We model their relationship as \[ \begin{align} y_i &\sim N(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{align} \] where \(\alpha\) and \(\beta\) are parameters to be estimated.

  1. Write down the log likelihood for a single data point \(\log p(y_i | x_i)\)
  2. Write down the log likelihood for the entire dataset \(\log p(\mathbf{y} | \mathbf{x})\)

For reference, the Normal PDF is \[ f(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) \]
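
For reference, one way to write out the solution, assuming the observations are independent: substituting \(\mu_i = \alpha + \beta x_i\) into the Normal PDF and taking logs gives \[ \log p(y_i | x_i) = -\frac{1}{2} \log(2 \pi \sigma^2) - \frac{(y_i - \alpha - \beta x_i)^2}{2 \sigma^2} \] and summing over the \(n\) independent observations gives \[ \log p(\mathbf{y} | \mathbf{x}) = \sum_{i=1}^n \log p(y_i | x_i) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \]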

Generalized Linear Models

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Linear regression

We have recently seen models that look like \[ \begin{align} y_i &\sim N(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{align} \] Or (using another notation) \[ \begin{align} y_i &= \alpha + \beta x_i + \epsilon_i \\ \epsilon_i &\sim N(0, \sigma) \end{align} \]
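
This model maps directly onto a Turing.jl @model, in the same style used later in this lecture. A minimal sketch, assuming Turing.jl is loaded as elsewhere in this lecture; the priors here are illustrative placeholders, not part of the model statement above:

@model function linear_regression(y::AbstractVector, x::AbstractVector)
    # illustrative weakly-informative priors (an assumption for this sketch)
    α ~ Normal(0, 1)
    β ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1); lower=0)

    # likelihood: y_i ~ Normal(μ_i, σ) with μ_i = α + β * x[i]
    for i in eachindex(y)
        μ = α + β * x[i]
        y[i] ~ Normal(μ, σ)
    end
end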

Why linear?

\(y = ax + b\) is a strong assumption, not always physically justifiable, though often useful.

Another nice way to think about linear models is that they are Taylor series representations of functions. Michael Betancourt has an excellent and thorough case study.

Motivation: depth-damage

In lab 3, we studied the distribution of flood losses in a neighborhood.

  • What if we wanted to condition this distribution on variables describing flood characteristics and/or household risk management practices (as in Rözer et al., 2019)?
  • Regression lets us condition estimates on covariates
  • But we can’t use plain linear regression here: the linear predictor \(\alpha + \beta x_i\) is unbounded, while the response variable has support \((0, 1)\)

Today

Generalized Linear Models extend the concept of regression to other distributions – specifically when the conditional likelihood is not Normal.

Binomial regression

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Motivation

Consider a forest patch where we have recorded the occurrence of forest fires over several years. For each year, we have also noted the average summertime temperature. We want to investigate if there’s a relationship between the average summertime temperature and the likelihood of a forest fire occurring.

Data

Code
avg_temp = [
    20.5,
    21.3,
    22.7,
    23.4,
    21.8,
    24.1,
    20.9,
    22.5,
    23.8,
    24.6,
    21.1,
    22.3,
    23.5,
    24.0,
    22.8,
    23.9,
    21.4,
    20.7,
    23.2,
    22.9,
]
forest_fire_occurred = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
scatter(
    avg_temp,
    forest_fire_occurred;
    label=false,
    xlabel="Average summertime temperature (deg C)",
    ylabel="Forest fire occurred",
    legend=:topleft
)

Likelihood

For each data point, we can use a Bernoulli distribution to model the occurrence of a forest fire \[ y_i \sim \mathrm{Bernoulli}(p_i) \] where the Bernoulli PMF is \[ f(x | p) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}. \]

Data dependence

We have \(y_i \sim \mathrm{Bernoulli}(p_i)\). We want to model \(p_i\) as some function of the average summertime temperature \(x_i\): \[ f(p_i)= \alpha + \beta x_i \]

This looks a lot like the linear regression model from before, except:

  1. The likelihood is Bernoulli (a Binomial with a single trial) rather than Normal. No big deal – we’ve seen that we can use arbitrary probability distributions to model processes we’re interested in
  2. We have \(f(p_i)\) rather than just \(\mu_i\). Here \(f\) is called a link function. GLMs need a link function to map the linear predictor \(\alpha + \beta x_i \in (-\infty, \infty)\) onto the allowed space of the parameter; the canonical choice for this model is written out below.
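
For the Bernoulli probability \(p_i \in (0, 1)\), the canonical link is the logit, whose inverse is the logistic function: \[ f(p_i) = \mathrm{logit}(p_i) = \log \frac{p_i}{1 - p_i} = \alpha + \beta x_i \quad \Longleftrightarrow \quad p_i = \mathrm{logistic}(\alpha + \beta x_i) = \frac{1}{1 + e^{-(\alpha + \beta x_i)}}. \]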

Inference I

@model function logistic_regression(y::AbstractVector, x::AbstractVector)
    α ~ Normal(0, 1)
    β ~ Normal(0, 1)
    for i in eachindex(y)
        p = logistic(α + β * x[i])
        y[i] ~ Bernoulli(p)
    end
end

  1. The type annotations say that x and y have to be vectors, but we don’t care what kind of vector (e.g. Vector{Float64}, Vector{Int}, etc.)
  2. eachindex(y) is a more robust way to write for i in 1:length(y)
  3. logistic is the inverse of the logit link defined above.

Inference II

Code
logistic_chn = let
    model = logistic_regression(forest_fire_occurred, avg_temp)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    rng = Random.MersenneTwister(1041)
    sample(rng, model, sampler, nsamples; drop_warmup=true)
end
plot(logistic_chn)

  1. Random.MersenneTwister(1041) sets the random number generator seed so that we get the same results every time we run this code. This is useful for reproducibility, but you don’t need to do this in your own code!
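
Once we have posterior samples, we can turn them into quantities of interest. A minimal sketch of the posterior probability of a fire at a given temperature, assuming the chain stores the parameters under the names α and β as in the model above (the value 23 °C is just an example):

# extract posterior samples for each parameter (flattened across chains)
α_samples = vec(logistic_chn[:α])
β_samples = vec(logistic_chn[:β])

# posterior mean of Pr(fire) at an average summertime temperature of 23 °C
p_fire = mean(logistic.(α_samples .+ β_samples .* 23.0))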

Poisson regression

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Motivation

Imagine a national park where we’ve recorded the number of wildlife sightings over several months. For each month, we also have the average number of visitors. We want to investigate if there’s a relationship between the average number of visitors and the number of wildlife sightings.

Data

Code
avg_visitors = [
    50, 55, 52, 58, 60, 53, 57, 59, 54, 56, 51, 61, 62, 63, 64, 65, 66, 67, 68, 69
]
wildlife_sightings = [4, 2, 4, 4, 9, 10, 4, 4, 5, 6, 6, 8, 6, 12, 13, 8, 19, 15, 13, 10]
scatter(
    avg_visitors,
    wildlife_sightings;
    label=false,
    xlabel="Average number of visitors",
    ylabel="Wildlife sightings",
    legend=:topleft
)

Likelihood

For each data point, we can use a Poisson distribution to model the number of wildlife sightings: \[ y_i \sim \mathrm{Poisson}(\lambda_i) \] where the Poisson PMF is \[ f(k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}. \]

Data dependence

We have \(y_i \sim \mathrm{Poisson}(\lambda_i)\). We want to model \(\lambda_i\) as some function of the average number of visitors \(x_i\): \[ f(\lambda_i) = \alpha + \beta x_i \] We need \(\lambda_i > 0\) for the Poisson distribution. The canonical link function is \(\log\), as written out below.
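
That is, \[ \log \lambda_i = \alpha + \beta x_i \quad \Longleftrightarrow \quad \lambda_i = \exp(\alpha + \beta x_i) > 0, \] so any value of the linear predictor maps to a valid (positive) rate.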

Inference I

@model function poisson_regression(y::AbstractVector, x::AbstractVector)
    # priors
    α ~ Normal(0, 5)
    β ~ Normal(0, 5)

    # likelihood
    λ = @. exp(α + β * x)
    return y .~ Poisson.(λ)
end

  1. @. means all the operations to the right use dot syntax – this line is equivalent to λ = exp.(α .+ β .* x).

Inference II

Code
poiss_chn = let
    model = poisson_regression(wildlife_sightings, avg_visitors)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    rng = Random.MersenneTwister(1112)
    sample(rng, model, sampler, nsamples; drop_warmup=true)
end
plot(poiss_chn)
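
As with the logistic model, we can post-process the chain. A minimal sketch of the posterior mean sighting rate for a month with a given number of visitors, again assuming the parameters are stored as α and β (the visitor count 60 is just an example):

# extract posterior samples for each parameter (flattened across chains)
α_samples = vec(poiss_chn[:α])
β_samples = vec(poiss_chn[:β])

# posterior mean of the Poisson rate λ when avg_visitors = 60
λ_hat = mean(exp.(α_samples .+ β_samples .* 60.0))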