Generalized Linear Models


Lecture

2023-10-04

Practice problem: regression

You are given pairs of data \((x_i, y_i)\) where \(x_i\) is the number of vehicles on the road and \(y_i\) is the Air Quality Index (AQI). We model their relationship as \[ \begin{align} y_i &\sim N(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{align} \] where \(\alpha\) and \(\beta\) are parameters to be estimated.

  1. Write down the log likelihood for a single data point \(\log p(y_i | x_i)\)
  2. Write down the log likelihood for the entire dataset \(\log p(\mathbf{y} | \mathbf{x})\)

For reference, the Normal PDF is \[ f(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) \]
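If you want to check your work, the pieces combine as follows (a sketch, assuming independent observations): \[ \log p(y_i \mid x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \] and \[ \log p(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^n \log p(y_i \mid x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \]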

Generalized Linear Models

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Linear regression

We have recently seen models that look like \[ \begin{align} y_i &\sim N(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \end{align} \] or, in an equivalent notation, \[ \begin{align} y_i &= \alpha + \beta x_i + \epsilon_i \\ \epsilon_i &\sim N(0, \sigma) \end{align} \]

Why linear?

\(y = ax + b\) is a strong assumption, not always physically justifiable, though often useful.

Another nice way to think about linear models is that they are first-order Taylor series approximations of more general functions. Michael Betancourt has an excellent and thorough case study.

Motivation: depth-damage

In lab 3, we studied the distribution of flood losses in a neighborhood.

  • What if we wanted to condition this distribution on variables describing flood characteristics and/or household risk management practices (as in Rözer et al., 2019)?
  • Regression lets us condition estimates on covariates
  • But we can’t use linear regression here, because the response variable has support \((0, 1)\)

Today

Generalized Linear Models extend the concept of regression to other distributions – specifically when the conditional likelihood is not Normal.

Binomial regression

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Motivation

Consider a forest patch where we have recorded the occurrence of forest fires over several years. For each year, we have also noted the average summertime temperature. We want to investigate if there’s a relationship between the average summertime temperature and the likelihood of a forest fire occurring.

Data

Code
avg_temp = [
    20.5,
    21.3,
    22.7,
    23.4,
    21.8,
    24.1,
    20.9,
    22.5,
    23.8,
    24.6,
    21.1,
    22.3,
    23.5,
    24.0,
    22.8,
    23.9,
    21.4,
    20.7,
    23.2,
    22.9,
]
forest_fire_occurred = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
scatter(
    avg_temp,
    forest_fire_occurred;
    label=false,
    xlabel="Average summertime temperature (deg C)",
    ylabel="Forest fire occurred",
    legend=:topleft
)

Likelihood

For each data point, we can use a Bernoulli distribution to model the occurrence of a forest fire \[ y_i \sim \mathrm{Bernoulli}(p_i) \] where the Bernoulli PMF is \[ f(x | p) = p^x (1 - p)^{1 - x} \]
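Taking logs gives \[ \log f(x \mid p) = x \log p + (1 - x) \log(1 - p), \] which is the form we sum over data points when computing the log likelihood.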

Data dependence

We have \(y_i \sim \mathrm{Bernoulli}(p_i)\). We want to model \(p_i\) as some function of the average summertime temperature \(x_i\): \[ f(p_i)= \alpha + \beta x_i \]

This looks a lot like the linear regression model from before, except:

  1. The likelihood is Bernoulli (the special case of the Binomial with a single trial) rather than Normal. No big deal – we’ve seen that we can use arbitrary probability distributions to model processes we’re interested in
  2. We have \(f(p_i)\) rather than just \(\mu_i\). Here \(f\) is called a link function: it maps the allowed space of the parameter (here \(p_i \in (0, 1)\)) onto the linear predictor space \(\alpha + \beta x_i \in (-\infty, \infty)\), and its inverse maps back. For the Bernoulli parameter, the canonical link is the logit, whose inverse is the logistic function (see the sketch after this list).
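As a concrete sketch in plain Julia (no packages assumed; StatsFuns.jl exports an equivalent logistic), the logit link and its inverse:

logit(p) = log(p / (1 - p))      # maps (0, 1) onto (-Inf, Inf)
logistic(x) = 1 / (1 + exp(-x))  # maps (-Inf, Inf) back onto (0, 1)

logistic(logit(0.9)) ≈ 0.9  # true: the two functions are inverses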

Inference I

@model function logistic_regression(y::AbstractVector, x::AbstractVector)  # (1)
    α ~ Normal(0, 1)
    β ~ Normal(0, 1)
    for i in eachindex(y)  # (2)
        p = logistic(α + β * x[i])  # (3)
        y[i] ~ Bernoulli(p)
    end
end

(1) Here we are saying that x and y have to be vectors, but we don’t care what kind of vector (e.g. Vector{Float64}, Vector{Int}, etc.)
(2) This is a more robust way to write for i in 1:length(y)
(3) logistic, the inverse of the logit link, is defined above.

Inference II

Code
logistic_chn = let
    model = logistic_regression(forest_fire_occurred, avg_temp)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    rng = Random.MersenneTwister(1041)  # (1)
    sample(rng, model, sampler, nsamples; drop_warmup=true)
end
plot(logistic_chn)

(1) This sets the random number generator seed so that we get the same results every time we run this code. This is useful for reproducibility, but you don’t need to do this in your own code!
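We can also look at what the fitted model implies. A quick sketch (not from the lecture) that averages the fire probability over posterior samples, using the logistic function defined earlier:

α_post = vec(logistic_chn[:α])
β_post = vec(logistic_chn[:β])
temps = range(20, 25; length=50)
p_fire = [mean(logistic.(α_post .+ β_post .* t)) for t in temps]
plot(temps, p_fire; xlabel="Average summertime temperature (deg C)", ylabel="Probability of forest fire", label="Posterior mean")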

Poisson regression

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Motivation

Imagine a national park where we’ve recorded the number of wildlife sightings over several months. For each month, we also have the average number of visitors. We want to investigate if there’s a relationship between the average number of visitors and the number of wildlife sightings.

Data

Code
avg_visitors = [
    50, 55, 52, 58, 60, 53, 57, 59, 54, 56, 51, 61, 62, 63, 64, 65, 66, 67, 68, 69
]
wildlife_sightings = [4, 2, 4, 4, 9, 10, 4, 4, 5, 6, 6, 8, 6, 12, 13, 8, 19, 15, 13, 10]
scatter(
    avg_visitors,
    wildlife_sightings;
    label=false,
    xlabel="Average number of visitors",
    ylabel="Wildlife sightings",
    legend=:topleft
)

Likelihood

For each data point, we can use a Poisson distribution to model the number of wildlife sightings: \[ y_i \sim \mathrm{Poisson}(\lambda_i) \] where the Poisson PMF is \[ f(k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}. \]
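Taking logs gives \[ \log f(k \mid \lambda) = k \log \lambda - \lambda - \log k!, \] which again is what we sum over data points.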

Data dependence

We have \(y_i \sim \mathrm{Poisson}(\lambda_i)\). We want to model \(\lambda_i\) as some function of the average number of visitors \(x_i\): \[ f(\lambda_i) = \alpha + \beta x_i \] We need \(\lambda_i > 0\) for the Poisson distribution. The canonical link function is \(\log\), so that \(\lambda_i = \exp(\alpha + \beta x_i)\) is positive for any values of \(\alpha\) and \(\beta\).

Inference I

@model function poisson_regression(y::AbstractVector, x::AbstractVector)
    # priors
    α ~ Normal(0, 5)
    β ~ Normal(0, 5)

    # likelihood
    λ = @. exp(α + β * x)  # (1)
    return y .~ Poisson.(λ)
end

(1) @. means all the operations to the right use dot syntax – this is equivalent to exp.(α .+ β .* x).

Inference II

Code
poiss_chn = let
    model = poisson_regression(wildlife_sightings, avg_visitors)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    rng = Random.MersenneTwister(1112)
    sample(rng, model, sampler, nsamples; drop_warmup=true)
end
plot(poiss_chn)

Posterior check

To our scatter plot, we can add the posterior predictive distribution, which we will visualize as percentiles.

  • Recall: we want \(\mathbb{E}[f(\lambda_i)]\)
  • Here: \(f\) will be some percentile of the Poisson distribution with parameter \(\lambda_i\)
  • For each value we want to predict at, we need to compute this percentile for each posterior sample and then average
    • Why can’t we average the values of \(\lambda\) and then compute the percentile? (The percentile is a nonlinear function of \(\lambda\), so the order of operations matters – see the toy sketch below.)
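A toy sketch of this point, with two made-up values of \(\lambda\) standing in for posterior draws:

λ_draws = [2.0, 20.0]  # two imaginary posterior draws of λ
mean(quantile.(Poisson.(λ_draws), 0.95))  # average of the two 95th percentiles
quantile(Poisson(mean(λ_draws)), 0.95)    # 95th percentile at the average λ
# the two results generally differ, because the quantile is nonlinear in λ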

Posterior check II

Code
# define the values of "avg number of visitors" we want to predict at
x_pred = range(40, 75; length=100)

# posterior distributions
α = poiss_chn[:α]
β = poiss_chn[:β]

# get a Matrix of rates λ indexed [MCMC sample index, x index],
# then the corresponding Poisson distributions
λ = hcat([exp.(α .+ β .* xi) for xi in x_pred]...)
dists = Poisson.(λ)

# select the percentiles to plot
percentiles = [5, 25, 50, 75, 95]

# create the base plot
p = plot(;
    xlabel="Average number of visitors", ylabel="Wildlife sightings", legend=:topleft
)

# add percentile lines
for pct in percentiles
    line = vec(mean(quantile.(dists, pct / 100); dims=1))
    plot!(p, x_pred, line; label="$pct%", color=:gray, linestyle=:dash)
end

# add the observations on top
scatter!(p, avg_visitors, wildlife_sightings; label="Observed", color=:blue)

Wrapup

Today

  1. Generalized Linear Models

  2. Binomial regression

  3. Poisson regression

  4. Wrapup

Other common models

  • Negative Binomial regression: \(y_i \sim \textrm{NegBin}(\mu_i, r)\) with \(\log(\mu_i) = \alpha + \beta x_i\) (a sketch follows this list)
  • “Robust” regression: \(y_i = \alpha + \beta x_i + \epsilon_i\) with \(\epsilon_i \sim t_\nu(0, \sigma)\), where \(\nu\) is the degrees of freedom (\(\nu = 4\) is common)
  • Lots more
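A minimal sketch of the Negative Binomial model in the same Turing.jl style as above. Distributions.jl parameterizes NegativeBinomial by \((r, p)\), so we convert from the \((\mu_i, r)\) form; the priors here are illustrative, not from the lecture:

@model function negbin_regression(y::AbstractVector, x::AbstractVector)
    α ~ Normal(0, 5)
    β ~ Normal(0, 5)
    r ~ Exponential(10)  # dispersion; an illustrative prior choice
    for i in eachindex(y)
        μ = exp(α + β * x[i])  # log link keeps the mean positive
        p = r / (r + μ)        # convert (μ, r) to Distributions.jl's (r, p)
        y[i] ~ NegativeBinomial(r, p)
    end
end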

Simple interfaces

  • TuringGLM
    • still a work in progress
    • convert a formula @formula(y ~ x1 + x2 + x3) into a Turing model
  • Inspired by
    • BRMS in R, using Stan backend
    • bambi in Python, using PyMC3 backend

These tools can be useful for fast model building, but when you write final versions of your results you should make sure you know what priors, data re-scaling, etc. you have used!

Summary

Working with non-Gaussian likelihoods is pretty straightforward:

  1. Need a “link function”
  2. Can use our same software tools to get MLE or Bayesian estimates

McElreath (2020) offers useful workflow suggestions:

  1. Use sensitivity analysis
  2. Do prior predictive checks (see chapter 11 for an example with Poisson regression; a sketch follows this list)
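A minimal sketch of a prior predictive check for the Poisson model above: draw \((\alpha, \beta)\) from the priors and inspect the implied rate \(\lambda\) at a typical predictor value (x_typical = 60 is illustrative):

rng = Random.MersenneTwister(1)
x_typical = 60  # a typical visitor count from our data
prior_λ = [exp(rand(rng, Normal(0, 5)) + rand(rng, Normal(0, 5)) * x_typical) for _ in 1:1000]
extrema(prior_λ)  # absurdly small and large rates show these priors are very wide on the outcome scale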

Logistics

  • Lab 5 due next Monday
  • Project 1 to be posted soon (builds on lab 5)
  • Exam 1 revisions due Friday in class

Read more

Some optional reading:

  1. Chapters 10 and 11 of McElreath (2020) (in particular section 10.2)
    • This book spends a lot of time talking about Maximum Entropy likelihoods, which we won’t worry about much.
  2. Chapter 16 of Gelman et al. (2014)
    • Because this comes towards the end of the book, it builds on a fair bit of content we haven’t yet seen

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian Data Analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.
McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). Boca Raton, FL: CRC Press, Taylor & Francis Group.
Rözer, V., Kreibich, H., Schröter, K., Müller, M., Sairam, N., Doss-Gollin, J., et al. (2019). Probabilistic models significantly reduce uncertainty in Hurricane Harvey pluvial flood loss estimates. Earth’s Future, 7(4). https://doi.org/10.1029/2018ef001074