Motivation
We have to make choices about which distribution to use, which covariates (if any) to use for nonstationarity, which (if any) parameters to model as nonstationary, how to pool information across space, etc. There is no single “right” answer; how can we proceed in a principled way?
This is a general problem in statistics beyond extreme value analysis.
Credit
This content borrows heavily from a literature review that I developed with Vivek Srikrishnan.
For a more accessible discussion, see Chapter 7 of McElreath (2020). For a more technical discussion, see Piironen & Vehtari (2017).
The challenge
We want to make probabilistic predictions about unobserved data \(\tilde{y}\). This is hard because Earth systems are:
- high-dimensional
- multi-scale
- nonlinear / complex
To approximate the true system, we come up with a model space \(\mathcal{M}\) defining a family of candidate models, then use them to make predictions.
Note
This content gets fairly technical. You will not be tested on the equations, but you should understand the key points (which we will recap).
How similar are two distributions?
\[
D_\text{KL} (P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left[ \frac{P(x)}{Q(x)} \right]
\]
One interpretation of \(D_\text{KL} (P \parallel Q)\) is the information gained by revising one’s beliefs from the prior distribution \(Q\) to the posterior distribution \(P\). Another interpretation is the amount of information lost when \(Q\) is used to approximate \(P\). Note that for continuous random variables the sum above becomes an integral.
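As a quick illustration, here is a minimal sketch of the discrete KL divergence in Python. The distributions `p` and `q` are made-up examples, not taken from the lecture.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), assuming all entries > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])  # "posterior" P
q = np.array([0.4, 0.4, 0.2])  # "prior" / approximating Q
print(kl_divergence(p, q))     # >= 0, and 0 only when P == Q
```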
Measures of predictive accuracy
The predictive performance of a model is defined in terms of a utility function \(u(M, \tilde{y})\). A common choice is the log predictive density: \[
\log p(\tilde{y} | D, M).
\] Future observations \(\tilde{y}\) are unknown, so we must work in expectation: \[
\overline{u}(M) = \mathbb{E}\left[ \log p(\tilde{y} | D, M) \right] = \int p_t(\tilde{y}) \log p(\tilde{y} | D, M) \, d\tilde{y},
\] where \(p_t(\tilde{y})\) is the true data-generating distribution (unknown!)
Maximizing \(\overline{u}(M)\) is equivalent to minimizing the KL divergence from the candidate model \(p(\tilde{y} | D, M)\) to the true data-generating distribution \(p_t(\tilde{y})\).
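If we *did* know the true distribution, we could estimate this expectation by Monte Carlo. The sketch below assumes toy normal distributions for both \(p_t\) and the candidate predictive distribution, purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y_tilde = stats.norm(0.0, 1.0).rvs(size=100_000, random_state=rng)  # draws from p_t

candidate = stats.norm(0.2, 1.2)          # stand-in for p(y~ | D, M)
u_bar = candidate.logpdf(y_tilde).mean()  # ≈ E_{p_t}[ log p(y~ | D, M) ]
print(u_bar)
```

A candidate closer to \(p_t\) would give a larger `u_bar` (equivalently, a smaller KL divergence).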
In practice we work with estimates
We don’t know the true data-generating distribution \(p_t(\tilde{y})\), so we approximate the expected utility using the observed data. The log pointwise predictive density is \[
\begin{align}
\text{lppd} &= \log \prod_{i=1}^N p_\text{post}(y_i) = \sum_{i=1}^N \log \int p(y_i | \theta) p_\text{post} (\theta) d \theta \\
&\approx \sum_{i=1}^N \log \left[ \frac{1}{S} \sum_{s=1}^S p(y_i | \theta^s) \right]
\end{align}
\] where we have approximated the integral using \(S\) draws \(\theta^s\) from the posterior (e.g., using MCMC).
The LPPD of the observed data \(y\) is an overestimate of the expected LPPD for future data. Thus, the criteria below start from this approximation and then apply some correction for that bias.
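Here is a minimal sketch of the LPPD approximation from posterior draws. The Gaussian likelihood, the fake data, and the fake posterior draws are illustrative assumptions only.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

y = np.array([1.2, 0.4, -0.3, 2.1, 0.8])           # observed data (made up)
theta_draws = np.random.default_rng(0).normal(0.5, 0.2, size=1000)  # S posterior draws

# log p(y_i | theta^s) for each observation i and draw s -> shape (N, S)
logp = stats.norm.logpdf(y[:, None], loc=theta_draws[None, :], scale=1.0)

# lppd ≈ sum_i log[ (1/S) sum_s p(y_i | theta^s) ], computed stably in log space
lppd = np.sum(logsumexp(logp, axis=1) - np.log(theta_draws.size))
print(lppd)
```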
Model combination
We could treat “which model to use” as a parameter. If we have an exhaustive list of candidate models \(\{ M_\ell \}_{\ell=1}^L\), then the distribution over the model space is given by \[
p(M | D) \propto p(D | M) p(M)
\] and we can average over them \[
p(\tilde{y} | D) = \sum_{\ell=1}^L p(\tilde{y}|D, M_\ell) p(M_\ell | D)
\]
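A minimal sketch of this averaging, assuming we already have the per-model predictive densities and posterior model probabilities (the numbers below are placeholders):

```python
import numpy as np
from scipy import stats

y_grid = np.linspace(-4, 4, 201)

# p(y~ | D, M_l): predictive densities under each candidate model (assumed here)
pred_m1 = stats.norm(0.0, 1.0).pdf(y_grid)
pred_m2 = stats.norm(0.5, 1.5).pdf(y_grid)

# p(M_l | D): posterior model probabilities (assumed here)
post_prob = np.array([0.7, 0.3])

# p(y~ | D) = sum_l p(y~ | D, M_l) p(M_l | D)
pred_bma = post_prob[0] * pred_m1 + post_prob[1] * pred_m2
```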
AIC Criterion
If the model estimates \(k\) parameters, and if they are assumed asymptotically normal (i.e., a normal linear model with known variance and uniform prior), then fitting \(k\) parameters inflates the log predictive density of the observed data by approximately \(k\) through chance alone: \[
\hat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{mle}) - k
\] Thus we can define \[
\text{AIC} = 2 k - 2 \ln \hat{\mathcal{L}}
\] and select the model that minimizes it. For complicated models, what is \(k\)? There are formulas to approximate the effective number of parameters. Note that AIC assumes residuals are independent given \(\hat{\theta}\).
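A minimal sketch of AIC-based comparison, here between a Gumbel (\(k=2\)) and a GEV (\(k=3\)) fit to synthetic “annual maxima”; the data and parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = stats.genextreme(c=-0.1, loc=10, scale=2).rvs(size=50, random_state=rng)

def aic(loglik, k):
    """AIC = 2k - 2 * log-likelihood at the MLE."""
    return 2 * k - 2 * loglik

gumbel_params = stats.gumbel_r.fit(y)     # MLE: (loc, scale)
gev_params = stats.genextreme.fit(y)      # MLE: (shape, loc, scale)

aic_gumbel = aic(stats.gumbel_r.logpdf(y, *gumbel_params).sum(), k=2)
aic_gev = aic(stats.genextreme.logpdf(y, *gev_params).sum(), k=3)
print(aic_gumbel, aic_gev)  # smaller is preferred
```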
DIC Criterion
- Start with AIC
- Replace \(\hat{\theta}_\text{mle}\) by posterior mean \(\hat{\theta}_\text{Bayes} = \mathbb{E}[\theta | y]\)
- Replace \(k\) by a data-based bias correction; there are different forms
\[
\hat{\text{elpd}}_\text{DIC} = \log p(y | \hat{\theta}_\text{Bayes}) - p_\text{DIC}
\] where \(p_\text{DIC}\) is derived from assumptions about the effective number of parameters. The quantity \[
\text{DIC} = -2 \log p(y | \hat{\theta}_\text{Bayes}) + 2 p_\text{DIC}
\] can be assigned to each model, and the model with the lowest DIC chosen. Note that DIC assumes residuals are independent given \(\hat{\theta}\)
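A minimal sketch using one common form of the bias correction, \(p_\text{DIC} = 2 \left[ \log p(y | \hat{\theta}_\text{Bayes}) - \frac{1}{S} \sum_s \log p(y | \theta^s) \right]\); the Gaussian likelihood and fake posterior draws are illustrative assumptions.

```python
import numpy as np
from scipy import stats

y = np.array([1.2, 0.4, -0.3, 2.1, 0.8])                       # observed data (made up)
theta_draws = np.random.default_rng(0).normal(0.5, 0.2, 1000)  # posterior draws of the mean

def log_lik(theta):
    return stats.norm.logpdf(y, loc=theta, scale=1.0).sum()

theta_bayes = theta_draws.mean()                       # posterior mean
log_lik_at_mean = log_lik(theta_bayes)
mean_log_lik = np.mean([log_lik(t) for t in theta_draws])

p_dic = 2 * (log_lik_at_mean - mean_log_lik)           # effective number of parameters
dic = -2 * log_lik_at_mean + 2 * p_dic
print(p_dic, dic)
```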
Significance criteria
Use Null Hypothesis Significance Testing (NHST) to decide whether to include a variable. For example, should we add a trend term in our regression?
- Form a null hypothesis: \(\beta = 0\)
- Test statistics \(\Rightarrow\) \(p\)-value
- If \(p < \alpha\), use the model with the trend term (\(M_2\)); otherwise use the simpler model (\(M_1\))
Note that
- This is equivalent to model comparison with a Bayes factor.
- Still assumes existence of a true model (hence the many problems with NHST)
This is widely used in practice, often without justification.
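A minimal sketch of this workflow for the trend-term example above; the synthetic time series and the choice \(\alpha = 0.05\) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1950, 2020)
y = 10 + 0.02 * (years - years[0]) + rng.normal(0, 1, size=years.size)

result = stats.linregress(years, y)   # tests H0: beta = 0 for the slope
alpha = 0.05
use_trend = result.pvalue < alpha     # if True, use M2 (with trend); else M1
print(result.pvalue, use_trend)
```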
Key points: NO MAGIC HERE
- You cannot look at a single criterion and decide whether a model is good or not; beware those who do!
- Model comparison and selection is subjective 🤷♂️
- Make your assumptions transparent so others can follow and critique them, rather than pretending to be objective (Doss-Gollin & Keller, 2023)
- Subjective doesn’t mean arbitrary or “in the dark”. We know stuff!