Motivation
We have to make choices about which distribution to use, which covariates (if any) to use for nonstationarity, which (if any) parameters to model as nonstationary, how to pool information across space, etc. There is no single “right” answer; how can we proceed in a principled way?
This is a general problem in statistics beyond extreme value analysis.
Credit
This content borrows heavily from a literature review that I developed with Vivek Srikrishnan.
For a more accessible discussion, see Chapter 7 of McElreath (2020). For a more technical discussion, see Piironen & Vehtari (2017).
The challenge
We want to make probabilistic predictions about unobserved data \(\tilde{y}\). This is hard because Earth systems are:
- high-dimensional
- multi-scale
- nonlinear / complex
To approximate the true system, we come up with a model space \(\mathcal{M}\) defining a family of candidate models, then use them to make predictions.
Note
This content gets fairly technical. You will not be tested on the equations, but you should understand the key points (which we will recap).
How similar are two distributions?
\[
D_\text{KL} (P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left[ \frac{P(x)}{Q(x)} \right]
\]
One interpretation of \(D_\text{KL} (P \parallel Q)\) is the information gained by revising one’s beliefs from the prior distribution \(Q\) to the posterior distribution \(P\). Another interpretation is the amount of information lost when \(Q\) is used to approximate \(P\). Note that for continuous random variables the sum above becomes an integral.
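As a quick illustration, here is a minimal sketch of the discrete KL divergence in Python. The distributions `p` and `q` are made-up examples, not taken from the lecture.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), assuming all entries > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])  # "posterior" P
q = np.array([0.4, 0.4, 0.2])  # "prior" / approximating Q
print(kl_divergence(p, q))     # >= 0, and 0 only when P == Q
```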
Measures of predictive accuracy
The predictive performance of a model is defined in terms of a utility function \(u(M, \tilde{y})\). A common choice is the log predictive density: \[
\log p(\tilde{y} | D, M).
\] Future observations \(\tilde{y}\) are unknown, so we must work in expectation: \[
\overline{u}(M) = \mathbb{E}\left[ \log p(\tilde{y} | D, M) \right] = \int p_t(\tilde{y}) \log p(\tilde{y} | D, M) \, d\tilde{y},
\] where \(p_t(\tilde{y})\) is the true data-generating distribution (unknown!)
Maximizing \(\overline{u}(M)\) is equivalent to minimizing the KL divergence from the candidate model \(p(\tilde{y} | D, M)\) to the true data-generating distribution \(p_t(\tilde{y})\).
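If we *did* know the true distribution, we could estimate this expectation by Monte Carlo. The sketch below assumes toy normal distributions for both \(p_t\) and the candidate predictive distribution, purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y_tilde = stats.norm(0.0, 1.0).rvs(size=100_000, random_state=rng)  # draws from p_t

candidate = stats.norm(0.2, 1.2)          # stand-in for p(y~ | D, M)
u_bar = candidate.logpdf(y_tilde).mean()  # ≈ E_{p_t}[ log p(y~ | D, M) ]
print(u_bar)
```

A candidate closer to \(p_t\) would give a larger `u_bar` (equivalently, a smaller KL divergence).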
In practice we work with estimates
We don’t know the true data-generating distribution \(p_t(\tilde{y})\), so we approximate the expected utility using the observed data. The log pointwise predictive density is \[
\begin{align}
\text{lppd} &= \log \prod_{i=1}^N p_\text{post}(y_i) = \sum_{i=1}^N \log \int p(y_i | \theta) p_\text{post} (\theta) d \theta \\
&\approx \sum_{i=1}^N \log \left[ \frac{1}{S} \sum_{s=1}^S p(y_i | \theta^s) \right]
\end{align}
\] where we have approximated the integral using \(S\) draws \(\theta^s\) from the posterior (e.g., using MCMC).
The LPPD of the observed data \(y\) is an overestimate of the expected LPPD for future data. Thus, the criteria below start from this approximation and then apply some correction for that bias.
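Here is a minimal sketch of the LPPD approximation from posterior draws. The Gaussian likelihood, the fake data, and the fake posterior draws are illustrative assumptions only.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

y = np.array([1.2, 0.4, -0.3, 2.1, 0.8])           # observed data (made up)
theta_draws = np.random.default_rng(0).normal(0.5, 0.2, size=1000)  # S posterior draws

# log p(y_i | theta^s) for each observation i and draw s -> shape (N, S)
logp = stats.norm.logpdf(y[:, None], loc=theta_draws[None, :], scale=1.0)

# lppd ≈ sum_i log[ (1/S) sum_s p(y_i | theta^s) ], computed stably in log space
lppd = np.sum(logsumexp(logp, axis=1) - np.log(theta_draws.size))
print(lppd)
```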
Model combination
We could treat “which model to use” as a parameter. If we have an exhaustive list of candidate models \(\{ M_\ell \}_{\ell=1}^L\), then the distribution over the model space is given by \[
p(M | D) \propto p(D | M) p(M)
\] and we can average over them \[
p(\tilde{y} | D) = \sum_{\ell=1}^L p(\tilde{y}|D, M_\ell) p(M_\ell | D)
\]
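A minimal sketch of this averaging, assuming we already have the per-model predictive densities and posterior model probabilities (the numbers below are placeholders):

```python
import numpy as np
from scipy import stats

y_grid = np.linspace(-4, 4, 201)

# p(y~ | D, M_l): predictive densities under each candidate model (assumed here)
pred_m1 = stats.norm(0.0, 1.0).pdf(y_grid)
pred_m2 = stats.norm(0.5, 1.5).pdf(y_grid)

# p(M_l | D): posterior model probabilities (assumed here)
post_prob = np.array([0.7, 0.3])

# p(y~ | D) = sum_l p(y~ | D, M_l) p(M_l | D)
pred_bma = post_prob[0] * pred_m1 + post_prob[1] * pred_m2
```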
AIC Criterion
If the model estimates \(k\) parameters, and if they are assumed asymptotically normal (i.e., a normal linear model with known variance and uniform prior), then fitting \(k\) parameters inflates the log predictive density of the observed data by approximately \(k\) through chance alone: \[
\hat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{mle}) - k
\] Thus we can define \[
\text{AIC} = 2 k - 2 \ln \hat{\mathcal{L}}
\] and select the model that minimizes it. For complicated models, what is \(k\)? There are formulas to approximate the effective number of parameters. Note that AIC assumes residuals are independent given \(\hat{\theta}\).
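A minimal sketch of AIC-based comparison, here between a Gumbel (\(k=2\)) and a GEV (\(k=3\)) fit to synthetic “annual maxima”; the data and parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = stats.genextreme(c=-0.1, loc=10, scale=2).rvs(size=50, random_state=rng)

def aic(loglik, k):
    """AIC = 2k - 2 * log-likelihood at the MLE."""
    return 2 * k - 2 * loglik

gumbel_params = stats.gumbel_r.fit(y)     # MLE: (loc, scale)
gev_params = stats.genextreme.fit(y)      # MLE: (shape, loc, scale)

aic_gumbel = aic(stats.gumbel_r.logpdf(y, *gumbel_params).sum(), k=2)
aic_gev = aic(stats.genextreme.logpdf(y, *gev_params).sum(), k=3)
print(aic_gumbel, aic_gev)  # smaller is preferred
```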
DIC Criterion
- Start with AIC
- Replace \(\hat{\theta}_\text{mle}\) by posterior mean \(\hat{\theta}_\text{Bayes} = \mathbb{E}[\theta | y]\)
- Replace \(k\) by a data-based bias correction; there are different forms
\[
\hat{\text{elpd}}_\text{DIC} = \log p(y | \hat{\theta}_\text{Bayes}) - p_\text{DIC}
\] where \(p_\text{DIC}\) is derived from assumptions about the effective number of parameters. The quantity \[
\text{DIC} = -2 \log p(y | \hat{\theta}_\text{Bayes}) + 2 p_\text{DIC}
\] can be assigned to each model, and the model with the lowest DIC chosen. Note that DIC assumes residuals are independent given \(\hat{\theta}\)
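A minimal sketch using one common form of the bias correction, \(p_\text{DIC} = 2 \left[ \log p(y | \hat{\theta}_\text{Bayes}) - \frac{1}{S} \sum_s \log p(y | \theta^s) \right]\); the Gaussian likelihood and fake posterior draws are illustrative assumptions.

```python
import numpy as np
from scipy import stats

y = np.array([1.2, 0.4, -0.3, 2.1, 0.8])                       # observed data (made up)
theta_draws = np.random.default_rng(0).normal(0.5, 0.2, 1000)  # posterior draws of the mean

def log_lik(theta):
    return stats.norm.logpdf(y, loc=theta, scale=1.0).sum()

theta_bayes = theta_draws.mean()                       # posterior mean
log_lik_at_mean = log_lik(theta_bayes)
mean_log_lik = np.mean([log_lik(t) for t in theta_draws])

p_dic = 2 * (log_lik_at_mean - mean_log_lik)           # effective number of parameters
dic = -2 * log_lik_at_mean + 2 * p_dic
print(p_dic, dic)
```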
Significance criteria
Use Null Hypothesis Significance Testing (NHST) to decide whether to include a variable. For example, should we add a trend term in our regression?
- Form a null hypothesis: \(\beta = 0\)
- Test statistics \(\Rightarrow\) \(p\)-value
- If \(p < \alpha\), use the model with the trend term (\(M_2\)); otherwise use the simpler model (\(M_1\))
Note that
- This is equivalent to model comparison with a Bayes factor.
- Still assumes existence of a true model (hence the many problems with NHST)
This is widely used in practice, often without justification.
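A minimal sketch of this workflow for the trend-term example above; the synthetic time series and the choice \(\alpha = 0.05\) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1950, 2020)
y = 10 + 0.02 * (years - years[0]) + rng.normal(0, 1, size=years.size)

result = stats.linregress(years, y)   # tests H0: beta = 0 for the slope
alpha = 0.05
use_trend = result.pvalue < alpha     # if True, use M2 (with trend); else M1
print(result.pvalue, use_trend)
```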
Key points: NO MAGIC HERE
- You cannot look at a single criterion and decide whether a model is good or not; beware those who do!
- Model comparison and selection is subjective 🤷♂️
- Make your assumptions transparent so others can follow and critique them, rather than pretending to be objective (Doss-Gollin & Keller, 2023)
- Subjective doesn’t mean arbitrary or “in the dark”. We know stuff!