Bayesian statistics and computation


Lecture

2023-09-13

Last time

  1. Parametric uncertainty matters for decision-making
  2. As we collect more data, fewer combinations of parameters are consistent with observations

Logistics preview

  • No class Monday
  • Today: focus on the methods and ideas
  • Monday, asynchronously, work through the code

Prior information

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Rare disease

  1. Everyone is tested for CEVE543acitis, a rare and deadly disease
  2. It is known that 1 in 1,000 people have CEVE543acitis
  3. The test is 99% accurate
  4. Your test comes back positive. What is the probability that you have CEVE543acitis?

Bayes’ rule: discrete event version

\[ \Pr \left\{ \theta | y \right\} = \frac{\Pr \left\{ y | \theta \right\} \Pr \left\{ \theta \right\}}{\Pr \left\{ y \right\}} \]

Application: rare disease

Define \(y\) as getting a positive test result and \(\theta\) as having the underlying condition. Note that we do not observe \(\theta\) directly! Here \(y=1\) and we want to know \(\Pr\left\{\theta = 1 \mid y=1 \right\}\).

Likelihood:

|                | \(\Pr\left\{y = 1 \mid \theta \right\}\) | \(\Pr\left\{y = 0 \mid \theta \right\}\) |
|----------------|------------------------------------------|------------------------------------------|
| \(\theta = 1\) | 0.99                                     | 0.01                                     |
| \(\theta = 0\) | 0.01                                     | 0.99                                     |

A naive application of maximum likelihood: \(\Pr\left\{y=1 \mid \theta=1 \right\} > \Pr\left\{y=1 \mid \theta=0 \right\}\), so the best estimate is \(\theta=1\).

Solving

We are studying \(\Pr\left\{\theta = 1 | y = 1 \right\}\).

  1. Step 1: compute \(\Pr\left\{ y = 1 \right\}\) via the law of total probability
    1. \(\Pr\left\{y=1\right\} = \Pr \left\{ y=1, \theta=0 \right\} + \Pr \left\{ y=1, \theta=1 \right\}\)
    2. \(\Pr\left\{y=1\right\} = \Pr \left\{ \theta=0 \right\} \Pr \left\{ y = 1 | \theta=0 \right\} + \Pr \left\{ \theta=1 \right\} \Pr \left\{ y=1 | \theta=1 \right\}\)
    3. \(\underbrace{\Pr\left\{y=1\right\}}_\text{test +} = \underbrace{0.999}_\text{don't have it} \times \overbrace{0.01}^\text{false +} + \underbrace{0.001}_\text{have it} \times \overbrace{0.99}^\text{true +}\)
  2. Now plug in to Bayes’ rule
    1. \(\Pr\left\{ \theta=1 | y=1 \right\} = \frac{\Pr\left\{y=1 | \theta=1 \right\} \Pr\left\{ \theta=1 \right\}}{\Pr\left\{y=1\right\}}\)
    2. \(\Pr\left\{\theta=1 | y=1 \right\} = \frac{0.99 \times 0.001}{\Pr\left\{y=1\right\}}\)

Implementation

accuracy = 0.99
pr_disease = 0.001 # p(θ = 1)
pr_positive_test = accuracy * pr_disease + (1 - accuracy) * (1 - pr_disease) # p(y = 1)
pr_disease_given_test = accuracy * pr_disease / pr_positive_test # p(θ = 1 | y = 1)
display(round(pr_positive_test; digits=5))
display(round(pr_disease_given_test; digits=5))
0.01098
0.09016

Bayesian Inference

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Key idea

  1. Parameters have probability distributions, not single point values
  2. Start with some prior distribution for parameters
  3. Goal: what is the distribution of the parameters given the data?

Bayes’ rule for distributions

\[ p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)} \]

When drawing samples from a distribution, we only need the density up to a constant of proportionality, and since \(p(y)\) does not depend on \(\theta\), we can usually ignore it.

\[ \overbrace{p(\theta \mid y)}^\text{posterior} \propto \underbrace{p(y \mid \theta)}_\text{likelihood} \overbrace{p(\theta)}^\text{prior} \]

Coin flipping

We flip a coin a few times. We want to estimate the probability of heads so that we can make well-calibrated bets on future coin tosses.

coin_flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]
heads = [flip == "H" for flip in coin_flips]
N = length(coin_flips)
n_heads = sum(heads)
8

Maximum likelihood

The maximum likelihood estimate (MLE) is the value of \(\theta\) that maximizes the likelihood of the observed data. As before, we work with the log-likelihood.

flip_log_like(θ) = sum(logpdf.(Bernoulli(θ), heads))
loss(θ) = -flip_log_like(θ)
θ_mle = optimize(loss, 0, 1).minimizer
plot(flip_log_like, 0.1, 1; label=L"$p(y | \theta)$")
vline!([θ_mle]; label="MLE", linewidth=3)

This builds on what we did last time: a coin flip is represented by a Bernoulli process. (In fact, we could use a Binomial distribution to model the number of heads in \(N\) flips.)

The maximum likelihood estimate can in fact be shown to be exactly n_heads / length(coin_flips).
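This claim follows from a short derivation. The Bernoulli log-likelihood for \(k\) heads in \(N\) flips is

\[
\ell(\theta) = \sum_{i=1}^{N} \log p(y_i \mid \theta) = k \log \theta + (N - k) \log(1 - \theta),
\]

and setting its derivative to zero gives

\[
\ell'(\theta) = \frac{k}{\theta} - \frac{N-k}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_\text{MLE} = \frac{k}{N} = \frac{8}{9} \approx 0.889 .
\]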

Prior

We should be suspicious of our analysis when it concludes that we will continue to see 8 out of 9 flips coming up heads forever.

To perform a Bayesian analysis, we’ll need a prior. A Beta distribution is a natural choice for a prior on a probability, although we could use a Uniform distribution or even something silly like a truncated Gamma (don’t!)

prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3)

Closed-form solution

Cool property: if you have a Beta prior and a Binomial likelihood, the posterior is also Beta distributed. Look up Beta-Binomial conjugacy for more! We will leverage this property to check our answers.

closed_form = Beta(prior_dist.α + n_heads, prior_dist.β + N - n_heads)
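As a numerical sanity check (a sketch in Python for illustration, since the lecture code is in Julia; the grid resolution is an arbitrary choice), we can evaluate the unnormalized posterior \(p(y \mid \theta)\, p(\theta)\) on a grid, normalize it, and compare against the closed-form Beta posterior:

```python
import math

# data: 8 heads in 9 flips; prior: Beta(5, 5)
N, k = 9, 8
alpha, beta_ = 5.0, 5.0

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_B)

# unnormalized posterior on a grid: likelihood * prior (we drop p(y))
n_grid = 10_000
thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
unnorm = [t**k * (1 - t) ** (N - k) * beta_pdf(t, alpha, beta_) for t in thetas]

# normalize so the grid integrates to 1 (midpoint rule with spacing 1/n_grid)
Z = sum(unnorm) / n_grid
posterior = [u / Z for u in unnorm]

# compare to the closed-form Beta(alpha + k, beta + N - k) posterior
closed = [beta_pdf(t, alpha + k, beta_ + N - k) for t in thetas]
max_err = max(abs(p - c) for p, c in zip(posterior, closed))
print(max_err)  # should be tiny
```

The agreement also illustrates the earlier proportionality point: we never needed \(p(y)\); normalizing over the grid plays its role.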

Sampling

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Markov Chain Monte Carlo

  1. A class of methods for sampling from a probability distribution
  2. Random walkers:
    1. Start at some value
    2. Propose a new value
    3. Accept or reject the new value based on some criteria
  3. Repeat to get a “chain” of samples

Metropolis-Hastings

See the very good Wikipedia article

log_posterior(θ) = logpdf(prior_dist, θ) + flip_log_like(θ)

θ_samples = []
θ_sample = 0.5 # initial guess
proposal_dist(θ) = Uniform(0, 1) # independence proposal: ignores the current value θ

while length(θ_samples) < 10_000
    proposal = rand(proposal_dist(θ_sample)) # propose a new value
    p_accept = min(exp(log_posterior(proposal) - log_posterior(θ_sample)), 1)
    if rand() < p_accept
        θ_sample = proposal
    end
    push!(θ_samples, θ_sample)
end
histogram(
    θ_samples;
    normalize=:pdf,
    label="Samples",
    legend=:topleft,
    xlabel=L"$θ$",
    ylabel=L"$p(θ | y)$"
)
plot!(closed_form; label="Exact Posterior", linewidth=3)

Limitations

  1. Works great for a very simple problem
  2. Computation blows up in higher dimensions (p_accept gets very small)
  3. Have to code a new sampler for each problem
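The blow-up in higher dimensions can be seen in a toy experiment (a sketch in Python for illustration; the standard-normal target, step size, and chain length are arbitrary choices): a random-walk Metropolis sampler with a fixed step size accepts far fewer proposals as the dimension grows.

```python
import math
import random

def rw_metropolis_accept_rate(dim, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis targeting a standard normal in `dim` dimensions;
    returns the fraction of proposals accepted."""
    rng = random.Random(seed)
    x = [0.0] * dim

    def logp(v):  # log-density up to a constant
        return -0.5 * sum(vi * vi for vi in v)

    accepted = 0
    for _ in range(n_steps):
        prop = [xi + rng.gauss(0.0, step) for xi in x]
        # symmetric proposal, so the acceptance ratio is just the density ratio
        if rng.random() < math.exp(min(0.0, logp(prop) - logp(x))):
            x = prop
            accepted += 1
    return accepted / n_steps

print(rw_metropolis_accept_rate(1))   # healthy acceptance rate in 1D
print(rw_metropolis_accept_rate(50))  # collapses in 50 dimensions
```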

Modern samplers leverage gradients and clever tricks to draw better samples for harder problems. Let’s use them!

Turing model specification

We can write down the full Bayesian model in Turing, which uses a syntax very close to our notation!

@model function coinflip(y)

    # to define θ as a random variable, we use ~
    # anything that's not an input (data) is treated as a parameter to be estimated!
    θ ~ prior_dist

    # the data generating process
    return y .~ Bernoulli(θ)
end

Turing sampling

We can leverage sophisticated machinery for drawing samples from arbitrary posterior distributions. For now, we will trust that it is drawing samples from \(p(\theta \mid y)\) and not worry about the details.

coin_chain = let # variables defined in a let...end block are temporary
    model = coinflip(heads)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
summarystats(coin_chain)
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64

           θ    0.6832    0.1021    0.0016   4234.2533   5796.1544    1.0003

(1 column omitted)

Visualize

We can visualize our posterior

histogram(
    coin_chain[:θ];
    label="Samples",
    normalize=:pdf,
    legend=:topleft,
    xlabel=L"$θ$",
    ylabel=L"$p(θ | y)$"
)
plot!(closed_form; label="Exact Posterior", linewidth=3)
plot!(prior_dist; label="Prior", linewidth=3)
vline!([θ_mle]; label="MLE", linewidth=3)

Compromise

The posterior is a compromise between the prior and the likelihood.

  • Bad priors lead to bad inferences
  • The choice of prior is subjective, which some people hate:
    • We will approach this in a principled manner (Gelman et al., 2020; Gelman & Shalizi, 2013)
    • Lots of other steps are also subjective (choice of likelihood model, which data to use, problem framing, etc.)
    • A false sense of objectivity is dangerous anyway!
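For the Beta-Binomial model above, this compromise is explicit: the posterior mean is a weighted average of the prior mean and the MLE, with weights set by the prior "sample size" \(\alpha + \beta\) and the data sample size \(N\):

\[
\mathbb{E}[\theta \mid y] = \frac{\alpha + k}{\alpha + \beta + N}
= \frac{\alpha + \beta}{\alpha + \beta + N}\,\underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}}
+ \frac{N}{\alpha + \beta + N}\,\underbrace{\frac{k}{N}}_{\text{MLE}}
\]

With the coin-flip data (\(\alpha = \beta = 5\), \(k = 8\), \(N = 9\)) this gives \(13/19 \approx 0.684\), between the prior mean \(0.5\) and the MLE \(8/9 \approx 0.889\), and consistent with the sampled posterior mean.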

Example: storm surge distribution

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Read data

Code
annmax = CSV.read("data/8638610-annmax.csv", DataFrame)
annmax.lsl .*= u"ft"
annmax.lsl_ft = ustrip.(u"ft", annmax.lsl)
p1 = plot(
    annmax.year,
    annmax.lsl;
    xlabel="Year",
    ylabel="Ann. Max. Water Level",
    label=false,
    marker=:circle
)
p2 = histogram(
    annmax.lsl;
    normalize=:pdf,
    orientation=:horizontal,
    label=false,
    xlabel=false,
    ylabel="",
    yticks=[],
    bins=2:0.3:8,
    xlims=(0, 0.8)
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(2, 8), suptitle="Sewell's Point, VA")

Model

Define a LogNormal distribution with very diffuse (flat) priors

@model function lognormal_flatpriors(y)
    # define the parameters
    # and assign prior
    μ ~ Normal(0, 10) # Extremely diffuse prior
    σ ~ truncated(Normal(0, 10), 0, Inf) # σ must be positive

    # data generating process
    return y .~ LogNormal(μ, σ)
end

Sample

ln_flat_chn = let
    model = lognormal_flatpriors(annmax.lsl_ft)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 20_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
summarystats(ln_flat_chn)

Posterior

We leverage the histogram2d function to visualize the 2D posterior distribution.

Code
post1_scatter = histogram2d(
    ln_flat_chn[:μ],
    ln_flat_chn[:σ];
    label=false,
    xlabel=L"\mu",
    ylabel=L"\sigma",
    title="Diffuse Priors",
    normalize=:pdf,
    clims=(0, 1000),
    bins=100
)

Return period with uncertainty

Each draw from the posterior represents a plausible value of \(\mu\) and \(\sigma\). We can use these to explore the distribution of return periods.

Code
# assumes `rts` (return periods), `aeps` (the corresponding quantile levels), and
# `plt_rt` (the base return-period plot) were defined with the earlier plotting code (not shown here)
for idx in 1:500
    μ = ln_flat_chn[:μ][idx]
    σ = ln_flat_chn[:σ][idx]
    rt = quantile.(LogNormal(μ, σ), aeps)
    label = idx == 1 ? "Posterior" : false
    plot!(plt_rt, rts, rt; color=:black, alpha=0.05, label=label, linewidth=0.5)
end
plt_rt