Bayesian statistics and computation


Lecture

2023-09-13

Last time

  1. Parametric uncertainty matters for decision-making
  2. As we collect more data, fewer combinations of parameters are consistent with observations

Logistics preview

  • No class Monday
  • Today: focus on the methods and ideas
  • Monday, asynchronously, work through the code

Prior information

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Rare disease

  1. Everyone is tested for CEVE543acitis, a rare and deadly disease
  2. It is known that 1 in 1,000 people have CEVE543acitis
  3. The test is 99% accurate
  4. Your test comes back positive. What is the probability that you have CEVE543acitis?

Bayes’ rule: discrete event version

\[ \Pr \left\{ \theta | y \right\} = \frac{\Pr \left\{ y | \theta \right\} \Pr \left\{ \theta \right\}}{\Pr \left\{ y \right\}} \]

Application: rare disease

Define \(y\) as getting a positive test result and \(\theta\) as having the underlying condition. Note that we do not observe \(\theta\) directly! Here \(y=1\) and we want to know \(\Pr\left\{\theta = 1 \mid y=1 \right\}\).

Likelihood:

|                | \(\Pr\left\{y = 1 \mid \theta \right\}\) | \(\Pr\left\{y = 0 \mid \theta \right\}\) |
|----------------|------------------------------------------|------------------------------------------|
| \(\theta = 1\) | 0.99                                     | 0.01                                     |
| \(\theta = 0\) | 0.01                                     | 0.99                                     |

A naive application of maximum likelihood: \(\Pr\left\{y=1 \mid \theta=1 \right\} > \Pr\left\{y=1 \mid \theta=0 \right\}\), so the best estimate is \(\theta=1\).

Solving

We are studying \(\Pr\left\{\theta = 1 | y = 1 \right\}\).

  1. Step 1: compute \(\Pr\left\{ y = 1 \right\}\) via the law of total probability
    1. \(\Pr\left\{y=1\right\} = \Pr \left\{ y=1, \theta=0 \right\} + \Pr \left\{ y=1, \theta=1 \right\}\)
    2. \(\Pr\left\{y=1\right\} = \Pr \left\{ \theta=0 \right\} \Pr \left\{ y = 1 | \theta=0 \right\} + \Pr \left\{ \theta=1 \right\} \Pr \left\{ y=1 | \theta=1 \right\}\)
    3. \(\underbrace{\Pr\left\{y=1\right\}}_\text{test +} = \underbrace{0.999}_\text{don't have it} \times \overbrace{0.01}^\text{false +} + \underbrace{0.001}_\text{have it} \times \overbrace{0.99}^\text{true +}\)
  2. Now plug in to Bayes’ rule
    1. \(\Pr\left\{ \theta=1 | y=1 \right\} = \frac{\Pr\left\{y=1 | \theta=1 \right\} \Pr\left\{ \theta=1 \right\}}{\Pr\left\{y=1\right\}}\)
    2. \(\Pr\left\{\theta=1 | y=1 \right\} = \frac{0.99 \times 0.001}{\Pr\left\{y=1\right\}}\)

Implementation

accuracy = 0.99
pr_disease = 0.001 # p(θ = 1)
pr_positive_test = accuracy * pr_disease + (1 - accuracy) * (1 - pr_disease) # p(y = 1)
pr_disease_given_test = accuracy * pr_disease / pr_positive_test # p(θ = 1 | y = 1)
display(round(pr_positive_test; digits=5))
display(round(pr_disease_given_test; digits=5))
0.01098
0.09016

Bayesian Inference

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Key idea

  1. Parameters have probability distributions, not single point values
  2. Start with some prior distribution for parameters
  3. Goal: what is the distribution of the parameters given the data?

Bayes’ rule for distributions

\[ p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)} \]

When drawing samples from a distribution, we only need the density up to a constant of proportionality, and since \(p(y)\) does not depend on \(\theta\), we can usually ignore it.

\[ \overbrace{p(\theta \mid y)}^\text{posterior} \propto \underbrace{p(y \mid \theta)}_\text{likelihood} \overbrace{p(\theta)}^\text{prior} \]

Coin flipping

We flip a coin a few times. We want to estimate the probability of heads so that we can make well-calibrated bets on future coin tosses.

coin_flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]
heads = [flip == "H" for flip in coin_flips]
N = length(coin_flips)
n_heads = sum(heads)
8

Maximum likelihood

The maximum likelihood estimate (MLE) is the value of \(\theta\) that maximizes the likelihood of the observed data. As before, we work with the log-likelihood.

flip_log_like(θ) = sum(logpdf.(Bernoulli(θ), heads))
loss(θ) = -flip_log_like(θ)
θ_mle = optimize(loss, 0, 1).minimizer
plot(flip_log_like, 0.1, 1; label=L"$p(y | \theta)$")
vline!([θ_mle]; label="MLE", linewidth=3)

This builds on what we did last time: a coin flip is represented by a Bernoulli process. (In fact, we could use a Binomial distribution to model the number of heads in \(N\) flips.)

The maximum likelihood estimate can in fact be shown to be exactly n_heads / length(coin_flips).
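This claim follows from a short derivation. The Bernoulli log-likelihood for \(k\) heads in \(N\) flips is

\[
\ell(\theta) = \sum_{i=1}^{N} \log p(y_i \mid \theta) = k \log \theta + (N - k) \log(1 - \theta),
\]

and setting its derivative to zero gives

\[
\ell'(\theta) = \frac{k}{\theta} - \frac{N-k}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_\text{MLE} = \frac{k}{N} = \frac{8}{9} \approx 0.889 .
\]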

Prior

We should be suspicious of our analysis when it concludes that we will continue to see 8 out of 9 flips coming up heads forever.

To perform a Bayesian analysis, we’ll need a prior. A Beta distribution is a natural choice for a prior on a probability, although we could use a Uniform distribution or even something silly like a truncated Gamma (don’t!)

prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3)

Closed-form solution

Cool property: if you have a Beta prior and a Binomial likelihood, the posterior is also Beta distributed. Look up Beta-Binomial conjugacy for more! We will leverage this property to check our answers.

closed_form = Beta(prior_dist.α + n_heads, prior_dist.β + N - n_heads)
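As a numerical sanity check (a sketch in Python for illustration, since the lecture code is in Julia; the grid resolution is an arbitrary choice), we can evaluate the unnormalized posterior \(p(y \mid \theta)\, p(\theta)\) on a grid, normalize it, and compare against the closed-form Beta posterior:

```python
import math

# data: 8 heads in 9 flips; prior: Beta(5, 5)
N, k = 9, 8
alpha, beta_ = 5.0, 5.0

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_B)

# unnormalized posterior on a grid: likelihood * prior (we drop p(y))
n_grid = 10_000
thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
unnorm = [t**k * (1 - t) ** (N - k) * beta_pdf(t, alpha, beta_) for t in thetas]

# normalize so the grid integrates to 1 (midpoint rule with spacing 1/n_grid)
Z = sum(unnorm) / n_grid
posterior = [u / Z for u in unnorm]

# compare to the closed-form Beta(alpha + k, beta + N - k) posterior
closed = [beta_pdf(t, alpha + k, beta_ + N - k) for t in thetas]
max_err = max(abs(p - c) for p, c in zip(posterior, closed))
print(max_err)  # should be tiny
```

The agreement also illustrates the earlier proportionality point: we never needed \(p(y)\); normalizing over the grid plays its role.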

Sampling

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Markov Chain Monte Carlo

  1. A class of methods for sampling from a probability distribution
  2. Random walkers:
    1. Start at some value
    2. Propose a new value
    3. Accept or reject the new value based on some criteria
  3. Repeat to get a “chain” of samples

Metropolis-Hastings

See the very good Wikipedia article

log_posterior(θ) = logpdf(prior_dist, θ) + flip_log_like(θ)

θ_samples = []
θ_sample = 0.5 # initial guess
proposal_dist(θ) = Uniform(0, 1) # independence proposal: ignores the current value θ

while length(θ_samples) < 10_000
    proposal = rand(proposal_dist(θ_sample)) # propose a new value
    p_accept = min(exp(log_posterior(proposal) - log_posterior(θ_sample)), 1)
    if rand() < p_accept
        θ_sample = proposal
    end
    push!(θ_samples, θ_sample)
end
histogram(
    θ_samples;
    normalize=:pdf,
    label="Samples",
    legend=:topleft,
    xlabel=L"$θ$",
    ylabel=L"$p(θ | y)$"
)
plot!(closed_form; label="Exact Posterior", linewidth=3)

Limitations

  1. Works great for a very simple problem
  2. Computation blows up in higher dimensions (p_accept gets very small)
  3. Have to code a new sampler for each problem
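The blow-up in higher dimensions can be seen in a toy experiment (a sketch in Python for illustration; the standard-normal target, step size, and chain length are arbitrary choices): a random-walk Metropolis sampler with a fixed step size accepts far fewer proposals as the dimension grows.

```python
import math
import random

def rw_metropolis_accept_rate(dim, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis targeting a standard normal in `dim` dimensions;
    returns the fraction of proposals accepted."""
    rng = random.Random(seed)
    x = [0.0] * dim

    def logp(v):  # log-density up to a constant
        return -0.5 * sum(vi * vi for vi in v)

    accepted = 0
    for _ in range(n_steps):
        prop = [xi + rng.gauss(0.0, step) for xi in x]
        # symmetric proposal, so the acceptance ratio is just the density ratio
        if rng.random() < math.exp(min(0.0, logp(prop) - logp(x))):
            x = prop
            accepted += 1
    return accepted / n_steps

print(rw_metropolis_accept_rate(1))   # healthy acceptance rate in 1D
print(rw_metropolis_accept_rate(50))  # collapses in 50 dimensions
```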

Modern samplers leverage gradients and clever tricks to draw better samples for harder problems. Let’s use them!

Turing model specification

We can write down the full Bayesian model in Turing, which uses a syntax very close to our notation!

@model function coinflip(y)

    # to define θ as a random variable, we use ~
    # anything that's not an input (data) is treated as a parameter to be estimated!
    θ ~ prior_dist

    # the data generating process
    return y .~ Bernoulli(θ)
end

Turing sampling

We can leverage sophisticated machinery for drawing samples from arbitrary posterior distributions. For now, we will trust that it is drawing samples from \(p(\theta \mid y)\) and not worry about the details.

coin_chain = let # variables defined in a let...end block are temporary
    model = coinflip(heads)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 10_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
summarystats(coin_chain)
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64

           θ    0.6832    0.1021    0.0016   4234.2533   5796.1544    1.0003

(1 column omitted)

Visualize

We can visualize our posterior

histogram(
    coin_chain[:θ];
    label="Samples",
    normalize=:pdf,
    legend=:topleft,
    xlabel=L"$θ$",
    ylabel=L"$p(θ | y)$"
)
plot!(closed_form; label="Exact Posterior", linewidth=3)
plot!(prior_dist; label="Prior", linewidth=3)
vline!([θ_mle]; label="MLE", linewidth=3)

Compromise

The posterior is a compromise between the prior and the likelihood.

  • Bad priors lead to bad inferences
  • The choice of prior is subjective, which some people hate:
    • We will approach this in a principled manner (Gelman et al., 2020; Gelman & Shalizi, 2013)
    • Lots of other steps are also subjective (choice of likelihood model, which data to use, problem framing, etc.)
    • A false sense of objectivity is dangerous anyway!
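For the Beta-Binomial model above, this compromise is explicit: the posterior mean is a weighted average of the prior mean and the MLE, with weights set by the prior "sample size" \(\alpha + \beta\) and the data sample size \(N\):

\[
\mathbb{E}[\theta \mid y] = \frac{\alpha + k}{\alpha + \beta + N}
= \frac{\alpha + \beta}{\alpha + \beta + N}\,\underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}}
+ \frac{N}{\alpha + \beta + N}\,\underbrace{\frac{k}{N}}_{\text{MLE}}
\]

With the coin-flip data (\(\alpha = \beta = 5\), \(k = 8\), \(N = 9\)) this gives \(13/19 \approx 0.684\), between the prior mean \(0.5\) and the MLE \(8/9 \approx 0.889\), and consistent with the sampled posterior mean.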

Example: storm surge distribution

Today

  1. Prior information

  2. Bayesian Inference

  3. Sampling

  4. Example: storm surge distribution

  5. Alternative priors

  6. Wrapup

Read data

Code
annmax = CSV.read("data/8638610-annmax.csv", DataFrame)
annmax.lsl .*= u"ft"
annmax.lsl_ft = ustrip.(u"ft", annmax.lsl)
p1 = plot(
    annmax.year,
    annmax.lsl;
    xlabel="Year",
    ylabel="Ann. Max. Water Level",
    label=false,
    marker=:circle
)
p2 = histogram(
    annmax.lsl;
    normalize=:pdf,
    orientation=:horizontal,
    label=false,
    xlabel=false,
    ylabel="",
    yticks=[],
    bins=2:0.3:8,
    xlims=(0, 0.8)
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(2, 8), suptitle="Sewell's Point, VA")

Model

Define a LogNormal distribution with very diffuse (flat) priors

@model function lognormal_flatpriors(y)
    # define the parameters
    # and assign prior
    μ ~ Normal(0, 10) # Extremely diffuse prior
    σ ~ truncated(Normal(0, 10), 0, Inf) # σ must be positive

    # data generating process
    return y .~ LogNormal(μ, σ)
end

Sample

ln_flat_chn = let
    model = lognormal_flatpriors(annmax.lsl_ft)
    sampler = externalsampler(DynamicHMC.NUTS())
    nsamples = 20_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
summarystats(ln_flat_chn)

Posterior

We leverage the histogram2d function to visualize the 2D posterior distribution.

Code
post1_scatter = histogram2d(
    ln_flat_chn[:μ],
    ln_flat_chn[:σ];
    label=false,
    xlabel=L"\mu",
    ylabel=L"\sigma",
    title="Diffuse Priors",
    normalize=:pdf,
    clims=(0, 1000),
    bins=100
)

Return period with uncertainty

Each draw from the posterior represents a plausible value of \(\mu\) and \(\sigma\). We can use these to explore the distribution of return periods.

Code
# assumes `rts` (return periods), `aeps` (the corresponding quantile levels), and
# `plt_rt` (the base return-period plot) were defined with the earlier plotting code (not shown here)
for idx in 1:500
    μ = ln_flat_chn[:μ][idx]
    σ = ln_flat_chn[:σ][idx]
    rt = quantile.(LogNormal(μ, σ), aeps)
    label = idx == 1 ? "Posterior" : false
    plot!(plt_rt, rts, rt; color=:black, alpha=0.05, label=label, linewidth=0.5)
end
plt_rt