# Lab 04: Regression with Bayesian Methods

## 1 Problem statement
Use linear regression to relate the mean annual runoff of several streams (`runoff`) to nine other variables:

- the precipitation falling at the gage (`precip`),
- the drainage area of the basin (`area`),
- the average slope of the basin (`slope`),
- the length of the drainage basin (`length`),
- the perimeter of the basin (`perim`),
- the diameter of the largest circle which could be inscribed within the drainage basin (`diameter`),
- the shape factor of the basin (`shape_factor`),
- the stream frequency, i.e., the ratio of the number of streams in the basin to the basin area (`stream_freq`), and
- the relief ratio for the basin (`relief_ratio`).
This is a light modification of Example 11.3 from Helsel, D. R., Hirsch, R. M., Ryberg, K. R., Archfield, S. A., & Gilroy, E. J. (2020). *Statistical methods in water resources*. Techniques and Methods. U.S. Geological Survey. https://doi.org/10.3133/tm4A3
## 2 Setup

First, we load the packages that we will use:
```julia
using CSV
using DataFrames
using Distributions
using DynamicHMC
using LaTeXStrings
using LinearAlgebra: I
using Plots
using StatsPlots
using Turing
```
### 2.1 Data

Next, we need to load in the data.
```julia
runoff_df = CSV.read("data/runoff.csv", DataFrame)
display(size(runoff_df))
first(runoff_df, 3)
```
```
(13, 10)
```

| Row | runoff | precip | area | slope | length | perim | diameter | shape_factor | stream_freq | relief_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Int64 |
| 1 | 17.38 | 44.37 | 2.21 | 50 | 2.38 | 7.93 | 0.91 | 0.38 | 1.36 | 332 |
| 2 | 14.62 | 44.09 | 2.53 | 7 | 2.55 | 7.65 | 1.23 | 0.48 | 2.37 | 55 |
| 3 | 15.48 | 41.25 | 5.63 | 19 | 3.11 | 11.61 | 2.11 | 0.57 | 2.31 | 77 |
## 3 Model explanation and exploration

### 3.1 Model specification

We are going to fit models of the form \[ y_i \sim \mathcal{N} \left( \beta X_i, \sigma \right). \] We only have 13 observations of 10 variables, so we'll need to be careful with our data. Before diving into the model, let's look at the data.
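Before plotting anything, a quick numerical summary is often useful. One way to get one (this particular call is a suggestion on our part, not part of the original lab code) is with `describe` from DataFrames.jl:

```julia
# Per-column summary: mean, min, median, max, number missing, and element type
describe(runoff_df)
```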
### 3.2 Pairwise correlations

One simple way to explore the data is to look at the pairwise correlations between the variables. The following function, `plot_correlation`, will plot a heatmap of the pairwise correlations.
```julia
function plot_correlation(df::AbstractDataFrame)
    varnames = names(df)
    correlations = cor(Matrix(df))
    P = size(correlations, 1)

    p = heatmap(
        varnames, # x labels
        varnames, # y labels
        correlations; # variable
        aspect_ratio=1, # 1:1 aspect ratio
        xrotation=30, # rotate x ticks
        size=(600, 600), # specify the size of the plot
        title="Pairwise Correlations",
        clims=(-1, 1),
        cmap=:PuOr, # select a diverging color map
    )

    # add the exact text
    ann = [
        (i - 0.5, j - 0.5, text(round(correlations[i, j]; digits=2), 8, :white, :center))
        for i in 1:P for j in 1:P
    ]
    annotate!(p, ann; linecolor=:white)
    return p
end
```

1. `cor` expects a `Matrix` input and will throw an error if we give it a `DataFrame`; `Matrix(df)` just extracts the numeric values from our `DataFrame`. This won't work well if we have non-numeric columns, but we don't have any here.
```julia
plot_correlation(runoff_df)
```
Describe what you learn from this plot in the specific context of modeling the relationship between `runoff` and the other variables. What can you not learn from this plot?
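If you want to read the `runoff` row of the heatmap numerically, a small helper like the following sketch ranks the predictors by the magnitude of their correlation with `runoff` (hypothetical convenience code, not part of the lab):

```julia
# runoff is the first column, so column 1 of the correlation matrix
r = cor(Matrix(runoff_df))[:, 1]

# pair each variable name with its correlation, sorted by absolute value
sort(collect(zip(names(runoff_df), r)); by=x -> -abs(x[2]))
```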
## 4 Linear regression
We can now fit a linear regression model to a single variable. You can pick whichever one you like; describe why you picked it.

We need priors on the intercept, slope, and standard deviation. Making some plausible choices, our full model is:
\[
\begin{align}
\alpha &\sim \mathcal{N}(0, 10) \\
\beta &\sim \mathcal{N}(0, 10) \\
\sigma &\sim \mathcal{N}^+(0, 10) \\
\mu &= \alpha + \beta X \\
y &\sim \mathcal{N}(\mu, \sigma)
\end{align}
\]
where \(\mathcal{N}^+\) is the half-normal distribution, i.e., a Normal distribution truncated at zero.
Update the above model to reflect the priors you end up choosing. You are welcome to use other distributions or parameter values. You will need to modify the LaTeX math, but only very minor edits are required.
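As a hint for implementing \(\mathcal{N}^+\): in Distributions.jl, one way to build a half-normal is to truncate a Normal at zero. A minimal sketch (the variable name `σ_prior` is ours, just for illustration):

```julia
# Half-normal prior: a Normal(0, 10) truncated below at zero
σ_prior = truncated(Normal(0, 10); lower=0)
rand(σ_prior, 5) # all draws are nonnegative
```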
### 4.1 Define model

We can implement this in Turing as follows:
```julia
#| output: false
@model function linear_reg(X::AbstractVector, y::AbstractVector)
    # First, define all parameters.
    # They need priors, so we use a ~
    α ~ ... # the intercept
    β ~ ... # the slope
    σ ~ ... # the standard deviation

    # We can calculate intermediate quantities in our function.
    # Turing knows this isn't a parameter because we use = not ~
    μ = α .+ β .* X

    # likelihood model
    y ~ ...
end
```
::: {.callout-important}
## Task

Implement the model above so that it is consistent with the mathematical model written above. You will iterate between this and the next step a few times. You only need to keep your final model, but briefly describe what you did and why.
:::
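If you get stuck, one possible completion using the \(\mathcal{N}(0, 10)\) priors from the math above is sketched below. Treat it as a starting point rather than the answer; the name `linear_reg_example` is ours, not the lab's:

```julia
@model function linear_reg_example(X::AbstractVector, y::AbstractVector)
    α ~ Normal(0, 10) # the intercept
    β ~ Normal(0, 10) # the slope
    σ ~ truncated(Normal(0, 10); lower=0) # half-normal standard deviation
    μ = α .+ β .* X # deterministic mean
    y .~ Normal.(μ, σ) # likelihood: one Normal per observation
end
```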
### 4.2 Prior predictive check

Before we fit the model, we run a prior predictive check. This lets us check whether the prior assumptions we have encoded into the model are consistent with our beliefs. First, we sample from the prior:
```julia
y = vec(runoff_df[!, :runoff])
X = vec(runoff_df[!, :YOUR_VAR_HERE]) # which variable did you choose?
ppc_linear = let
    model = linear_reg(X, y) # call the model
    sampler = Prior() # we want to sample from the prior
    nsamples = 10_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
```
We can use the `lm_predict` function to generate predictions from the prior.
```julia
function lm_predict(chn::Turing.MCMCChains.Chains, X::AbstractVecOrMat, N::Int)
    p = get_params(chn)
    α = vec(p[:α])
    β = if isa(p[:β], Tuple)
        β_matrix = hcat(vec.(p[:β])...)
        [β_matrix[i, :] for i in 1:size(β_matrix, 1)]
    else
        vec(p[:β])
    end
    idx = rand(1:length(α), N)
    ŷ = [
        α[i] .+ X * β[i] for i in idx
    ]
    if N == 1
        ŷ = ŷ[1]
    end
    return ŷ
end
```

1. This function extracts the parameters from the chain.
2. If we use multiple chains, then `p[:α]` is a Matrix. We convert it to a vector so that it's "flat".
3. If we have multiple predictors (multivariate linear regression), then `p[:β]` is a tuple of vectors. We convert it into a vector of vectors so that `β[i]` is a vector of coefficients for the \(i\)th sample.
4. We randomly select `N` samples from the posterior.
5. We calculate the predictions for each of the `N` samples.
6. If `N` is 1, then we return a vector, not a vector of vectors.
We leverage this function to generate predictions from the prior and plot them.
```julia
X_pred = ... # specify the range of values for X
p1 = plot(xlabel="Precip", ylabel="Runoff", title="Prior predictive check")
for i in 1:250
    y_pred = lm_predict(ppc_linear, X_pred, 1)
    plot!(p1, X_pred, y_pred; color=:black, alpha=0.1, label=false)
end
p1
```
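One plausible way to fill in `X_pred` (an assumption on our part; any range that covers the values of interest works) is to span the observed range of your chosen predictor:

```julia
# 50 evenly spaced points across the observed range of X
X_pred = collect(range(minimum(X), maximum(X); length=50))
```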
Fill in the missing lines in the above blocks and run.
### 4.3 Inference

Once we have a prior that we're happy with, we can use our model to generate samples from the posterior distribution. We draw 4 chains of 5,000 samples each. If you get a weird error here, it may be because of threading. Try the `mapreduce`-based approach (sketched below) rather than wasting time debugging.
```julia
chn_linear = let
    model = ... # FILL IN
    sampler = externalsampler(DynamicHMC.NUTS())
    n_per_chain = 5000
    nchains = 4
    sample(model, sampler, MCMCThreads(), n_per_chain, nchains; drop_warmup=true)
end
```
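For reference, the `mapreduce`-based alternative draws the chains one at a time and combines them with `chainscat`. A sketch, mirroring the block above:

```julia
chn_linear = let
    model = ... # FILL IN, as above
    sampler = externalsampler(DynamicHMC.NUTS())
    n_per_chain = 5000
    nchains = 4
    # sample each chain serially, then combine into one Chains object
    mapreduce(c -> sample(model, sampler, n_per_chain; drop_warmup=true), chainscat, 1:nchains)
end
```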
```julia
summarystats(chn_linear)
```
Based on \(\hat{R}\) values close to 1, we have some evidence that our chains have converged. We can also check the trace plots:
```julia
... # Fill in to generate the trace plot
```
Fill in the missing lines in the above blocks. Describe what you learn from the trace plot.
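As a hint, the plot recipe that MCMCChains provides (via StatsPlots) draws trace and density panels for every parameter when you plot the chain object directly:

```julia
# trace plots (left) and marginal densities (right) for each parameter
plot(chn_linear)
```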
### 4.4 Posterior predictive checks

We can again plot our predictions, this time using the posterior samples. We add the data to the plot as well.
```julia
p2 = plot(xlabel="Precip", ylabel="Runoff", title="Posterior") # blank plot
pred_linear = lm_predict(chn_linear, X_pred, 500) # generate predictions
for y_pred in pred_linear
    plot!(p2, X_pred, y_pred; color=:black, alpha=0.1, label=false) # add each prediction to the plot
end
# add the observations with scatter!
p2
```
Add the observations to the plot above with `scatter!`.
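A minimal sketch of what that could look like (the color and label are arbitrary choices):

```julia
# overlay the observed (X, y) pairs on the posterior predictive plot
scatter!(p2, X, y; color=:blue, label="observations")
```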
## 5 Multiple predictors

By using a model with only one predictor, we may be throwing away useful information from other variables.
```julia
#| output: false
@model function linear_reg(X::Matrix, y::AbstractVector)
    N, P = size(X) # number of observations, number of predictors

    # priors
    α ~ Normal(0, 5) # intercept
    β ~ MvNormal(zeros(P), 5.0 * I)
    σ ~ LogNormal(2, 0.5)

    # intermediate quantity
    # deterministic transformation
    μ = α .+ X * β

    # likelihood
    y .~ Normal.(μ, σ)

    return μ
end
```

1. `β` is a vector, so we use a multivariate normal distribution. Here, we assume that the predictors are uncorrelated, so we use a diagonal covariance matrix.
2. We use a log-normal distribution for \(\sigma\) because it is a positive quantity. This is an alternative to a truncated Normal.
Uncomment the above model and then modify it to use better priors, iterating with the next step.
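When iterating on priors, it can help to check what a candidate prior actually implies before running the model. For example (the specific quantiles shown are just one choice):

```julia
# What does a LogNormal(2, 0.5) prior say about plausible values of σ?
quantile.(LogNormal(2, 0.5), [0.05, 0.5, 0.95]) # roughly [3.2, 7.4, 16.8]
```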
### 5.1 Prior predictive check
```julia
X = Matrix(runoff_df[:, Not(:runoff)]) # all columns except runoff
y = vec(runoff_df[!, :runoff]) # runoff

ppc_multi = let
    ... # FILL IN
end
```
How should we visualize the prior predictive check? A simple way is to visualize the implied distribution of \(y\) given observed \(X\):
```julia
histogram(
    vcat(lm_predict(ppc_multi, X, 500)...),
    xlabel=L"$y$",
    ylabel=L"Implied $p(y)$",
    label=false,
    normalize=:pdf,
    yticks=false,
    title="Prior predictive check",
)
```
Modify the above code to draw samples from the prior and save them as `ppc_multi`.
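One possible completion, mirroring the single-predictor prior-sampling block above (the sample count is an arbitrary choice):

```julia
ppc_multi = let
    model = linear_reg(X, y) # X is a Matrix here, so the multi-predictor method is used
    sampler = Prior()
    nsamples = 10_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
```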
### 5.2 Inference

Now that our priors seem plausible (we could always do better, but that's not the point of this lab), we can fit the model.

```julia
chn_multi = let
    ... # FILL IN
end
... # summary stats
```
Draw the samples from the posterior and save them as `chn_multi`. Display the summary statistics. What do you learn?
### 5.3 Posterior predictive check
How should we visualize these inferences? It’s hard to plot a single \(x\) vs \(y\) when we have multiple \(x\). There are many possible plots, but a good one is to plot the predicted vs actual values of \(y\). To make the plot more readable, we’ll add some jitter to the data. Here’s a function to do that:
```julia
function jitter(x, y; ϵx=0.1, ϵy=0.1)
    x_jit = x .+ rand(Normal(0, ϵx), size(x))
    y_jit = y .+ rand(Normal(0, ϵy), size(y))
    return x_jit, y_jit
end
```
Now we can create our scatter plot.
```julia
p4 = plot(; xlabel="Predicted Runoff", ylabel="Actual Runoff", aspect_ratio=1) # blank plot
for i in 1:500 # loop through
    pred = lm_predict(chn_multi, X, 1)
    xplt, yplt = jitter(pred, y; ϵx=0.1, ϵy=0.1)
    scatter!(xplt, yplt; color=:black, alpha=0.5, label=false, markersize=0.5)
end
Plots.abline!(p4, 1, 0; color=:red, label="1:1 line")
p4
```
The above block shouldn't need to be modified. Add the curly brackets (i.e., change the fence from `julia` to `{julia}`) when you are ready to run it. What do you learn from this plot?
## 6 Submission

As you did for previous labs, you will submit your lab by pushing it to GitHub. In addition, submit the DOCX or PDF file (as for Lab 2) to Canvas.