In the wake of a severe flood, an insurance company has comissioned you to study flood damage in one of the most-affected neighborhoods. This neighborhood is perfectly flat, so we can assume that all houses experienced the same flood depth. However, they are elevated to different heights, use different materials, and are built to different standards, and as a result they experienced different amounts of damage. Specifically, the insurance company has provided you with a dataset of the fraction of the value of each house that was lost in the flood and asked you to model the distribution of losses.
1.1 Mathematical model
Our model for the loss fraction \(y_i\) for the \(i\)th house is \[
y_i | \alpha, \beta \sim \text{Beta}(\alpha, \beta)
\] where \(\alpha\) and \(\beta\) are the shape parameters of the Beta Distribution. Because we are not including any explanatory information in our model, we are assuming that the distribution of loss fractions is the same for all houses. This is reasonable for our neighborhood, but would not be applicable to another neighborhood or a different flood event.
The above notation is shorthand for \[
p(y_i | \alpha, \beta) =
\frac{x^{\alpha-1}(1-x)^{\beta-1}} {\mathrm{B}(\alpha,\beta)}\!
\] where \(\mathrm {B} (\alpha ,\beta )={\frac {\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}}\) and \(\Gamma\) is the Gamma function. We will work directly with the pdf and logpdf functions in the Distributions package, so you don’t need to memorize this formula..
1.2 Setup
As for labs 1 and 2, make sure you follow the three standard initial setup steps:
Open the lab 3 folder in VS Code. Do NOT open a “parent” directory containing lab 3. If you’re not sure what folder you’re in, open the Juila REPL and type pwd(). It should say something like /.../lab03-username....
Activate the project environment: ] to enter package mode then activate .. Don’t forget the . at the end, it’s very important.
Install all required packages: ] to enter package mode, then instantiate.
At this point, check to make sure you can render the document. In VS Code, open the command palette (Windows: Ctrl+Shift+P, Mac: Cmd+Shift+P) and type Render: HTML. If this gives you trouble, try the following:
] in the Julia REPL to enter Package mode
build IJulia
If this still gives you trouble, ask for help on Canvas or in-person.
1.3 Package imports
As usual we start by specifying the packages we are using
When we loop through our values of α, we zip it with a vector of colors so that we can plot each value of α in a different color.
3
Similarly, we can attach each value of β to a different linestyle.
4
plot!(p, ...) will add more elements to the plot p that we defined above.
5
Using the StatsPlots package, we can plot a distribution by passing it to plot! with no additional arguments.
2 First steps
While the insurance company is aggregating and collecting all the data, they ask us to go ahead and get started developing a workflow. They fax (!!!) over the first three data points, rounded to two decimals, which we can read in as:
We use the joinpath function to join together the path to the data directory and the filename. This is a good practice to make sure that your code works on different operating systems.
2
We use the readdlm function from DelimitedFiles.jl to read the data. It is a Matrix bby default so we use [:, 1] to select all rows (:) and the first column (there is only 1).
3-element Vector{Float64}:
0.51
0.36
0.55
Recall that we can do things like
length(fax)
3
and
typeof(fax)
Vector{Float64} (alias for Array{Float64, 1})
to learn more about our data.
2.1 Likelihood model
functionlog_lik(y::Vector{T}, α::T, β::T) where {T<:Real}# fill in here# refer to lecture slides for syntax and function names# don't forget your `return` statement!end
Fill in the log_lik function where y is a vector of data, and α and β are the parameters of the Beta distribution. Then, convert it to live code by adding curly brackets, like you did in lab 02. Hint: define a function function log_lik(y::T, α::T, β::T) where {T<:Real}... that takes in a single point. Then this function that takes in a vector y can calculate the log-likelihood for each data point individually and combine them. This is not the only way to solve this problem
2.2 Check if it’s right
We can check your log_lik function by comparing what you calculate using it to a known, correct value.
your_val =log_lik([0.2, 0.4, 0.6, 0.8], 5., 5.) # calls your implementationtrue_value =-0.2947032775653282# I calculated this@assertisapprox(your_val, true_value) # checks if they're closeprintln("😁")
Convert this to a live code block by adding curly brackets, like you did in lab 02, and run. You should see a 😄.
2.3 Plot
Now that we have the log_likelihood, we’re going to plot it for many values of \(\alpha\) and \(\beta\).
First, let’s define a grid of values for \(\alpha\) and \(\beta\) to plot:
Define them to be different lengths so that if we make an indexing mistake, we should get an error.
Next, we can calculate the log likelihood at each point on the grid of α_plot and β_plot as:
log_lik_fax = [log_lik(fax, αi, βi) for αi in α_plot, βi in β_plot]# ensure that the dimensions are correct@assertsize(log_lik_fax) == (length(α_plot), length(β_plot))display(size(log_lik_fax))
Now, we’re ready to plot. You can use the following code, which provides some helpful keyword arguments to make your plot look nice. Feel free to play around with them.
Note that we plot log_lik_fax' instead of log_lik_fax. That is equivalent to transpose(log_lik_fax). This is a quirk of plots.jl syntax (see here). We avoid confusion by checking the size of log_lik_fax above.
Note that we have assigned our plot a variable name, p1. This will let us add elements to it later.
Add curly brackets to the code blocks in this section once your log_lik function is working.
2.4 Best estimates
What values of α_plot and β_plot that maximize the log likelihood? We can find out in several different ways.
2.4.1 Optimization on a grid
The easiest thing to do is to find the maximum value of the log likelihood on our grid. We can do that as follows:
idx_fax =argmax(log_lik_fax) # the index that maximizes the log likelihoodα_fax_best = α_plot[idx_fax[1]]β_fax_best = β_plot[idx_fax[2]]
Add curly brackets to the code blocks in this section once your log_lik function is working.
2.4.2Optim.jl
We can also use the optimize function from Optim.jl to find the maximum value of the log likelihood.
# define the function to be optimizedloss_fax(θ) =# define the negative log-likelihood here, using your `log_lik` function and passing in `fax` as the `y` argument.lower = [0.001, 0.001] # lower bound -- don't need to changeupper = [Inf, Inf] # upper bound -- don't need to editguess = [1.0, 1.0] # initial guess -- leave as-isres =optimize(loss_fax, lower, upper, guess) # run the optimizationθ_fax = Optim.minimizer(res) # get the best parameters
Add curly brackets to the code blocks in this section once you have implemented everything. Follow the commented instructions.
2.4.3Distributions.jl
As an alternative to Optim.jl, we can use the fit_mle function from Distributions.jl for most distributions. We can use this to check our work:
Distributions.fit_mle(Beta, fax)
Add curly brackets to the code blocks in this section once you have implemented everything above. Compare the fit_mle result to your result using optimize.
2.5 Update the plot
These best estimates should show up as points on our plot. We can add them as follows:
When we add a single point, we have to wrap it in brackets [] to make it a vector. The x and y inputs to plot (or scatter!) need to be vectors.
Add curly brackets to the code blocks in this section once you have implemented everything above.
3 Email data
Pleased with our preliminary analysis, the insurance company emails us another batch of data. This includes 20 observations (our original 3 plus 17 more), still rounded to two decimals, which we can read in as:
display does the same thing as just typing the variable name, but it’s useful when you want to display multiple things at once.
20
Vector{Float64} (alias for Array{Float64, 1})
They ask us to regenerate the plot with this new data. Because we know that the full dataset will be arriving soon, and we don’t want to copy and paste everything three times, we decide to write a function that takes in the data and returns:
The plot
The best estimates from the grid search
The maximum likelihood estimates
We can define such a function as follows:
functioninsurance_analysis(y::Vector{T}) where {T<:Real}# fill in here# refer to lecture slides for syntax and function names# don't forget your `return` statement!# DEFINE THE GRID TO SEARCH OVER α_plot =exp.(-4:0.05:4.05) # good to define this inside the function β_plot =exp.(-4:0.05:4) # the exp ensures > 0# CALCULATE THE LIKELIHOOD ON THE GRID USING THE log_like function defined above# hint: use the list comprehension syntax insurance_log_lik =#...# CALCULATE THE BEST PARAMETER FROM THE GRID SEARCH θ_gridsearch =# ...# CALCULATE THE MAXIMUM LIKELIHOOD ESTIMATE USING OPTIMIZATION θ_MLE =# ...# build the plot (definitely multiple lines --# put each argument on a new line as above for clarity plt =plot( α_plot, β_plot, insurance_log_lik;...# add additional keyword arguments here )# CONCLUDE WITH A RETURN STATEMENT# we are allowed to return more than one thing (technically a `Tuple`)return plt, θ_MLE, θ_gridsearchend;
We can check this analysis on our fax dataset and make sure it’s the same as what we did above:
Implement the insurance_analysis function above. You should copy and paste the code from above, but replace fax with y and update other variable names as appropriate. The plot should show the log-likelihood you implemented, as we did above, and should also include the MLE and grid-search estimate as clearly labeled points.
4 All data
Eventually, the insurance company sends us all the data they have collected. This includes 1000 observations with no rounding, which we can read in as:
Look at the three plots and compare them. What is happening to the function \(\log p(y | \alpha, \beta)\) as we collect more data? What is happening to our point estimates?
6 Submission instructions
As you did for lab 2, you will submit your lab by pushing it to GitHub. In addition, submit the DOCX file (as for lab 2) to Canvas.