Lab 07

Convergence, Central Limits, and Fat Tails

Module 3
Labs
Published

Fri., Nov. 3

1 Intro

An important idea in statistics is the Central Limit Theorem (CLT), which states (much more elegantly and precisely) that the sum of many independent random variables is approximately normally distributed, even if the underlying process is not randomly distributed (as long as it’s moderately well-behaved). This is often used to justify approximating the average of many data points, and the uncertainty in this average, as a normal distribution.

The CLT is a powerful tool, but intuition we develop about working with distributions that behave “nicely” can be misleading when we are working with extreme value distributions. The purpose of this lab is to explore this idea and to build some intuition about sampling uncertainty and sample statistics when working with extreme value distributions.

1.1 Setup

Remember to git clone to your machine, then activate and instantiate the environment, as we have done in previous labs. If you are having trouble, you may want to build IJulia in the Pkg manager.

Do not use additional packages for this assignment, though you are welcome to look at the documentation or implementation of other packages for nearest-neighbors methods.

using Distributions
using Plots
using StatsBase
using StatsPlots

Plots.default(; margin=4Plots.mm, size=(700, 400), linewidth=2)

2 Normal distribution

To build some intuition, let’s create a Normal distribution with a very large standard deviation relative to its mean:

dist_normal = Normal(2.5, 10)

We can calculate its mean and standard deviation

mean(dist_normal), std(dist_normal)
(2.5, 10.0)

Because we’re going to be plotting a lot of distributions, let’s define a function to do this for us:

function plot_dist(dist; name="", xlims=missing)
    ub = quantile(dist, 0.998)
    lb = quantile(dist, 0.002)
    p = plot(x -> pdf(dist, x); ylabel="Probability Density", label=name, xlims=(lb, ub))
    !ismissing(xlims) && xlims!(p, xlims)
    return p
end

plot_dist(dist_normal; name="Normal Distribution")

We can see that even though this distribution has a large standard deviation, it still has that classical “bell” shape. We can also sample from its probability density function and plot a histogram of those samples. That output should look like the probability density function shown above!

samples_normal = rand(dist_normal, 100_000)
histogram(samples_normal; normalize=:pdf, label="Samples", ylabel="Probability Density")
plot!(dist_normal; label="True", linewidth=4)

We can see that our large sample size leads to us being able to sample well from the mean.

2.1 Running statistics

An important idea in statistics is that we can calculate descriptive statistics from our data. For example, things like the mean, standard deviation, skewness, or 99th percentile. An interesting question to ask is how many observations we need to get a reliable estimate of these “sample statistics”.

Let’s start by looking at the mean with our 100,000 samples. We could do this with a loop, but it turns out to be inefficient. Instead, we’ll define some more efficient functions

cumul_mean(x) = cumsum(x) ./ (1:length(x))
cumul_std(x) = sqrt.(cumsum(x .- cum_mean(x)) .^ 2 ./ (1:length(x)))

Now let’s plot the cumulative mean of our samples.

plot(
    cumul_mean(samples_normal); label="Cumulative Mean", ylabel="Mean", xlabel="Sample Size"
)
hline!([mean(dist_normal)]; label="True Mean")