Fundamentals of probability distributions and statistics

Lecture

2023-08-28

A quick note on pacing

We will move through this module (“fundamentals”) at a fairly brisk pace

Review course slides
Ask questions on Canvas or in office hours
To help you learn to code, I am exposing you to code early and often
- I don’t expect that you are able to replicate all the code in this notebook
- I have added annotations where appropriate
- The labs will give you practice
- You will not need to write code from scratch for the exams

Packages in Julia

Today

Packages in Julia
Effect of beer drinking on attractiveness to mosquitos
Probability distributions
Statistics
Wrapup

What is a package?

Code that is bundled for easy use
Provides functionality that is not part of the base language
- Most stuff in Julia requires packages, as we will see
Need to be installed
Developed by the community

Where do I get packages?

Julia has a built-in package manager for installing add-on functionality written in Julia. It can also install external libraries using your operating system’s standard system for doing so, or by compiling from source.

Where are packages stored?

Each project has an *environment, which is defined by the following files (do not edit them manually):

Project.toml: lists the specified dependencies of the project
Manifest.toml: lists the exact versions of the packages that are used in the project

The actual packages are stored on your computer and you don’t need to worry

Workflow: activate

We activate a project to tell Julia that we want to use the packages in that project. These steps are equivalent:

Open the REPL
using Pkg
Pkg.activate(".")

Open the REPL
Press ] to enter the package manager
activate .

Workflow: install

We add a package to install it in the current project

Open the REPL
using Pkg
Pkg.add("DataFrames")

Open the REPL
Press ] to enter the package manager
add DataFrames

Workflow: instantiate

When working with someone else’s project, we need to install the packages that they use.

activate does not install anything, just tells Julia which packages to use
instantiate is your friend to make sure an environment is ready to use. If there’s nothing to do, instantiate does nothing.

Learn more

Pkg.jl docs
1. See “Using someone else’s projecg” for more on instantiate
Well-worked blog post by Julies Krumbiegel

Lab 01 issues

Be sure to submit your assignment
Canvas discussion: “Lab 01 Discussion”
ERROR: Jupyter kernel 'julia-1.9' not found. x4

Lab 01 fix

In order to run codes using Quarto, you need the IJulia package
- Listed in Manifest.toml but you need to instantiate
If that doesn’t work:
- Run Pkg.build("IJulia") in the REPL (after you activate and instantiate)
I’ve updated the instructions

Effect of beer drinking on attractiveness to mosquitos

Today

Packages in Julia
Effect of beer drinking on attractiveness to mosquitos
Probability distributions
Statistics
Wrapup

Stats without the agonizing details

In this class we will use computation and simulation to build fundamental insight into statistical processes without dwelling on “agonizing” details.

Motivating question

Does drinking beer reduce the likelihood of being bitten by mosquitos?

Raw data

Create a variable called beer to hold the number of mosquito bites for beer drinkers:

25-element Vector{Int64}:
 27
 20
 21
 26
 27
 31
 24
 21
 20
 19
 23
 24
 28
 19
 24
 29
 18
 20
 17
 31
 20
 25
 28
 21
 27

What is `beer`?

We can learn a bit more about it:

Vector{Int64} (alias for Array{Int64, 1})

(25,)

23.6

More raw data

We can do the same for water drinkers:

By putting the ; at the end of our statement, we keep the notebook from showing the output

A simple analysis

Let’s calculate the difference between the average number of bites in each group.

4.37777777777778

This gives us the mean function from the StatsBase package

The skeptic’s argument

The skeptic asks whether this might be random chance.

We could answer this with a T test
1. Determine if there is a significant difference between the means of two groups
2. Assumes (approximate) normality
3. Assumptions hidden behind a software package
Simulation approach:
1. Suppose the skeptic is right – the two groups are samped from the same distribution
2. Shuffle the data (randomly divide into two groups by assuming that there is no difference between the two groups)
3. Calculate the difference between each group
4. Repeat many times and examine the distribution of differences

Implementation

1.224444444444444

Use the shuffle function from the Random package
Define a function. Its arguments are y1 and y2
end closes the function definition
Call the function with our data

Running

We want to learn about the sampling distribution of the group differences: repeat this experiment many times over and plott the results

This is a list comprehension. It’s a way to create a list by looping over something. Here, we loop over the numbers 1 to 50,000 and call get_shuffled_difference each time.
length tells us the size of a vector

Plotting

We need the Plots package to make plots
Define a function. Its arguments are diffs and obs
histogram is a function from the Plots package
Create a histogram using the diffs object. ; separates the positional arguments from the keyword arguments
xlabel is a “keyword argument” specifying the text for the x-axis label
the y-axis label
the label to use in the legend
specify the bins to use in the histogram
specify the location of the legend
normalize the histogram so that the area under the curve is 1
add a vertical line (vline!) at the observed difference
many functions return their output – in this case the plot we created from the inputs

Alternative

We could have done this with a parametric test

We need the HypothesisTests package
We don’t need to include the HypothesisTests., but it adds clarity
Recall: ; suppresses output

Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          4.37778
    95% confidence interval: (1.913, 6.843)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0009

Details:
    number of observations:   [25,18]
    t-statistic:              3.5869843832143413
    degrees of freedom:       41
    empirical standard error: 1.220461900604875

Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          4.37778
    95% confidence interval: (1.957, 6.798)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0007

Details:
    number of observations:   [25,18]
    t-statistic:              3.658244539721401
    degrees of freedom:       39.11341478045414
    empirical standard error: 1.196688119190407

Discussion

This is called a bootstrap and is a very powerful tool in many situations
What would we expect to see if the skeptic was correct?
P-value:
- the likelihood of the data if the null hypothesis is correct
- skeptic’s (null) hypothesis: no difference between groups

0.00052

. is the dot operator. It applies the function to each element of the vector individually.

Probability distributions

Today

Packages in Julia
Effect of beer drinking on attractiveness to mosquitos
Probability distributions
Statistics
Wrapup

The Normal distribution

The Normal (Gaussian) distribution has probability distribution function:

\[ p(y | \mu, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^{\!2} \, \right) \]

Mean \(\mu\)
- Median equal to mean
Variance \(\sigma^2\)
Symmetric

Central limit theorem

The central limit theorem says that the sum of many independent random variables is approximately normally distributed.

We can see this with an example:

For each sample \(i = 1, \ldots, N\):
1. Draw J draws from a non-Gaussian distribution \(\mathcal{D}\)
2. Take the mean and save it as \(\bar{y}_i\)
Plot the distribution of \(\bar{y}_i\)

To type ȳ, type y then type \bar and hit tab. Julia allows unicode (or emojis) in variable names
To type ∈ , type \in and hit tab. The _ isn’t doing anything special and we could name it i or 😶 or whatever we want but _ suggests it’s a throwaway
This is another list comprehension

Notation

We will get tired of writing

\[ p(y | \mu, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^{\!2} \, \right) \]

Instead, we will often use shorthand:

\[ y \sim \mathcal{N}(\mu, \sigma^2) \]

Normal PDF

L"<string>" allows us to use LaTeX in strings
This notation specifies the values of \(\mu\) and \(\sigma\)

Bernoulli distribution

A Bernoulli distribution models a coin flip.

5-element Vector{Bool}:
 0
 0
 0
 0
 0

Draw 5 samples from the Bernoulli distribution with parameter p

Binomial distribution

A Binomial distribution models the distribution of n consecutive flips of the same coin

5-element Vector{Int64}:
 3
 4
 5
 1
 3

Multinomial distribution

The Multinomial extends the Binomial to multiple categories. Note that p is a vector. If there are 2 categories (\(K=2\)), it’s just the binomial with \(p_\text{multinomial} = [p, 1-p]\).”

3×5 Matrix{Int64}:
 3  3  0  4  2
 2  0  0  1  1
 0  2  5  0  2

To be more concise, we could write rand(Multinimial([0.5, 0.3, 0.2], 5), 5). Which is more readable?

Poisson distribution

The Poisson distribution is used to model count data. It is the limit of a Binomial distribution with \(p=\lambda/N\), as \(N \rightarrow \infty\).

A Poisson distribution has mean and variance equal to \(\lambda\).

10-element Vector{Int64}:
 1
 4
 4
 1
 3
 3
 3
 2
 2
 4

The Poisson distribution has one parameter, \(\lambda\)
Draw 10 samples from the Poisson distribution

Negative binomial distribution

The NegativeBinomaial distribution relaxes the Poisson’s assumotion that \(\text{mean} = \text{variance}\).

This distribution models the number of successes in a sequence of independent and identically distributed Bernoulli trials with probability p before a specified (non-random) number of failures (r) occurs. For example, we can define rolling a 6 on a dice as a failure, and rolling any other number as a success, and ask how many successful rolls will occur before we see the third failure (p = 1/6 and r = 3).

What other distributions do you know?

Uniform
Exponential
Gamma (see above)
Beta
Pareto
Student t
Boltzmann
Many more!

Statistics

Today

Packages in Julia
Effect of beer drinking on attractiveness to mosquitos
Probability distributions
Statistics
Wrapup

Mean

The mean of a sample is just the sample average: \[ \bar{y} = \frac{1}{N} \sum_{i=1}^N y_i \]

The mean of a distribution is the expected value of the distribution: \[ \mathbb{E}(u) = \int u p(u) \, du \]

Variance

Variance measures how points differ from the mean

You may be familiar with sample variance: \[ S^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} \]

For a distribution: \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u))^2 p(u) \, du \] or, for a vector \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u)) (u - \mathbb{E}(u))^T p(u) \, du \]

Wrapup

Today

Packages in Julia
Effect of beer drinking on attractiveness to mosquitos
Probability distributions
Statistics
Wrapup

Coming up

Wednesday: working with probability distributions
Friday:
- Labs” 02: Working with tabular data in Julia
- Lab 01 due at start of class

Office hours

If you haven’t filled out the Doodle, please do so ASAP

Fundamentals of probability distributions and statistics

A quick note on pacing

Packages in Julia

What is a package?

Where do I get packages?

Where are packages stored?

Workflow: activate

Workflow: install

Workflow: instantiate

Learn more

Lab 01 issues

Lab 01 fix

Effect of beer drinking on attractiveness to mosquitos

Stats without the agonizing details

Motivating question

Raw data

What is beer?

More raw data

A simple analysis

The skeptic’s argument

Implementation

Running

Plotting

Alternative

Discussion

Probability distributions

The Normal distribution

Central limit theorem

Notation

Normal PDF

Bernoulli distribution

Binomial distribution

Multinomial distribution

Poisson distribution

Negative binomial distribution

What other distributions do you know?

Statistics

Mean

Variance

Wrapup

Coming up

Office hours

What is `beer`?