Fundamentals of probability distributions and statistics


Lecture

2023-08-28

A quick note on pacing

We will move through this module (“fundamentals”) at a fairly brisk pace

  • Review course slides
  • Ask questions on Canvas or in office hours
  • To help you learn to code, I am exposing you to code early and often
    • I don’t expect you to be able to replicate all the code in this notebook
    • I have added annotations where appropriate
    • The labs will give you practice
    • You will not need to write code from scratch for the exams

Packages in Julia

Today

  1. Packages in Julia

  2. Effect of beer drinking on attractiveness to mosquitos

  3. Probability distributions

  4. Statistics

  5. Wrapup

What is a package?

  • Code that is bundled for easy use
  • Provides functionality that is not part of the base language
    • Most stuff in Julia requires packages, as we will see
  • Need to be installed
  • Developed by the community

Where do I get packages?

Julia has a built-in package manager, Pkg, for installing add-on functionality written in Julia. It also downloads the pre-built binary libraries (artifacts) that many packages depend on, so you rarely need to compile anything yourself.

Where are packages stored?

Each project has an environment, which is defined by the following files (do not edit them manually):

  • Project.toml: lists the specified dependencies of the project
  • Manifest.toml: lists the exact versions of the packages that are used in the project

The actual package code is stored centrally on your computer (in the Julia depot), so you don’t need to worry about where it lives.

Workflow: activate

We activate a project to tell Julia that we want to use the packages in that project. These two sets of steps are equivalent (an example follows the lists):

Using the Pkg API:

  1. Open the REPL
  2. using Pkg
  3. Pkg.activate(".")

Using the Pkg REPL mode:

  1. Open the REPL
  2. Press ] to enter the package manager
  3. activate .
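In code, the two routes look like this (a sketch; run from the project folder):

using Pkg
Pkg.activate(".")    # API form: activate the project in the current folder

# or, equivalently, press ] and type:
# pkg> activate .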

Workflow: install

We add a package to install it in the current project. Again, two equivalent routes (example below):

Using the Pkg API:

  1. Open the REPL
  2. using Pkg
  3. Pkg.add("DataFrames")

Using the Pkg REPL mode:

  1. Open the REPL
  2. Press ] to enter the package manager
  3. add DataFrames
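For example:

using Pkg
Pkg.add("DataFrames")    # or, in Pkg mode: pkg> add DataFrames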

Workflow: instantiate

When working with someone else’s project, we need to install the packages that they use.

  • activate does not install anything, just tells Julia which packages to use
  • instantiate installs everything recorded in the project’s Manifest, making the environment ready to use; if there’s nothing to do, instantiate does nothing (see the sketch below)
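A typical first session with a cloned project might look like this (a sketch; run from the project folder):

using Pkg
Pkg.activate(".")      # use this project's environment
Pkg.instantiate()      # install exactly the versions recorded in Manifest.toml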

Learn more

  1. Pkg.jl docs
    1. See “Using someone else’s project” for more on instantiate
  2. A thorough blog post by Julius Krumbiegel

Lab 01 issues

  • Be sure to submit your assignment
  • Canvas discussion: “Lab 01 Discussion”
  • ERROR: Jupyter kernel 'julia-1.9' not found. x4

Lab 01 fix

  • In order to run code using Quarto, you need the IJulia package
    • It is listed in Manifest.toml, but you need to instantiate to install it
  • If that doesn’t work:
    • Run Pkg.build("IJulia") in the REPL (after you activate and instantiate); see the sketch after this list
  • I’ve updated the instructions
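Putting the fix together, a sketch of the commands (run from the lab folder):

using Pkg
Pkg.activate(".")
Pkg.instantiate()        # installs IJulia and everything else in Manifest.toml
Pkg.build("IJulia")      # only needed if the kernel error persists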

Effect of beer drinking on attractiveness to mosquitos

Today

  1. Packages in Julia

  2. Effect of beer drinking on attractiveness to mosquitos

  3. Probability distributions

  4. Statistics

  5. Wrapup

Stats without the agonizing details

In this class we will use computation and simulation to build fundamental insight into statistical processes without dwelling on “agonizing” details.

Motivating question

Does drinking beer reduce the likelihood of being bitten by mosquitos?

Raw data

Create a variable called beer to hold the number of mosquito bites for beer drinkers:
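The code is a plain vector literal; the values are copied from the output shown below:

beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24, 28,
        19, 24, 29, 18, 20, 17, 31, 20, 25, 28, 21, 27]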

25-element Vector{Int64}:
 27
 20
 21
 26
 27
 31
 24
 21
 20
 19
 23
 24
 28
 19
 24
 29
 18
 20
 17
 31
 20
 25
 28
 21
 27

What is beer?

We can learn a bit more about it:
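For example, the four outputs below come from calls like these (a sketch):

using StatsBase     # provides mean

typeof(beer)    # the container and element type
length(beer)    # number of elements
size(beer)      # dimensions, as a tuple
mean(beer)      # sample average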

Vector{Int64} (alias for Array{Int64, 1})
25
(25,)
23.6

More raw data

We can do the same for water drinkers:

  1. By putting the ; at the end of our statement, we keep the notebook from showing the output

A simple analysis

Let’s calculate the difference between the average number of bites in each group.
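A sketch of the computation, assuming the water-drinker bites are stored in a vector called water (that name is not shown above):

using StatsBase

mean(beer) - mean(water)    # difference in group means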

4.37777777777778
  1. This gives us the mean function from the StatsBase package

The skeptic’s argument

The skeptic asks whether this might be random chance.

  1. We could answer this with a t-test
    1. Determines whether there is a significant difference between the means of two groups
    2. Assumes (approximate) normality
    3. Assumptions hidden behind a software package
  2. Simulation approach:
    1. Suppose the skeptic is right: the two groups are sampled from the same distribution
    2. Shuffle the data (randomly divide it into two groups, as if group membership made no difference)
    3. Calculate the difference between the group means
    4. Repeat many times and examine the distribution of differences

Implementation
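A sketch of the function described by the annotations below; the name get_shuffled_difference comes from the next slide, while the internal variable names are my own:

using Random, StatsBase

function get_shuffled_difference(y1, y2)
    # pool both groups and shuffle them, as if group labels were meaningless
    pooled = shuffle(vcat(y1, y2))
    # re-split into two groups of the original sizes
    g1 = pooled[1:length(y1)]
    g2 = pooled[(length(y1) + 1):end]
    return mean(g1) - mean(g2)
end

get_shuffled_difference(beer, water)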

1.224444444444444
  1. Use the shuffle function from the Random package
  2. Define a function. Its arguments are y1 and y2
  3. end closes the function definition
  4. Call the function with our data

Running

We want to learn about the sampling distribution of the group differences: repeat this experiment many times over and plot the results.
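A sketch; the variable name diffs is referenced on the next slide:

# list comprehension: repeat the shuffled-difference experiment 50,000 times
diffs = [get_shuffled_difference(beer, water) for _ in 1:50_000]
length(diffs)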

50000
  1. This is a list comprehension. It’s a way to create a list by looping over something. Here, we loop over the numbers 1 to 50,000 and call get_shuffled_difference each time.
  2. length tells us the size of a vector

Plotting
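A sketch of the plotting function the annotations below describe; the function name, axis text, and bin edges are my assumptions:

using Plots

function plot_diffs(diffs, obs)
    p = histogram(
        diffs;
        xlabel="Difference in mean bites (shuffled groups)",
        ylabel="Density",
        label="Shuffled differences",
        bins=-6:0.5:6,
        legend=:topleft,
        normalize=:pdf,
    )
    vline!(p, [obs]; label="Observed difference")
    return p
end

plot_diffs(diffs, mean(beer) - mean(water))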

  1. We need the Plots package to make plots
  2. Define a function. Its arguments are diffs and obs
  3. histogram is a function from the Plots package
  4. Create a histogram using the diffs object. ; separates the positional arguments from the keyword arguments
  5. xlabel is a “keyword argument” specifying the text for the x-axis label
  6. the y-axis label
  7. the label to use in the legend
  8. specify the bins to use in the histogram
  9. specify the location of the legend
  10. normalize the histogram so that the area under the curve is 1
  11. add a vertical line (vline!) at the observed difference
  12. many functions return their output – in this case the plot we created from the inputs

Alternative

We could have done this with a parametric test
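A sketch of the two tests whose output is shown below:

using HypothesisTests

HypothesisTests.EqualVarianceTTest(beer, water)      # assumes equal variances
HypothesisTests.UnequalVarianceTTest(beer, water)    # Welch's test; no equal-variance assumption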

  1. We need the HypothesisTests package
  2. We don’t need to include the HypothesisTests., but it adds clarity
  3. Recall: ; suppresses output
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          4.37778
    95% confidence interval: (1.913, 6.843)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0009

Details:
    number of observations:   [25,18]
    t-statistic:              3.5869843832143413
    degrees of freedom:       41
    empirical standard error: 1.220461900604875

Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          4.37778
    95% confidence interval: (1.957, 6.798)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0007

Details:
    number of observations:   [25,18]
    t-statistic:              3.658244539721401
    degrees of freedom:       39.11341478045414
    empirical standard error: 1.196688119190407

Discussion

  • This shuffling approach is a permutation test (a close relative of the bootstrap) and is a very powerful tool in many situations
  • What would we expect to see if the skeptic were correct?
  • P-value:
    • the probability, if the null hypothesis is true, of seeing a difference at least as extreme as the one we observed (computed below)
    • skeptic’s (null) hypothesis: no difference between groups
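One way to compute it from the shuffled differences (a sketch; whether the original comparison was one- or two-sided is not shown):

# fraction of shuffled differences at least as large as the observed difference;
# the dot broadcasts the comparison over every element of diffs
mean(diffs .>= (mean(beer) - mean(water)))

# a two-sided version would compare absolute values:
# mean(abs.(diffs) .>= abs(mean(beer) - mean(water)))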
0.00052
  1. . is the dot operator. It applies the function to each element of the vector individually.

Probability distributions

Today

  1. Packages in Julia

  2. Effect of beer drinking on attractiveness to mosquitos

  3. Probability distributions

  4. Statistics

  5. Wrapup

The Normal distribution

The Normal (Gaussian) distribution has probability distribution function:

\[ p(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{y-\mu}{\sigma} \right)^{\!2} \right) \]

  • Mean \(\mu\)
    • Median equal to mean
  • Variance \(\sigma^2\)
  • Symmetric

Central limit theorem

The central limit theorem says that the sum (or average) of many independent random variables with finite variance is approximately normally distributed.

We can see this with an example:

  1. For each sample \(i = 1, \ldots, N\):
    1. Take J draws from a non-Gaussian distribution \(\mathcal{D}\)
    2. Take the mean and save it as \(\bar{y}_i\)
  2. Plot the distribution of \(\bar{y}_i\)
  1. To type ȳ, type y then \bar and hit tab. Julia allows unicode (even emoji) in variable names
  2. To type ∈, type \in and hit tab. The _ isn’t doing anything special; we could name it i or 😶 or whatever we want, but _ suggests it’s a throwaway variable
  3. This is another list comprehension
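A minimal sketch of the experiment, using a Gamma distribution as the non-Gaussian \(\mathcal{D}\) (the specific parameters, J, and N are illustrative assumptions):

using Distributions, StatsBase, Plots

𝒟 = Gamma(2, 1)     # a skewed, non-Gaussian distribution
J = 100              # draws per sample
N = 10_000           # number of samples

# list comprehension: for each sample, take J draws and record their mean
ȳ = [mean(rand(𝒟, J)) for _ ∈ 1:N]
histogram(ȳ; normalize=:pdf, xlabel="sample mean", label="")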

Notation

We will get tired of writing

\[ p(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{y-\mu}{\sigma} \right)^{\!2} \right) \]

Instead, we will often use shorthand:

\[ y \sim \mathcal{N}(\mu, \sigma^2) \]

Normal PDF

  1. L"<string>" allows us to use LaTeX in strings
  2. This notation specifies the values of \(\mu\) and \(\sigma\)
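A sketch of the kind of plot described by the annotations, with illustrative values of \(\mu\) and \(\sigma\):

using Distributions, Plots, LaTeXStrings

x = range(-6, 6; length=500)
plot(x, pdf.(Normal(0, 1), x); label=L"\mu = 0, \sigma = 1", xlabel=L"y", ylabel=L"p(y)")
plot!(x, pdf.(Normal(0, 2), x); label=L"\mu = 0, \sigma = 2")
plot!(x, pdf.(Normal(2, 1), x); label=L"\mu = 2, \sigma = 1")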

Bernoulli distribution

A Bernoulli distribution models a single coin flip: the outcome is 1 (success) with probability p and 0 otherwise.
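A sketch (the value of p used in class is not shown, so 0.5 is an assumption):

using Distributions

p = 0.5
rand(Bernoulli(p), 5)    # draw 5 samples; each is true (1) with probability p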

5-element Vector{Bool}:
 0
 0
 0
 0
 0
  1. Draw 5 samples from the Bernoulli distribution with parameter p

Binomial distribution

A Binomial distribution models the number of successes in n consecutive flips of the same coin.
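A sketch; n = 5 is consistent with the output below, and p = 0.5 is an assumption:

using Distributions

n, p = 5, 0.5
rand(Binomial(n, p), 5)    # 5 draws, each the number of successes in n flips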

5-element Vector{Int64}:
 3
 4
 5
 1
 3

Multinomial distribution

The Multinomial extends the Binomial to multiple categories. Note that p is a vector. If there are 2 categories (\(K=2\)), it’s just the Binomial with \(p_\text{multinomial} = [p, 1-p]\).
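A sketch using the probabilities from the annotation below:

using Distributions

p = [0.5, 0.3, 0.2]      # probabilities for K = 3 categories
d = Multinomial(5, p)    # 5 trials per draw
rand(d, 5)               # 5 draws; each column of the result sums to 5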

3×5 Matrix{Int64}:
 3  3  0  4  2
 2  0  0  1  1
 0  2  5  0  2
  1. To be more concise, we could write rand(Multinomial(5, [0.5, 0.3, 0.2]), 5). Which is more readable?

Poisson distribution

The Poisson distribution is used to model count data. It is the limit of a Binomial distribution with \(p=\lambda/N\), as \(N \rightarrow \infty\).

A Poisson distribution has mean and variance equal to \(\lambda\).
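A sketch (the value of \(\lambda\) is illustrative):

using Distributions

λ = 3
rand(Poisson(λ), 10)    # draw 10 samples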

10-element Vector{Int64}:
 1
 4
 4
 1
 3
 3
 3
 2
 2
 4
  1. The Poisson distribution has one parameter, \(\lambda\)
  2. Draw 10 samples from the Poisson distribution

Negative binomial distribution

The NegativeBinomial distribution relaxes the Poisson’s assumption that \(\text{mean} = \text{variance}\).

This distribution models the number of successes in a sequence of independent and identically distributed Bernoulli trials with success probability p before a specified (non-random) number of failures (r) occurs. For example, we can define rolling a 6 on a die as a failure and rolling any other number as a success, and ask how many successful rolls will occur before we see the third failure (success probability p = 5/6 and r = 3 failures).
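A sketch of the dice example. Note that Distributions.jl parameterizes NegativeBinomial(r, p) as the number of failures before the r-th success, where p is the probability of the terminating event, so rolling a 6 plays the role of the library's "success" here:

using Distributions

d = NegativeBinomial(3, 1/6)    # count non-6 rolls before the third 6
mean(d), var(d)                 # 15.0 and 90.0: mean differs from variance, unlike the Poisson
rand(d, 10)                     # draw 10 samples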

What other distributions do you know?

  • Uniform
  • Exponential
  • Gamma (see above)
  • Beta
  • Pareto
  • Student t
  • Boltzmann
  • Many more!

Statistics

Today

  1. Packages in Julia

  2. Effect of beer drinking on attractiveness to mosquitos

  3. Probability distributions

  4. Statistics

  5. Wrapup

Mean

The mean of a sample is just the sample average: \[ \bar{y} = \frac{1}{N} \sum_{i=1}^N y_i \]

The mean of a distribution is the expected value of the distribution: \[ \mathbb{E}(u) = \int u p(u) \, du \]

Variance

Variance measures how points differ from the mean

You may be familiar with sample variance: \[ S^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} \]

For a distribution: \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u))^2 p(u) \, du \] or, for a vector \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u)) (u - \mathbb{E}(u))^T p(u) \, du \]
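In Julia, the sample versions are one-liners (using the beer data from earlier; var uses the n - 1 denominator by default):

using Statistics

mean(beer), var(beer)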

Wrapup

Today

  1. Packages in Julia

  2. Effect of beer drinking on attractiveness to mosquitos

  3. Probability distributions

  4. Statistics

  5. Wrapup

Coming up

  • Wednesday: working with probability distributions
  • Friday:
    • Lab 02: Working with tabular data in Julia
    • Lab 01 due at start of class

Office hours

If you haven’t filled out the Doodle, please do so ASAP