```
25-element Vector{Int64}:
27
20
21
26
27
31
24
21
20
19
23
24
28
19
24
29
18
20
17
31
20
25
28
21
27
```

Lecture

2023-08-28

We will move through this module (“fundamentals”) at a fairly brisk pace

- Review course slides
- Ask questions on Canvas or in office hours
- To help you learn to code, I am exposing you to code early and often
- I don’t expect that you are able to replicate all the code in this notebook
- I have added annotations where appropriate
- The labs will give you practice
- You will not need to write code from scratch for the exams

Today

Packages in Julia

Effect of beer drinking on attractiveness to mosquitos

Probability distributions

Statistics

Wrapup

- Code that is bundled for easy use
- Provides functionality that is not part of the base language
- Most stuff in Julia requires packages, as we will see

- Need to be installed
- Developed by the community

Julia has a built-in package manager for installing add-on functionality written in Julia. It can also install external libraries using your operating system’s standard system for doing so, or by compiling from source.

Each *project* has an *environment, which is defined by the following files (do not edit them manually):

`Project.toml`

: lists the*specified dependencies*of the project`Manifest.toml`

: lists the*exact versions*of the packages that are used in the project

The actual packages are stored on your computer and you don’t need to worry

We `activate`

a project to tell Julia that we want to use the packages in that project. These steps are equivalent:

- Open the REPL
`using Pkg`

`Pkg.activate(".")`

- Open the REPL
- Press
`]`

to enter the package manager `activate .`

We `add`

a package to install it in the current project

- Open the REPL
`using Pkg`

`Pkg.add("DataFrames")`

- Open the REPL
- Press
`]`

to enter the package manager `add DataFrames`

When working with someone else’s project, we need to install the packages that they use.

`activate`

does not install anything, just tells Julia which packages to use`instantiate`

is your friend to make sure an environment is ready to use. If there’s nothing to do, instantiate does nothing.

`Pkg.jl`

docs- See “Using someone else’s projecg” for more on
`instantiate`

- See “Using someone else’s projecg” for more on
- Well-worked blog post by Julies Krumbiegel

- Be sure to submit your assignment
- Canvas discussion: “Lab 01 Discussion”
`ERROR: Jupyter kernel 'julia-1.9' not found.`

x4

- In order to run codes using Quarto, you need the
`IJulia`

package- Listed in
`Manifest.toml`

but you need to`instantiate`

- Listed in
- If that doesn’t work:
- Run
`Pkg.build("IJulia")`

in the REPL (after you`activate`

and`instantiate`

)

- Run
- I’ve updated the instructions

Today

Packages in Julia

Effect of beer drinking on attractiveness to mosquitos

Probability distributions

Statistics

Wrapup

In this class we will use computation and simulation to build fundamental insight into statistical processes without dwelling on “agonizing” details.

Does drinking beer reduce the likelihood of being bitten by mosquitos?

Create a *variable* called `beer`

to hold the number of mosquito bites for beer drinkers:

```
25-element Vector{Int64}:
27
20
21
26
27
31
24
21
20
19
23
24
28
19
24
29
18
20
17
31
20
25
28
21
27
```

`beer`

?We can learn a bit more about it:

Vector{Int64} (alias for Array{Int64, 1})

`25`

`(25,)`

`23.6`

We can do the same for water drinkers:

- By putting the
`;`

at the end of our statement, we keep the notebook from showing the output

Let’s calculate the difference between the average number of bites in each group.

`4.37777777777778`

- This gives us the
`mean`

function from the`StatsBase`

package

The skeptic asks whether this might be random chance.

- We could answer this with a T test
- Determine if there is a significant difference between the means of two groups
- Assumes (approximate) normality
- Assumptions hidden behind a software package

- Simulation approach:
- Suppose the skeptic is right – the two groups are samped from the same distribution
- Shuffle the data (randomly divide into two groups by assuming that there is no difference between the two groups)
- Calculate the difference between each group
- Repeat many times and examine the distribution of differences

`1.224444444444444`

- Use the
`shuffle`

function from the`Random`

package - Define a function. Its arguments are
`y1`

and`y2`

`end`

closes the function definition- Call the function with our data

We want to learn about the *sampling distribution* of the group differences: repeat this experiment many times over and plott the results

`50000`

- This is a
*list comprehension*. It’s a way to create a list by looping over something. Here, we loop over the numbers 1 to 50,000 and call`get_shuffled_difference`

each time. `length`

tells us the size of a vector

- We need the
`Plots`

package to make plots - Define a function. Its arguments are
`diffs`

and`obs`

`histogram`

is a function from the`Plots`

package- Create a histogram using the
`diffs`

object.`;`

separates the positional arguments from the keyword arguments `xlabel`

is a “keyword argument” specifying the text for the x-axis label- the y-axis label
- the label to use in the legend
- specify the bins to use in the histogram
- specify the location of the legend
- normalize the histogram so that the area under the curve is 1
- add a vertical line (
`vline!`

) at the observed difference - many functions
*return*their output – in this case the plot we created from the inputs

We could have done this with a parametric test

- We need the
`HypothesisTests`

package - We don’t need to include the
`HypothesisTests.`

, but it adds clarity - Recall:
`;`

suppresses output

```
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: 4.37778
95% confidence interval: (1.913, 6.843)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0009
Details:
number of observations: [25,18]
t-statistic: 3.5869843832143413
degrees of freedom: 41
empirical standard error: 1.220461900604875
```

```
Two sample t-test (unequal variance)
------------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: 4.37778
95% confidence interval: (1.957, 6.798)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0007
Details:
number of observations: [25,18]
t-statistic: 3.658244539721401
degrees of freedom: 39.11341478045414
empirical standard error: 1.196688119190407
```

- This is called a
*bootstrap*and is a very powerful tool in many situations - What would we expect to see if the skeptic was correct?
- P-value:
- the
*likelihood*of the data if the*null hypothesis*is correct - skeptic’s (null) hypothesis: no difference between groups

- the

`0.00052`

`.`

is the*dot operator*. It applies the function to each element of the vector individually.

Today

Packages in Julia

Effect of beer drinking on attractiveness to mosquitos

Probability distributions

Statistics

Wrapup

The Normal (Gaussian) distribution has *probability distribution function*:

\[ p(y | \mu, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^{\!2} \, \right) \]

- Mean \(\mu\)
- Median equal to mean

- Variance \(\sigma^2\)
- Symmetric

The *central limit theorem* says that the sum of many independent random variables is approximately normally distributed.

We can see this with an example:

- For each sample \(i = 1, \ldots, N\):
- Draw
`J`

draws from a non-Gaussian distribution \(\mathcal{D}\) - Take the mean and save it as \(\bar{y}_i\)

- Draw
- Plot the distribution of \(\bar{y}_i\)

- To type
`ȳ`

, type`y`

then type`\bar`

and hit`tab`

. Julia allows unicode (or emojis) in variable names - To type
`∈`

, type`\in`

and hit`tab`

. The`_`

isn’t doing anything special and we could name it`i`

or 😶 or whatever we want but`_`

suggests it’s a throwaway - This is another
*list comprehension*

We will get tired of writing

\[ p(y | \mu, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^{\!2} \, \right) \]

Instead, we will often use shorthand:

\[ y \sim \mathcal{N}(\mu, \sigma^2) \]

`L"<string>"`

allows us to use LaTeX in strings- This notation specifies the values of \(\mu\) and \(\sigma\)

A Bernoulli distribution models a coin flip.

```
5-element Vector{Bool}:
0
0
0
0
0
```

- Draw 5 samples from the Bernoulli distribution with parameter
`p`

A Binomial distribution models the distribution of `n`

consecutive flips of the same coin

```
5-element Vector{Int64}:
3
4
5
1
3
```

The Multinomial extends the Binomial to multiple categories. Note that `p`

is a *vector*. If there are 2 categories (\(K=2\)), it’s just the binomial with \(p_\text{multinomial} = [p, 1-p]\).”

```
3×5 Matrix{Int64}:
3 3 0 4 2
2 0 0 1 1
0 2 5 0 2
```

- To be more concise, we could write
`rand(Multinimial([0.5, 0.3, 0.2], 5), 5)`

. Which is more readable?

The Poisson distribution is used to model count data. It is the limit of a Binomial distribution with \(p=\lambda/N\), as \(N \rightarrow \infty\).

A Poisson distribution has mean and variance equal to \(\lambda\).

```
10-element Vector{Int64}:
1
4
4
1
3
3
3
2
2
4
```

- The Poisson distribution has one parameter, \(\lambda\)
- Draw 10 samples from the Poisson distribution

The `NegativeBinomaial`

distribution relaxes the Poisson’s assumotion that \(\text{mean} = \text{variance}\).

This distribution models the number of successes in a sequence of independent and identically distributed Bernoulli trials with probability `p`

before a specified (non-random) number of failures (`r`

) occurs. For example, we can define rolling a 6 on a dice as a failure, and rolling any other number as a success, and ask how many successful rolls will occur before we see the third failure (`p = 1/6`

and `r = 3`

).

- Uniform
- Exponential
- Gamma (see above)
- Beta
- Pareto
- Student t
- Boltzmann
- Many more!

Today

Packages in Julia

Effect of beer drinking on attractiveness to mosquitos

Probability distributions

Statistics

Wrapup

The mean of a sample is just the sample average: \[ \bar{y} = \frac{1}{N} \sum_{i=1}^N y_i \]

The mean of a distribution is the expected value of the distribution: \[ \mathbb{E}(u) = \int u p(u) \, du \]

Variance measures how points differ from the mean

You may be familiar with sample variance: \[ S^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} \]

For a distribution: \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u))^2 p(u) \, du \] or, for a vector \[ \mathbb{V}(u) = \int (u - \mathbb{E}(u)) (u - \mathbb{E}(u))^T p(u) \, du \]

Today

Packages in Julia

Effect of beer drinking on attractiveness to mosquitos

Probability distributions

Statistics

Wrapup

- Wednesday: working with probability distributions
- Friday:
- Labs” 02: Working with tabular data in Julia
- Lab 01 due
*at start of class*

If you haven’t filled out the Doodle, please do so ASAP