```{julia}
Lab 02
Working with tabular data in Julia
1 Overview
In this lab we will learn how to work with tabular data in Julia. Specifically, you will get some experience using:
DataFrames.jl
to store tabular data as a DataFrameCSV.jl
to read CSV files and convert them to DataFramesDataFramesMeta.jl
to manipulate DataFramesPlots.jl
andStatsPlots.jl
to create visualizations
1.1 Exploratory modeling
Technically, what we are doing today is called exploratory modeling or exploratory data analysis. The latter is more common, but the former makes clear that all visualizations include some (usually implicit) conceptual model of the data. You will find more resources if you search for “exploratory data analysis”, however.
1.2 How to use this template
I have provided you with a lot of template code.
Any code block that starts with
is a “live” code block and will execute (run).
- These blocks will automatically run when you render the document
- You can click “Run Cell” to run the cell
- Make sure you understand what the code does and ask questions (in-person or on Canvas) if you don’t
- Do not change these blocks unless instructed
Any code block that starts with
```julia
is source code and will not run.
- You should change these blocks to complete the lab, filling in blanks and adding code as needed.
- To make it a cell that you can run, add the brackets
Changes you are expected to make are marked as follows:
This is an instruction.
1.3 Getting help
You may find the following Quarto documentation pages helpful:
Ask questions in class or post them on the Lab02 discussion on Canvas. The sections of this lab are numbered, so refer to the number when asking questions on Canvas to make it easier for us to help you.
2 Setup
2.1 Clone the respository
First, you’ll need to clone this repository to your computer. As with Lab 01, I recommend to use GitHub Desktop or the built-in Git support in VS Code. Remember to use the link from Canvas (classroom.github.com/...
).
2.2 Install required packages
- In VS Code, open the command palette (Windows:
Ctrl+Shift+P
, Mac:Cmd+Shift+P
) and selectJulia: Start REPL
. - In the Julia REPL, type
]
to enter the package manager. - Type
activate .
to activate the project environment. - Type
instantiate
to install the required packages. This may take a moment.1
2.3 Using
statement
In Julia we say using
to import a package. Typically we want to do this as early as possible in a script or notebook.
using CSV
using DataFrames
using DataFramesMeta
using Dates
using Plots
using StatsBase: mean
using StatsPlots
using Unitful
That this is a code block code and will run. If you have not yet installed the packages, you will see an error message. Don’t change this block – see instructions to install and instantiate.
2.4 Check
To make sure everything is working, you should open the command palette and select Quarto: Render HTML
. This will generate a HTML file from this notebook. This is a good way to check that everything is working before you start editing code.
3 Reading data
We will use the CSV.jl
package to read in our data. First, let’s define the file name
= "data/tidesandcurrents-8638610-1928-NAVD-GMT-metric.csv"; fname
- 1
- Recall the semi-colon at the end of the line suppresses output. This is a good habit to get into when defining variables.
Next, we will use the CSV.jl
package to read in the data.
= CSV.read(fname, DataFrame)
df first(df, 5)
- 1
-
CSV.read
is a function that takes two arguments: the file name and the type of object to read the data into. In this case, we want aDataFrame
. - 2
-
first
is a function that takes two arguments: the object to look at and the number of items to display. In this case, we want to look at the first 5 rows of theDataFrame
.
Row | Date Time | Water Level | Sigma | I | L |
---|---|---|---|---|---|
String31 | Float64 | Float64 | Int64 | Int64 | |
1 | 1928-01-01 00:00 | -0.547 | 0.0 | 0 | 0 |
2 | 1928-01-01 01:00 | -0.699 | 0.0 | 0 | 0 |
3 | 1928-01-01 02:00 | -0.73 | 0.0 | 0 | 0 |
4 | 1928-01-01 03:00 | -0.669 | 0.0 | 0 | 0 |
5 | 1928-01-01 04:00 | -0.516 | 0.0 | 0 | 0 |
This data comes from the NOAA Tides and Currents website, specifically for a station at Sewells Point, VA for the year 1928. NAVD refers to the North American Vertical Datum, which is a reference point for measuring sea level, and GMT refers to Greenwich Mean Time, which is the time zone used in the data (rather than local time).
3.1 Dates
We can see that our DataFrame has five columns, the first of which is “Date Time”. However, the “Date Time” column is being parsed as a string
. We want it to be a DateTime
object from the Dates
package. To do that, we need to tell Julia how the dates are formatted. We could then manually convert, but CSV.read
has a kewyord argument that we can use
= "yyyy-mm-dd HH:MM"
date_format = CSV.read(fname, DataFrame; dateformat=date_format)
df first(df, 3)
- 1
-
This is a string that tells Julia how the dates are formatted. For example,
1928-01-01 00:00
. See the documentation for more information. - 2
-
dateformat
is a keyword argument whiledate_format
is a variable whose value is"yyyy-mm-dd HH:MM"
. We could equivalently writedateformat="yyyy-mm-dd HH:MM"
.
Row | Date Time | Water Level | Sigma | I | L |
---|---|---|---|---|---|
DateTime | Float64 | Float64 | Int64 | Int64 | |
1 | 1928-01-01T00:00:00 | -0.547 | 0.0 | 0 | 0 |
2 | 1928-01-01T01:00:00 | -0.699 | 0.0 | 0 | 0 |
3 | 1928-01-01T02:00:00 | -0.73 | 0.0 | 0 | 0 |
3.2 Water levels
The next column is “Water Level”, which is the height of the water above the reference point (NAVD) in meters. We can see that this is being parsed as a float, which is what we want 👍. However, you have to know that the data is in meters rather than inches or feet or something else. To explicitly add information about the units, we can use the Unitful
package.
" Water Level"] .*= 1u"m"
df[!, first(df, 3)
Row | Date Time | Water Level | Sigma | I | L |
---|---|---|---|---|---|
DateTime | Quantity… | Float64 | Int64 | Int64 | |
1 | 1928-01-01T00:00:00 | -0.547 m | 0.0 | 0 | 0 |
2 | 1928-01-01T01:00:00 | -0.699 m | 0.0 | 0 | 0 |
3 | 1928-01-01T02:00:00 | -0.73 m | 0.0 | 0 | 0 |
- We select the column with water levels using its name. The
!
means “all rows”. Thus,df[!, " Water Level"]
is a vector of all the water levels stored. *=
means to multiply in place. For example, ifx=2
thenx *= 2
is equivalent tox = x * 2
..*=
is a vector syntax, meaning do the multiplication to each element of the vector individually.1u"m"
is aUnitful
object that represents 1 meter. We multiply the water levels by this to convert them to meters.
3.3 Subsetting and renaming
We want to only keep the first two (for more on the other three, see here). We can also rename the columns to make them easier to work with (spaces in variable names are annoying). To do this, we use the @rename
function:
= @rename(df, :datetime = $"Date Time", :lsl = $" Water Level"); df
- 1
-
The
$
is needed here because the right hand side is a string, not a symbol.
Then, we can use the @select
function to do select the columns we want. Notice how the first argument to select
is the DataFrame
and the subsequent arguments are column names. Notice also that our column names were strings ("Date Time"
), but we can also use symbols (:datetime
).
= @select(df, :datetime, :lsl)
df first(df, 3)
Row | datetime | lsl |
---|---|---|
DateTime | Quantity… | |
1 | 1928-01-01T00:00:00 | -0.547 m |
2 | 1928-01-01T01:00:00 | -0.699 m |
3 | 1928-01-01T02:00:00 | -0.73 m |
For more on what DataFramesMeta
can do, see this Tweet.
3.4 Writing a function
We have just done a lot of work to read in our data. However, this just gives us data for the year 1928. In fact, we have a CSV file for each year 1928-2021. To make sure we can read them each in excatly the same way, we want to write a function.
function read_tides(year::Int)
= "data/tidesandcurrents-8638610-$(year)-NAVD-GMT-metric.csv" # don't change this
fname = "yyyy-mm-dd HH:MM" # don't change this
date_format # your code here
# 1. read in the CSV file and save as a dataframe
# 2. convert the "Date Time" column to a DateTime object
# 3. convert the " Water Level" column to meters
# 4. rename the columns to "datetime" and "lsl"
# 5. select the "datetime" and "lsl" columns
# 6. return the dataframe
end
# print out the first 10 rows of the 1928 data
first(read_tides(1928), 10)
Fill out this function. Your function should implement the six steps indicated in the instructions. When it’s done, convert it to a live code block with: ```{julia}. When you run this code, it should print out the first 10 rows of the 1928 data. Make sure they look right!
3.5 Combining files
Now that we have the ability to read in the data corresponding to any year, we can read them all in and combine into a single DataFrame
. First, let’s read in all the data.
= 1928:2021 # all the years of data
years = # 1. call the read_tides function on each year
annual_data typeof(annual_data) # should be a vector of DataFrames
Next, we’ll use the vcat
function to combine all the data into a single DataFrame
.
= vcat(annual_data...) # don't change this
df first(df, 5)
last(df, 5) # check the last 5 years
- Call the
read_tides
function on each year - Turn the two code blocks in this section into live code blocks
- Run the code and make sure the first 5 rows and last five rows look right
4 Plots
Now we’ll create some plots of our data.
We have followed a bad naming convention for a good reason. When we read in the 1928 data we called it df
. When we read in the data from all years, we also called it df
(which will overwrite the old definition of the variable). This is normally bad practice! We want use clear, descriptive, and specific variable names! However, it means that we can make plots here using the 1928 data. Once you are able to read in the data for all years, these plots will automatically update to show data for all years. If you want to rename these variables, please go for it!
4.1 Time series plot
Let’s start with a simple time series plot of the water levels. Our data is collected hourly, so we have a lot of data points! Still, we can plot them all.
plot(
df.datetime,
df.lsl;="Water levels at Sewells Point, VA",
title="Water level",
ylabel=false,
label )
- 1
-
Because we are using the
Unitful
pacakge, the y-axis label will automatically include the units!!! - 2
- We are only plotting one “series” (data set), so we don’t need a legend.
4.2 Zooming in
Focusing on the entire time series means we can’t dig into the details. Let’s zoom in on a single month (October 1928) using the @subset
function.
= Dates.DateTime(1928, 10, 1, 0)
t_start = Dates.DateTime(1928, 10, 31, 23)
t_end = @subset(df, t_start .<= :datetime .<= t_end)
df_month first(df_month, 3)
Row | datetime | lsl |
---|---|---|
DateTime | Quantity… | |
1 | 1928-10-01T00:00:00 | 0.215 m |
2 | 1928-10-01T01:00:00 | 0.429 m |
3 | 1928-10-01T02:00:00 | 0.581 m |
Now we can plot it as above.
plot(
df_month.datetime,
df_month.lsl;="Water levels at Sewells Point, VA",
title="Water level",
ylabel=false,
label )
4.3 Instructions
Change the start and end dates to plot March 2020. What do you notice? :::
4.4 Groupby
An essential idea in working with tabular data (and other data formats) is “split-apply-combine”. Essentially: split the data into groups, apply some function to each group, and then combine the results.
We can use this workflow to answer an interesting question: what is the average water level for each month?2
:month] = Dates.month.(df.datetime)
df[!, dropmissing!(df, :lsl)
= groupby(df, :month)
df_bymonth = combine(df_bymonth, :lsl => mean => :lsl_avg); df_climatology
- 1
-
This creates a new column called
:month
that is the month of each observation. - 2
-
This will discard any rows in
df
that have a missing value of:lsl
. This is necessary because themean
function will returnmissing
if any of the values are missing. - 3
-
This creates a
GroupedDataFrame
object that contains all the data grouped by month. - 4
-
This takes the grouped data and calculates the mean of the
:lsl
column for each month. The general syntax iscombine(grouped_df, :column => function)
.
We can now plot the climatology.
plot(
df_climatology.month,
df_climatology.lsl_avg;=1:12,
xticks="Month",
xlabel="Average Water level",
ylabel=3,
linewidth=false,
label )
- 1
-
We can use
df.colname
instead ofdf[!, :colname]
. The latter is more robust but the former is easier to type. - 2
-
Setting
xticks
will set the x-axis ticks to the values in the vector. We can use this to make sure the x-axis ticks are labeled with the months. - 3
- We can set the line width to make the plot easier to read.
- Use the full dataset to plot the climatology using data from all years
- Next, create a new Markdown header (
## Groupby Day of Year
) plot the average water level for each day of the year (Dates.dayofyear
from 1 to 366). - What do you notice?
4.5 Boxplot
Boxplots are ways to visualize the distribution of data. They show the median (the line in the middle of the box), the interquartile range (the box), and the range of the data (the whiskers). Outliers are shown as dots. We can use the boxplot
function from the StatsPlots.jl
package:
boxplot(
:month],
df[!, :lsl];
df[!, =1:12,
xticks="Month",
xlabel="Water level",
ylabel=false,
label="Climatology",
title )
- 1
-
We are back to
df[!, :colname]
syntax. Both work! - 2
-
We can set the title using the
title
keyword argument.
Repeat this analysis grouping by year rather than month. What do you notice from the boxplots?
4.6 Submission
- Remove all of the instructions blocks (from the first
::: {.callout-important}
to the following:::
) - Remove all of the text I have written (including this block) so that all the text is your own. This makes it much easier to grade and to follow what is yours.
- Do not delete any of the headers
- Thus, the beginning of the document should have the headers for the overview, exploratory modeling, instructions, and getting help, but these should not have any text
- This will ensure all documents have the same numbering to make it easier to troubleshoot any issues
- Do not delete any of the code
- Do not delete any of the headers
- Make sure your code runs (click the “Run All” button in the command palette)
- Render your notebook as HTML (optional)
- Open comand palette
- Select
Quarto: Render HTML
- This will create a website that you can view in your browser. The address will be in your terminal like
Watching files for changes Browse at http://localhost:4200/labs/lab02/instructions.html
- Make sure your notebook looks right in the browser.
- Render your hnotebook as DOCX (required)
- Open comand palette
- Select
Quarto: Render DOCX
- It may give you a link to click on in order to download the file. Click it and it will be downloaded to your computer (probably in your Downloads folder)
- Submit the
.docx
file to Canvas.- Proofread before you submit!
Footnotes
Julia precompiles packages when they are installed, and (to a lesser extent) when they are first used. The first time you use a package it may take a moment to load. This is normal, nothing to worry about, and rapidly improving.↩︎
To do a better job, we should separate out the long-term trend from the seasonal cycle. This is called de-trending and is a common technique in climate science. We can worry more about this later.↩︎