Lab 2: Julia Data Handling

DataFrames.jl, Plotting with CairoMakie, TX precipitation datasets, Basic exploratory data analysis

Author

CEVE 543 Fall 2025

Published

Fri., Sep. 5

1 Learning Goals

  1. Download and parse real weather data (annual maximum rainfall) from NOAA
  2. Create DataFrames to organize station metadata and rainfall measurements
  3. Build interactive maps and time series plots
  4. Calculate distances between geographic points
  5. Compare data across multiple weather stations

2 Do Before Lab

  1. Set up GitHub Copilot. You’ll use GitHub Copilot, an AI assistant that helps write code. It’s especially useful for plotting and data-manipulation syntax.
    1. Visit GitHub Copilot and sign up using your education account
    2. Install the GitHub Copilot extension in VS Code
    3. Link your VS Code to your GitHub account when prompted
    4. You should see Copilot suggestions appear as you type code
  2. Accept the assignment. Use the GitHub Classroom link posted on Canvas to “accept” the assignment, which will give you your own GitHub repository.
  3. Clone your repository to your computer using GitHub Desktop or the command line.
  4. Open the repository in VS Code. You can do this from within VS Code, from GitHub Desktop, or from your command line.
  5. Activate the Julia environment. In VS Code, open the command palette (View → Command Palette) and type “Julia: Start REPL”. Alternatively, run julia --project=. in the terminal, then run using Pkg; Pkg.instantiate() to install required packages. The easiest method is to type ] in the REPL to enter package mode, then type instantiate.
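If you prefer to script the environment setup, here is a minimal sketch using Julia's built-in Pkg API (equivalent to the REPL package-mode commands above):

```julia
using Pkg

Pkg.activate(".")    # activate the project environment in the repository root
Pkg.instantiate()    # install the exact package versions pinned in Manifest.toml
Pkg.status()         # list installed packages to confirm everything resolved
```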

3 Do During Lab

  1. Verify rendering works. Run quarto render index.qmd in your terminal; it should create both PDF and HTML documents.

  2. Work through the lab step by step. You can either:

    • Run code line by line in your Julia REPL, or
    • Run quarto preview index.qmd for a live preview that updates automatically

    Make sure you can run each line of code and use the annotations to understand what each line does.

  3. Try these modifications as you work through the lab:

    • Change the target location - Pick a different city and find the nearest weather stations
    • Adjust the number of stations - Plot more or fewer stations in the comparison
    • Modify the map appearance - Try different color schemes or marker sizes
    • Explore different time periods - Focus on specific decades or years
  4. Complete the analysis question at the end of the document.

4 Background

We are deliberately not providing a gentle introduction to Julia here, because everyone is at a different level of familiarity with the language. Instead, refer to the course website’s Julia learning resources (Julia for Nervous Beginners is a good place to start if that sounds like you!). You aren’t expected to start working on the problem set yet, but you should get up to speed with Julia – if you have exposure to Julia or are a proficient R/Matlab/Python programmer, this will be a period of light work for you; if you’ve never seen Julia then it will be heavier work.

Understanding Code Annotations: This lab uses interactive code annotations to explain what each line does. You’ll see numbers like # <1> in the code; hover over them to see explanations in popup tooltips.

5 Code

5.1 Loading Required Packages

First, we need to load all the Julia packages we’ll use in this analysis:

# Core data manipulation packages
using CSV
using DataFrames
using Dates
using Distributions
using Downloads
using TidierData
using TidierFiles
using Unitful

ENV["DATAFRAMES_ROWS"] = 5
1. Load CSV package for reading the NOAA text file into Julia
2. Load DataFrames for working with tabular data (like Excel spreadsheets in code)
3. Load Dates to properly parse and handle time information in weather data
4. Load Distributions for probability distributions (we use the GEV later in this lab)
5. Load Downloads to fetch files from NOAA’s web servers
6. Load TidierData for modern data manipulation with pipeline syntax (@chain, @filter, @select)
7. Load TidierFiles for simplified file reading and writing operations
8. Load Unitful to handle units of measurement (inches, kilometers, degrees) properly
9. Set the maximum number of rows to display in DataFrames by default

Next, we load the plotting packages and configure our visualization backend:

# Plotting and visualization packages
using GLMakie
using CairoMakie
using ColorSchemes
using GeoMakie
using Makie.Unitful

# Configure plotting backend - CairoMakie for high-quality static plots
CairoMakie.activate!(type = "svg")
# If you want interactive plots, uncomment the line below instead:
# GLMakie.activate!()

# Configure Unitful display preferences to use inches
rainfall_conversion = Makie.UnitfulConversion(u"inch")
1. Load GLMakie for interactive plots with GPU acceleration (optional here)
2. Load CairoMakie for high-quality static plots (the backend we activate below)
3. Load ColorSchemes to access professional color palettes (viridis, plasma, etc.)
4. Load GeoMakie for geographic visualizations and map projections
5. Load Makie.Unitful to integrate our unit system with the plotting functions
6. Activate CairoMakie as the plotting backend for high-quality static plots
7. Alternative activation for interactive GLMakie plots if needed
8. Create unit converter to display rainfall in inches instead of default “deca-inches”

Finally, we load our custom utility functions that contain the data parsing logic:

# Load custom utility functions
include("util.jl")
1. Load our custom parsing functions from util.jl (keeps the main notebook clean)

5.2 Downloading Weather Data

Next, we’ll download precipitation data from the National Weather Service:

fname = "dur01d_ams_na14v11.txt"
url = "https://hdsc.nws.noaa.gov/pub/hdsc/data/tx/dur01d_ams_na14v11.txt"

if !isfile(fname)
    Downloads.download(url, fname)
end
1. Set the local filename for our Texas precipitation data
2. NOAA’s web address for the Texas Atlas-14 precipitation dataset
3. Only download if we don’t already have the file (the ! means “not”)
4. Download the file from NOAA’s server to our local directory

This prevents re-downloading the file every time we run the code, which is faster and more polite to NOAA’s servers.

5.3 Parsing the Weather Data

Now we’ll use our custom function to parse the downloaded data into a useful format:

stations, rainfall_data = read_noaa_data(fname)

display(stations)
display(rainfall_data)
1. Call our custom function that parses the NOAA file into two DataFrames
2. Display the stations DataFrame (shows first 5 rows due to the ENV setting)
3. Display the rainfall DataFrame to see the data structure
816×7 DataFrame
811 rows omitted
Row stnid noaa_id name state latitude longitude years_of_data
Int64 String String String Float64 Float64 Int64
1 1 60-0011 CLEAR CK AT BAY AREA BLVD TX 29.4977 -95.1599 31
2 2 60-0019 TURKEY CK AT FM 1959 TX 29.5845 -95.1869 31
3 3 60-0022 ARMAND BYU AT GENOARED BLF RD TX 29.6345 -95.1123 31
815 815 87-0031 SECO CREEK AT MILLER RANCH TX 29.5731 -99.4028 20
816 816 99-2048 COTULLA TX 28.4567 -99.2183 116
61925×4 DataFrame
61920 rows omitted
Row stnid date year rainfall
Int64 Date Int64 Quantity…?
1 1 1987-06-11 1987 6.31 inch
2 1 1988-09-02 1988 5.46 inch
3 1 1989-08-01 1989 11.39 inch
61924 816 2016-08-20 2016 3.44 inch
61925 816 2017-09-25 2017 2.72 inch

The read_noaa_data function separates the messy text file into:

  • stations: A DataFrame with each station’s location, name, and metadata
  • rainfall_data: A DataFrame with all rainfall measurements, using stnid to link to stations
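Because stnid appears in both tables, you can join them whenever you want station metadata next to each measurement. A sketch using DataFrames.jl’s innerjoin (the name rainfall_with_meta is just illustrative):

```julia
using DataFrames

# Attach each station's name and coordinates to every rainfall observation
rainfall_with_meta = innerjoin(
    rainfall_data,
    stations[:, [:stnid, :name, :latitude, :longitude]];
    on = :stnid,
)
```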

5.4 Modern Data Manipulation with TidierData.jl

Now let’s explore our data using TidierData.jl, which provides a modern, readable syntax for data manipulation. TidierData brings tidyverse-style operations to Julia with intuitive verb names and pipeline workflows.

For complete documentation, visit: https://tidierorg.github.io/Tidier.jl/dev/

5.4.1 Basic TidierData Operations

Let’s start with some basic data exploration using TidierData syntax:

@chain stations begin
    @select(stnid, noaa_id, name, state, latitude, longitude, years_of_data)
end
1. Start a pipeline where each operation flows into the next (like %>% in R)
2. Choose specific columns to display (similar to SQL SELECT)
816×7 DataFrame
811 rows omitted
Row stnid noaa_id name state latitude longitude years_of_data
Int64 String String String Float64 Float64 Int64
1 1 60-0011 CLEAR CK AT BAY AREA BLVD TX 29.4977 -95.1599 31
2 2 60-0019 TURKEY CK AT FM 1959 TX 29.5845 -95.1869 31
3 3 60-0022 ARMAND BYU AT GENOARED BLF RD TX 29.6345 -95.1123 31
815 815 87-0031 SECO CREEK AT MILLER RANCH TX 29.5731 -99.4028 20
816 816 99-2048 COTULLA TX 28.4567 -99.2183 116

Note that our rainfall measurements have explicit units (you’ll see inch in the output) thanks to the Unitful.jl package. This makes unit conversions automatic and error-free. For example, to convert to centimeters:

@chain rainfall_data begin
    @mutate(rainfall = uconvert(u"cm", rainfall))
end
1. Convert rainfall measurements from inches to centimeters automatically, with proper unit handling
61925×4 DataFrame
61920 rows omitted
Row stnid date year rainfall
Int64 Date Int64 Quantity…?
1 1 1987-06-11 1987 16.0274 cm
2 1 1988-09-02 1988 13.8684 cm
3 1 1989-08-01 1989 28.9306 cm
61924 816 2016-08-20 2016 8.7376 cm
61925 816 2017-09-25 2017 6.9088 cm
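More generally, Unitful keeps unit arithmetic honest: compatible units convert automatically, while incompatible ones raise an error instead of silently combining. A small sketch:

```julia
using Unitful

x = 1.0u"inch"
uconvert(u"cm", x)       # 2.54 cm (1 inch is exactly 2.54 cm by definition)

1.0u"inch" + 1.0u"cm"    # compatible units promote automatically

# 1.0u"inch" + 1.0u"s"   # incompatible units: throws a DimensionError
```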

5.4.2 Filtering and Transforming Data

# Find stations in Texas with lots of data
@chain stations begin
    @filter(state == "TX" && years_of_data > 50)
    @select(noaa_id, name, years_of_data)
    @arrange(desc(years_of_data))
end
1. Start a pipeline using our stations DataFrame
2. Keep only rows where state is “TX” AND years_of_data > 50
3. Choose which columns to show in our result
4. Sort by years_of_data in descending order (most data first)
638×3 DataFrame
633 rows omitted
Row noaa_id name years_of_data
String String Int64
1 41-0420 AUSTIN 169
2 41-7622 RIO GRANDE CITY 168
3 79-0043 BROWNSVILLE 168
637 41-6119 MT VERNON 51
638 41-8761 SUNRAY 4 SW 51
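For comparison, the same query can be written with base DataFrames.jl functions (not required for the lab; long_tx is an illustrative name):

```julia
using DataFrames

long_tx = sort(
    subset(stations, :state => ByRow(==("TX")), :years_of_data => ByRow(>(50))),
    :years_of_data;
    rev = true,
)[:, [:noaa_id, :name, :years_of_data]]
```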

5.4.3 Adding Calculated Columns

# Create categorical data quality labels
@chain stations begin
    @mutate(
        data_category = case_when(
            years_of_data >= 60 => "Long-term (60+ years)",
            years_of_data >= 30 => "Medium-term (30-59 years)",
            years_of_data >= 10 => "Short-term (10-29 years)",
            true => "Very short (<10 years)",
        )
    )
    @select(noaa_id, name, years_of_data, data_category)
end
1. Start our data pipeline with the stations DataFrame
2. Create or modify columns in the DataFrame
3. Use conditional logic to create a new categorical column
4. If 60+ years of data, categorize as “Long-term”
5. If 30-59 years of data, categorize as “Medium-term”
6. If 10-29 years of data, categorize as “Short-term”
7. Default case for stations with less than 10 years
8. Select only the columns we want to display
816×4 DataFrame
811 rows omitted
Row noaa_id name years_of_data data_category
String String Int64 String
1 60-0011 CLEAR CK AT BAY AREA BLVD 31 Medium-term (30-59 years)
2 60-0019 TURKEY CK AT FM 1959 31 Medium-term (30-59 years)
3 60-0022 ARMAND BYU AT GENOARED BLF RD 31 Medium-term (30-59 years)
815 87-0031 SECO CREEK AT MILLER RANCH 20 Short-term (10-29 years)
816 99-2048 COTULLA 116 Long-term (60+ years)

5.4.4 Grouping and Summarizing Data

# Analyze temporal coverage of rainfall data
obs_per_year = @chain rainfall_data begin
    @group_by(year)
    @summarise(
        station_count = n(),
    )
    @arrange(desc(station_count))
end
1. Start with the rainfall DataFrame to see which years have the most station coverage
2. Group observations by year to count stations active each year
3. Calculate summary statistics for each year group
4. Count how many stations reported data in each year
5. Sort years by station count (years with the most stations first)
170×2 DataFrame
165 rows omitted
Row year station_count
Int64 Int64
1 1995 735
2 1996 735
3 1997 735
169 1850 4
170 1849 1
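The base DataFrames.jl equivalent of this group-and-summarize pipeline uses groupby plus combine; a sketch (obs_per_year_df is an illustrative name):

```julia
using DataFrames

obs_per_year_df = sort(
    combine(groupby(rainfall_data, :year), nrow => :station_count),
    :station_count;
    rev = true,
)
```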

5.5 Creating a Map of Weather Stations

Let’s visualize all the weather stations on a map, with colors showing how many years of data each station has.

For more advanced mapping techniques, see: https://geo.makie.org/dev/introduction

For complete Makie plotting documentation, visit: https://docs.makie.org/stable/

function plot_stations()

    # Create map plot colored by years of data
    fig = Figure(size = (800, 600))
    ga = GeoAxis(fig[1, 1]; source = "+proj=latlong", dest = "+proj=merc",
        title = "Texas Atlas-14 Stations", xgridvisible = false, ygridvisible = false)

    # Add US states (white with black borders)
    states = GeoMakie.naturalearth("admin_1_states_provinces_lakes", 110)
    poly!(ga, states.geometry;
        strokecolor = :black, strokewidth = 1, color = :white)

    # Create scatter plot with viridis colormap
    scatter!(ga, stations.longitude, stations.latitude;
        color = stations.years_of_data, colormap = :viridis, markersize = 10)

    # Set plot bounds based on data bounds plus delta
    delta = 0.3  # degrees of padding around the data bounds
    min_lon, max_lon = extrema(stations.longitude)
    min_lat, max_lat = extrema(stations.latitude)

    xlims!(ga, min_lon - delta, max_lon + delta)
    ylims!(ga, min_lat - delta, max_lat + delta)

    # Add colorbar
    Colorbar(fig[1, 2], label = "Years of Data", colormap = :viridis,
        limits = (minimum(stations.years_of_data), maximum(stations.years_of_data)))

    return fig
end
plot_stations()
1. Create a GeoAxis with a map projection: converts lat/lon coordinates to Mercator for proper display
2. Download US state boundaries from Natural Earth (110 = medium resolution) for the map background
3. Plot weather stations colored by years of data using the viridis color scheme

5.6 Finding Stations Near a Target Location

Let’s say we want to study rainfall at a particular point. We’ll use Rice University in Houston as our example:

target_location = "Rice University"
target_lon = -95.404970
target_lat = 29.718952
1. Descriptive name for our target location (change this for different places!)
2. Rice University’s longitude coordinate (negative means west of Greenwich)
3. Rice University’s latitude coordinate (positive means north of the equator)

5.7 Calculating Distances to Weather Stations

Now we need to calculate how far each weather station is from our target location using the Haversine formula:

# Haversine formula for great circle distance with units
function calc_distance(lon1, lat1, lon2, lat2)
    R = 6378.0u"km"

    # Convert degrees to radians
    φ1 = deg2rad(lat1)
    φ2 = deg2rad(lat2)
    Δφ = deg2rad(lat2 - lat1)
    Δλ = deg2rad(lon2 - lon1)

    # Haversine formula
    a = sin(Δφ / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    c = 2 * atan(sqrt(a), sqrt(1 - a))

    return R * c
end

stations = @chain stations begin
    @mutate(distance_km = calc_distance(longitude, latitude, !!target_lon, !!target_lat))
    @arrange(distance_km)
end
stations
1. Define a function that calculates great circle distance between two lat/lon points
2. Earth’s radius in kilometers (using proper units!)
3. Convert the first latitude from degrees to radians (φ is Greek phi)
4. Convert the second latitude to radians
5. Calculate the difference in latitudes (Δ is Greek delta, meaning “change in”)
6. Calculate the difference in longitudes (λ is Greek lambda)
7. First part of the Haversine formula: the angular distance component
8. Complete the great circle distance calculation
9. Multiply by Earth’s radius to get distance in kilometers
10. Start a TidierData pipeline with our stations DataFrame
11. Add a distance column, using !! to inject external variables into the TidierData expression
12. Sort the stations by distance (closest first)
13. Display the sorted stations (shows the closest ones first due to the ENV row limit)
816×8 DataFrame
811 rows omitted
Row stnid noaa_id name state latitude longitude years_of_data distance_km
Int64 String String String Float64 Float64 Int64 Quantity…
1 780 79-0056 HOUSTON WB CITY TX 29.7622 -95.3593 129 6.53163 km
2 377 41-4321 HOUSTON HEIGHTS TX 29.7914 -95.4261 62 8.31921 km
3 380 41-4326 HOUSTON-PORT TX 29.7456 -95.28 29 12.4388 km
815 807 79-0125 EL PASO TX 31.7587 -106.484 167 1083.55 km
816 428 41-4931 LA TUNA 1 S TX 31.98 -106.597 75 1098.28 km

The Haversine formula accounts for Earth’s curvature, giving much more accurate distances than simple straight-line calculations.
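A quick way to sanity-check a distance function is against a known benchmark: one degree of latitude spans roughly 111 km anywhere on the globe. Here is a self-contained, unit-free copy of the same formula (haversine_km is an illustrative name; the lab's calc_distance with Unitful is the version to use):

```julia
# Haversine distance in kilometers, without Unitful, for quick checks
function haversine_km(lon1, lat1, lon2, lat2)
    R = 6378.0  # Earth's radius in km, as above
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = deg2rad(lat2 - lat1)
    Δλ = deg2rad(lon2 - lon1)
    a = sin(Δφ / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    return R * 2 * atan(sqrt(a), sqrt(1 - a))
end

haversine_km(-95.0, 29.0, -95.0, 30.0)  # ≈ 111 km (one degree of latitude)
```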

5.8 Examining the Closest Weather Station

Let’s focus on the closest weather station and look at its rainfall data:

closest_station = first(stations)
closest_stnid = closest_station.stnid
closest_rainfall = @chain rainfall_data begin
    @filter(stnid == !!closest_stnid)
end
closest_rainfall
1. Filter the rainfall data using !! to inject the external variable into TidierData
129×4 DataFrame
124 rows omitted
Row stnid date year rainfall
Int64 Date Int64 Quantity…?
1 780 1889-05-18 1889 3.18 inch
2 780 1890-04-02 1890 4.75 inch
3 780 1891-01-08 1891 5.89 inch
128 780 2016-04-17 2016 5.69 inch
129 780 2017-08-26 2017 14.7 inch
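Before plotting, a few summary statistics give a quick feel for the record; a sketch using the Statistics standard library (assumes the closest_rainfall DataFrame from above):

```julia
using Statistics

# Drop missings so the summary functions work on a plain vector
vals = collect(skipmissing(closest_rainfall.rainfall))
println("observations: ", length(vals))
println("mean:   ", mean(vals))
println("median: ", median(vals))
println("max:    ", maximum(vals))
```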

5.9 Plotting a Time Series

Now let’s create a time series plot to see how rainfall has changed over time at this station:

function plot_time_series(station_row, rainfall_df)
    # Create time series plot of rainfall data
    fig = Figure(size = (800, 400))
    ax = Axis(fig[1, 1],
        ylabel = "Annual Maximum Rainfall",
        title = "$(station_row.noaa_id): $(station_row.name)",
        dim2_conversion = rainfall_conversion)

    # Plot the time series
    lines!(ax, rainfall_df.date, rainfall_df.rainfall)
    scatter!(ax, rainfall_df.date, rainfall_df.rainfall,
        markersize = 10, marker = :circle, strokewidth = 2, color = :transparent)

    fig
end

plot_time_series(closest_station, closest_rainfall)
1. Use string interpolation $() to insert the station ID and name into the plot title
2. Apply the unit converter to display rainfall in inches instead of deca-inches
3. Plot using the date column to show the precise timing of annual maximum events
4. Add hollow circles with transparent fill to emphasize individual data points
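With the CairoMakie backend active you can also write a figure to disk; Makie's save function infers the format from the file extension. A sketch (the filename is just an example):

```julia
fig = plot_time_series(closest_station, closest_rainfall)
save("closest_station_timeseries.png", fig)  # .svg or .pdf also work with CairoMakie
```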

5.10 Comparing Multiple Nearby Stations

This is interesting! How does this compare to other nearby stations? Let’s plot several stations together:

# Compare with other nearby stations
function plot_nearby_comparison(stations_df, rainfall_data; n_stations = 4)
    fig = Figure(size = (800, 400))
    ax = Axis(fig[1, 1],
        ylabel = "Annual Maximum Rainfall",
        title = "Comparison of $(n_stations) Nearest Stations to $(target_location)",
        dim2_conversion = rainfall_conversion)

    colors = [get(colorschemes[:viridis], i / n_stations) for i in 0:n_stations-1]

    for (i, station) in enumerate(eachrow(stations_df[1:n_stations, :]))
        station_id = station.stnid
        rainfall = @chain rainfall_data begin
            @filter(stnid == !!station_id)
        end
        lines!(ax, rainfall.year, rainfall.rainfall,
            color = colors[i], linewidth = 2,
            label = "$(station.noaa_id)")
        scatter!(ax, rainfall.year, rainfall.rainfall,
            color = colors[i], markersize = 10)
    end
    xlims!(ax, 1950, 2025)

    axislegend(ax, position = :lt)
    return fig
end

plot_nearby_comparison(stations, rainfall_data; n_stations = 4)
1. Generate evenly spaced colors from the viridis palette for each station
2. Loop through the first n_stations closest weather stations
3. Use !! to inject the loop variable into the TidierData filter expression
4. Add a legend in the top-left corner to identify the different stations

5.11 GEV Distribution and Return Period Analysis

Now we’ll work with a GEV distribution to create return period plots and compare with observed data using Weibull plotting positions.

Your task: Experiment with different GEV parameters to see how well you can match the observed data from your closest station. You’ll need to systematically test each parameter to answer the questions at the end. Start with these parameters, then modify them one at a time:

# GEV parameters for Houston area (units: inches)
# Try modifying these values to better fit your data!
μ = 4.0    # Location parameter - shifts the distribution left/right
σ = 1.3    # Scale parameter - controls spread (must be > 0)
ξ = 0.15   # Shape parameter - controls tail behavior

houston_gev = GeneralizedExtremeValue(μ, σ, ξ)
1. Create a GEV distribution with the specified parameters
Distributions.GeneralizedExtremeValue{Float64}(μ=4.0, σ=1.3, ξ=0.15)

5.12 Return Period Plot

# Function to calculate return levels from GEV
function gev_return_level(gev_dist, return_period)
    μ, σ, ξ = params(gev_dist)
    if abs(ξ) < 1e-6  # Gumbel case
        return μ - σ * log(-log(1 - 1 / return_period))
    else
        return μ - (σ / ξ) * (1 - (-log(1 - 1 / return_period))^(-ξ))
    end
end

# Create return period plot
function plot_return_periods(gev_dist)
    fig = Figure(size = (800, 400))
    ax = Axis(fig[1, 1],
        xlabel = "Return Period (years)",
        ylabel = "Return Level (inches)",
        title = "GEV Return Period Plot",
        xscale = log10)

    # Plot curve and highlight standard return periods
    T_smooth = 10 .^ range(log10(1.1), log10(250), length = 100)
    levels_smooth = [gev_return_level(gev_dist, T) for T in T_smooth]
    lines!(ax, T_smooth, levels_smooth, color = :blue, linewidth = 2, label = "GEV Model")

    return_periods = [5, 10, 25, 50, 100, 250]
    ax.xticks = return_periods
    return fig
end

plot_return_periods(houston_gev)
1. Extract GEV parameters from the distribution object
2. Gumbel formula for the ξ ≈ 0 case
3. General GEV return level formula
4. Log scale for the x-axis
5. Standard return periods for flood analysis
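Our gev_return_level is the GEV quantile function in disguise: the T-year return level is the (1 - 1/T) quantile of the distribution. Distributions.jl can therefore serve as a cross-check; a sketch:

```julia
using Distributions

d = GeneralizedExtremeValue(4.0, 1.3, 0.15)
T = 100

# 100-year return level straight from the quantile function
level_from_quantile = quantile(d, 1 - 1 / T)

# Same value from the closed-form expression used in gev_return_level
μ, σ, ξ = params(d)
level_from_formula = μ - (σ / ξ) * (1 - (-log(1 - 1 / T))^(-ξ))
```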

5.13 Weibull Plotting Positions - Comparing Theory vs Reality

Now let’s see how well your chosen GEV parameters match the actual observed data! The red dots show the empirical return periods calculated from your station’s data using Weibull plotting positions.

What to look for as you experiment:

  • Do the red dots follow the blue GEV curve?
  • Are there systematic deviations at high or low return periods?
  • How does each parameter change affect the fit?

Systematic experimentation approach:

  1. First, try increasing μ by 1-2 inches, then decreasing it. Note the effect.
  2. Next, try increasing σ by 0.5, then decreasing it. Note the effect.
  3. Finally, try changing ξ to 0.0 (Gumbel), then to -0.1 (light tail), then 0.3 (heavy tail). Note the effect.
  4. Find your best combination!

# Weibull plotting positions
function weibull_plotting_positions(data)
    clean_data = sort(collect(skipmissing(data)))
    n = length(clean_data)
    plotting_positions = [i / (n + 1) for i in 1:n]
    empirical_return_periods = [1 / (1 - p) for p in plotting_positions]
    return clean_data, empirical_return_periods
end

# Plot empirical vs theoretical
function plot_empirical_vs_theoretical(station_data, gev_dist, station_info)
    # Start with the basic return period plot
    fig = plot_return_periods(gev_dist)
    ax = fig.content[1]  # Get the actual Axis object

    # Update title to include station info
    ax.title = "Empirical vs Theoretical Return Periods\n$(station_info.noaa_id): $(station_info.name)"

    # Add empirical data with Weibull positions
    emp_levels, emp_periods = weibull_plotting_positions(station_data.rainfall)
    scatter!(ax, emp_periods, emp_levels,
        color = :red, markersize = 10,
        label = "Observed Data (Weibull positions)")

    axislegend(ax, position = :rb)

    return fig
end

plot_empirical_vs_theoretical(closest_rainfall, houston_gev, closest_station)
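As a concrete check of the plotting-position logic: with n observations, the i-th smallest value gets non-exceedance probability i/(n+1) and empirical return period (n+1)/(n+1-i). A tiny worked example (self-contained copy of the function above):

```julia
function weibull_plotting_positions(data)
    clean_data = sort(collect(skipmissing(data)))
    n = length(clean_data)
    plotting_positions = [i / (n + 1) for i in 1:n]
    empirical_return_periods = [1 / (1 - p) for p in plotting_positions]
    return clean_data, empirical_return_periods
end

vals, T = weibull_plotting_positions([3.2, 1.1, 5.0, 2.4])
# vals == [1.1, 2.4, 3.2, 5.0]
# T    ≈  [1.25, 1.67, 2.5, 5.0]  (the largest value gets a 5-year empirical return period)
```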

6 Analysis Questions

6.2 GEV Parameter Experimentation

Starting Parameters vs Final Parameters:

Record your starting parameters and the final parameters you chose after experimentation:

  • Starting: μ = 4.0, σ = 1.3, ξ = 0.15
  • Final: μ = …, σ = …, ξ = …

Parameter Effects:

What happened when you increased/decreased μ (location parameter)?

Test values like μ = 2.0, 4.0, 6.0. Describe how the entire GEV curve shifted and how this affected the fit to your red data points.

What happened when you increased/decreased σ (scale parameter)?

Test values like σ = 0.8, 1.3, 2.0. Describe how the curve became steeper/flatter and how this affected the spread of return levels.

What happened when you changed ξ (shape parameter)?

Test ξ = -0.1 (light tail), ξ = 0.0 (Gumbel), ξ = 0.15, and ξ = 0.3 (heavy tail). Focus on how the high return period end (right side) of the curve changed.

Final Assessment:

How well does your best-fit GEV distribution represent the closest station’s data?

Do the red dots follow your blue curve? Where are the biggest deviations?

Additional observations:

Optional: Add any other interesting patterns you notice in the return period analysis.