DataFrames.jl, Plotting with CairoMakie, TX precipitation datasets, Basic exploratory data analysis
Author
CEVE 543 Fall 2025
Published
Fri., Sep. 5
1 Learning Goals
Download and parse real weather data (annual maximum rainfall) from NOAA
Create DataFrames to organize station metadata and rainfall measurements
Build interactive maps and time series plots
Calculate distances between geographic points
Compare data across multiple weather stations
2 Do Before Lab
Set up GitHub Copilot. You’ll use GitHub Copilot - an AI assistant that helps write code. It’s especially useful for plotting and data manipulation syntax.
Visit GitHub Copilot and sign up using your education account
Install the GitHub Copilot extension in VS Code
Link your VS Code to your GitHub account when prompted
You should see Copilot suggestions appear as you type code
Accept the assignment. Use the GitHub Classroom link posted on Canvas to “accept” the assignment, which will give you your own GitHub repository.
Clone your repository to your computer using GitHub Desktop or the command line.
Open the repository in VS Code directly from VS Code, GitHub Desktop, or your command line.
Activate the Julia environment. In VS Code, open the command palette (View → Command Palette) and type “Julia: Start REPL”. Alternatively, run julia --project=. in the terminal, then run using Pkg; Pkg.instantiate() to install required packages. The easiest method is to type ] in the REPL to enter package mode, then type instantiate.
3 Do During Lab
Verify rendering works. Run quarto render index.qmd in your terminal - it should create both PDF and HTML documents.
Work through the lab step by step. You can either:
Run code line by line in your Julia REPL, or
Run quarto preview index.qmd for a live preview that updates automatically
Make sure you can run each line of code and use the annotations to understand what each line does.
Try these modifications as you work through the lab:
Change the target location - Pick a different city and find the nearest weather stations
Adjust the number of stations - Plot more or fewer stations in the comparison
Modify the map appearance - Try different color schemes or marker sizes
Explore different time periods - Focus on specific decades or years
Complete the analysis question at the end of the document.
4 Background
We are deliberately not providing a gentle introduction to Julia here, because everyone is at a different level of familiarity with the language. Instead, refer to the course website’s Julia learning resources (Julia for Nervous Beginners is a good place to start if that sounds like you!). You aren’t expected to start working on the problem set yet, but you should get up to speed with Julia – if you have exposure to Julia or are a proficient R/Matlab/Python programmer, this will be a period of light work for you; if you’ve never seen Julia then it will be heavier work.
Understanding Code Annotations: This lab uses interactive code annotations to explain what each line does. You’ll see numbers like # <1> in the code - hover over them to see explanations in popup tooltips.
5 Code
5.1 Loading Required Packages
First, we need to load all the Julia packages we’ll use in this analysis:
# Core data manipulation packagesusingCSVusingDataFramesusingDatesusingDistributionsusingDownloadsusingTidierDatausingTidierFilesusingUnitfulENV["DATAFRAMES_ROWS"] =5
1
Load CSV package for reading the NOAA text file into Julia
2
Load DataFrames for working with tabular data (like Excel spreadsheets in code)
3
Load Dates to properly parse and handle time information in weather data
4
Load Downloads to fetch files from NOAA’s web servers
5
Load TidierData for modern data manipulation with pipeline syntax (@chain, @filter, @select)
6
Load TidierFiles for simplified file reading and writing operations
7
Load Unitful to handle units of measurement (inches, kilometers, degrees) properly
8
Set the maximum number of rows to display in DataFrames by default
Next, we load the plotting packages and configure our visualization backend:
# Plotting and visualization packagesusingGLMakieusingCairoMakieusingColorSchemesusingGeoMakieusingMakie.Unitful# Configure plotting backend - CairoMakie for high-quality static plotsCairoMakie.activate!(type="svg")# If you want interactive plots, uncomment the line below instead:# GLMakie.activate!()# Configure Unitful display preferences to use inchesrainfall_conversion = Makie.UnitfulConversion(u"inch")
1
Load GLMakie for interactive plots with GPU acceleration (default choice)
2
Load CairoMakie as backup for high-quality static plots if GLMakie fails
3
Load ColorSchemes to access professional color palettes (viridis, plasma, etc.)
4
Load GeoMakie for geographic visualizations and map projections
5
Load Makie.Unitful to integrate our unit system with the plotting functions
6
Activate CairoMakie as the plotting backend for high-quality static plots
7
Alternative activation for interactive plots if needed
8
Create unit converter to display rainfall in inches instead of default “deca-inches”
Finally, we load our custom utility functions that contain the data parsing logic:
# Load custom utility functionsinclude("util.jl")
1
Load our custom parsing functions from util.jl (keeps the main notebook clean)
5.2 Downloading Weather Data
Next, we’ll download precipitation data from the National Weather Service:
Call our custom function that parses the NOAA file into two DataFrames
2
Display the stations DataFrame (shows first 5 rows due to ENV setting)
3
Display the rainfall DataFrame to see the data structure
816×7 DataFrame
811 rows omitted
Row
stnid
noaa_id
name
state
latitude
longitude
years_of_data
Int64
String
String
String
Float64
Float64
Int64
1
1
60-0011
CLEAR CK AT BAY AREA BLVD
TX
29.4977
-95.1599
31
2
2
60-0019
TURKEY CK AT FM 1959
TX
29.5845
-95.1869
31
3
3
60-0022
ARMAND BYU AT GENOARED BLF RD
TX
29.6345
-95.1123
31
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
815
815
87-0031
SECO CREEK AT MILLER RANCH
TX
29.5731
-99.4028
20
816
816
99-2048
COTULLA
TX
28.4567
-99.2183
116
61925×4 DataFrame
61920 rows omitted
Row
stnid
date
year
rainfall
Int64
Date
Int64
Quantity…?
1
1
1987-06-11
1987
6.31 inch
2
1
1988-09-02
1988
5.46 inch
3
1
1989-08-01
1989
11.39 inch
⋮
⋮
⋮
⋮
⋮
61924
816
2016-08-20
2016
3.44 inch
61925
816
2017-09-25
2017
2.72 inch
The read_noaa_data function separates the messy text file into:
stations: A DataFrame with each station’s location, name, and metadata
rainfall_data: A DataFrame with all rainfall measurements, using stnid to link to stations
5.4 Modern Data Manipulation with TidierData.jl
Now let’s explore our data using TidierData.jl, which provides a modern, readable syntax for data manipulation. TidierData brings tidyverse-style operations to Julia with intuitive verb names and pipeline workflows.
For complete documentation, visit: https://tidierorg.github.io/Tidier.jl/dev/
5.4.1 Basic TidierData Operations
Let’s start with some basic data exploration using TidierData syntax:
Start a pipeline where each operation flows into the next (like %>% in R)
2
Choose specific columns to display (similar to SQL SELECT)
816×7 DataFrame
811 rows omitted
Row
stnid
noaa_id
name
state
latitude
longitude
years_of_data
Int64
String
String
String
Float64
Float64
Int64
1
1
60-0011
CLEAR CK AT BAY AREA BLVD
TX
29.4977
-95.1599
31
2
2
60-0019
TURKEY CK AT FM 1959
TX
29.5845
-95.1869
31
3
3
60-0022
ARMAND BYU AT GENOARED BLF RD
TX
29.6345
-95.1123
31
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
815
815
87-0031
SECO CREEK AT MILLER RANCH
TX
29.5731
-99.4028
20
816
816
99-2048
COTULLA
TX
28.4567
-99.2183
116
Note that our rainfall measurements have explicit units (you’ll see inch in the output) thanks to the Unitful.jl package. This makes unit conversions automatic and error-free. For example, to convert to centimeters:
Start with rainfall DataFrame to see which years have the most station coverage
2
Group observations by year to count stations active each year
3
Calculate summary statistics for each year group
4
Count how many stations reported data in each year
5
Sort years by station count (years with most stations first)
170×2 DataFrame
165 rows omitted
Row
year
station_count
Int64
Int64
1
1995
735
2
1996
735
3
1997
735
⋮
⋮
⋮
169
1850
4
170
1849
1
5.5 Creating a Map of Weather Stations
Let’s visualize all the weather stations on a map, with colors showing how many years of data each station has.
For more advanced mapping techniques, see: https://geo.makie.org/dev/introduction
For complete Makie plotting documentation, visit: https://docs.makie.org/stable/
functionplot_stations()# Create map plot colored by years of data fig =Figure(size = (800, 600)) ga =GeoAxis(fig[1, 1]; source ="+proj=latlong", dest ="+proj=merc", title ="Texas Atlas-14 Stations", xgridvisible =false, ygridvisible =false)# Add US states (white with black borders) states = GeoMakie.naturalearth("admin_1_states_provinces_lakes", 110)poly!(ga, states.geometry; strokecolor =:black, strokewidth =1, color =:white)# Create scatter plot with viridis colormapscatter!(ga, stations.longitude, stations.latitude; color = stations.years_of_data, colormap =:viridis, markersize =10)# Set plot bounds based on data bounds plus delta delta =0.3# degrees of padding around the data bounds min_lon, max_lon =extrema(stations.longitude) min_lat, max_lat =extrema(stations.latitude)xlims!(ga, min_lon - delta, max_lon + delta)ylims!(ga, min_lat - delta, max_lat + delta)# Add colorbarColorbar(fig[1, 2], label ="Years of Data", colormap =:viridis, limits = (minimum(stations.years_of_data), maximum(stations.years_of_data)))return figendplot_stations()
1
Create a GeoAxis with map projection: converts lat/lon coordinates to Mercator for proper display
2
Download US state boundaries from Natural Earth (110 = medium resolution) for map background
3
Plot weather stations colored by years of data using the viridis color scheme
5.6 Finding Stations Near a Target Location
Let’s say we want to study rainfall at a particular point. We’ll use Rice University in Houston as our example:
Descriptive name for our target location (change this for different places!)
2
Rice University’s longitude coordinate (negative means west of Greenwich)
3
Rice University’s latitude coordinate (positive means north of equator)
29.718952
5.7 Calculating Distances to Weather Stations
Now we need to calculate how far each weather station is from our target location using the Haversine formula:
# Haversine formula for great circle distance with unitsfunctioncalc_distance(lon1, lat1, lon2, lat2) R =6378.0u"km"# Convert degrees to radians φ1=deg2rad(lat1) φ2=deg2rad(lat2) Δφ =deg2rad(lat2 - lat1) Δλ =deg2rad(lon2 - lon1)# Haversine formula a =sin(Δφ /2)^2+cos(φ1) *cos(φ2) *sin(Δλ /2)^2 c =2*atan(sqrt(a), sqrt(1- a))return R * cendstations =@chain stations begin@mutate(distance_km =calc_distance(longitude, latitude, !!target_lon, !!target_lat))@arrange(distance_km)endstations
1
Define a function that calculates great circle distance between two lat/lon points
2
Earth’s radius in kilometers (using proper units!)
3
Convert the first latitude from degrees to radians (φ is Greek phi)
4
Convert the second latitude to radians
5
Calculate difference in latitudes (Δ is Greek delta, means “change in”)
6
Calculate difference in longitudes (λ is Greek lambda)
7
First part of Haversine formula: angular distance component
8
Complete the great circle distance calculation
9
Multiply by Earth’s radius to get distance in kilometers
10
Start a TidierData pipeline with our stations DataFrame
11
Add distance column using !! to inject external variables into the TidierData expression
12
Sort the stations by distance (closest first)
13
Display the sorted stations (shows closest ones first due to ENV row limit)
816×8 DataFrame
811 rows omitted
Row
stnid
noaa_id
name
state
latitude
longitude
years_of_data
distance_km
Int64
String
String
String
Float64
Float64
Int64
Quantity…
1
780
79-0056
HOUSTON WB CITY
TX
29.7622
-95.3593
129
6.53163 km
2
377
41-4321
HOUSTON HEIGHTS
TX
29.7914
-95.4261
62
8.31921 km
3
380
41-4326
HOUSTON-PORT
TX
29.7456
-95.28
29
12.4388 km
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
⋮
815
807
79-0125
EL PASO
TX
31.7587
-106.484
167
1083.55 km
816
428
41-4931
LA TUNA 1 S
TX
31.98
-106.597
75
1098.28 km
The Haversine formula accounts for Earth’s curvature, giving much more accurate distances than simple straight-line calculations.
5.8 Examining the Closest Weather Station
Let’s focus on the closest weather station and look at its rainfall data:
Filter rainfall data using !! to inject the external variable into TidierData
129×4 DataFrame
124 rows omitted
Row
stnid
date
year
rainfall
Int64
Date
Int64
Quantity…?
1
780
1889-05-18
1889
3.18 inch
2
780
1890-04-02
1890
4.75 inch
3
780
1891-01-08
1891
5.89 inch
⋮
⋮
⋮
⋮
⋮
128
780
2016-04-17
2016
5.69 inch
129
780
2017-08-26
2017
14.7 inch
5.9 Plotting a Time Series
Now let’s create a time series plot to see how rainfall has changed over time at this station:
functionplot_time_series(station_row, rainfall_df)# Create time series plot of rainfall data fig =Figure(size = (800, 400)) ax =Axis(fig[1, 1], ylabel ="Annual Maximum Rainfall", title ="$(station_row.noaa_id): $(station_row.name)", dim2_conversion = rainfall_conversion)# Plot the time serieslines!(ax, rainfall_df.date, rainfall_df.rainfall)scatter!(ax, rainfall_df.date, rainfall_df.rainfall, markersize =10, marker =:circle, strokewidth =2, color =:transparent) figendplot_time_series(closest_station, closest_rainfall)
1
Use string interpolation $() to insert station ID and name into the plot title
2
Apply unit converter to display rainfall in inches instead of deca-inches
3
Plot using date column to show precise timing of annual maximum events
4
Add hollow circles with transparent fill to emphasize individual data points
5.10 Comparing Multiple Nearby Stations
This is interesting! How does this compare to other nearby stations? Let’s plot several stations together:
# Compare with other nearby stationsfunctionplot_nearby_comparison(stations_df, rainfall_data; n_stations =4) fig =Figure(size = (800, 400)) ax =Axis(fig[1, 1], ylabel ="Annual Maximum Rainfall", title ="Comparison of $(n_stations) Nearest Stations to $(target_location)", dim2_conversion = rainfall_conversion) colors = [get(colorschemes[:viridis], i / n_stations) for i in0:n_stations-1]for (i, station) inenumerate(eachrow(stations_df[1:n_stations, :])) station_id = station.stnid rainfall =@chain rainfall_data begin@filter(stnid == !!station_id)endlines!(ax, rainfall.year, rainfall.rainfall, color = colors[i], linewidth =2, label ="$(station.noaa_id)")scatter!(ax, rainfall.year, rainfall.rainfall, color = colors[i], markersize =10)endxlims!(ax, 1950, 2025)axislegend(ax, position =:lt)return figendplot_nearby_comparison(stations, rainfall_data; n_stations =4)
1
Generate evenly-spaced colors from the viridis palette for each station
2
Loop through the first n_stations closest weather stations
3
Use !! to inject the loop variable into the TidierData filter expression
4
Add a legend in the top-left corner to identify different stations
5.11 GEV Distribution and Return Period Analysis
Now we’ll work with a GEV distribution to create return period plots and compare with observed data using Weibull plotting positions.
Your task: Experiment with different GEV parameters to see how well you can match the observed data from your closest station. You’ll need to systematically test each parameter to answer the questions at the end. Start with these parameters, then modify them one at a time:
# GEV parameters for Houston area (units: inches)# Try modifying these values to better fit your data!μ =4.0# Location parameter - shifts the distribution left/rightσ =1.3# Scale parameter - controls spread (must be > 0)ξ =0.15# Shape parameter - controls tail behaviorhouston_gev =GeneralizedExtremeValue(μ, σ, ξ)
1
Create GEV distribution with the specified parameters
# Function to calculate return levels from GEVfunctiongev_return_level(gev_dist, return_period) μ, σ, ξ =params(gev_dist)ifabs(ξ) <1e-6# Gumbel casereturn μ - σ *log(-log(1-1/ return_period))elsereturn μ - (σ / ξ) * (1- (-log(1-1/ return_period))^(-ξ))endend# Create return period plotfunctionplot_return_periods(gev_dist) fig =Figure(size = (800, 400)) ax =Axis(fig[1, 1], xlabel ="Return Period (years)", ylabel ="Return Level (inches)", title ="GEV Return Period Plot", xscale = log10)# Plot curve and highlight standard return periods T_smooth =10.^range(log10(1.1), log10(250), length =100) levels_smooth = [gev_return_level(gev_dist, T) for T in T_smooth]lines!(ax, T_smooth, levels_smooth, color =:blue, linewidth =2, label ="GEV Model") return_periods = [5, 10, 25, 50, 100, 250] ax.xticks = [5, 10, 25, 50, 100, 250]return figendplot_return_periods(houston_gev)
1
Extract GEV parameters from distribution object
2
Gumbel formula when \xi \approx 0
3
General GEV return level formula
4
Log scale for x-axis
5
Standard return periods for flood analysis
5.13 Weibull Plotting Positions - Comparing Theory vs Reality
Now let’s see how well your chosen GEV parameters match the actual observed data! The red dots show the empirical return periods calculated from your station’s data using Weibull plotting positions.
What to look for as you experiment: - Do the red dots follow the blue GEV curve? - Are there systematic deviations at high or low return periods? - How does each parameter change affect the fit?
Systematic experimentation approach: 1. First, try increasing \mu by 1-2 inches, then decreasing it. Note the effect. 2. Next, try increasing \sigma by 0.5, then decreasing it. Note the effect. 3. Finally, try changing \xi to 0.0 (Gumbel), then to -0.1 (light tail), then 0.3 (heavy tail). Note the effect. 4. Find your best combination!
# Weibull plotting positionsfunctionweibull_plotting_positions(data) clean_data =sort(collect(skipmissing(data))) n =length(clean_data) plotting_positions = [i / (n +1) for i in1:n] empirical_return_periods = [1/ (1- p) for p in plotting_positions]return clean_data, empirical_return_periodsend# Plot empirical vs theoreticalfunctionplot_empirical_vs_theoretical(station_data, gev_dist, station_info)# Start with the basic return period plot fig =plot_return_periods(gev_dist) ax = fig.content[1] # Get the actual Axis object# Update title to include station info ax.title ="Empirical vs Theoretical Return Periods\n$(station_info.noaa_id): $(station_info.name)"# Add empirical data with Weibull positions emp_levels, emp_periods =weibull_plotting_positions(station_data.rainfall)scatter!(ax, emp_periods, emp_levels, color =:red, markersize =10, label ="Observed Data (Weibull positions)")axislegend(ax, position =:rb)return figendplot_empirical_vs_theoretical(closest_rainfall, houston_gev, closest_station)
6 Analysis Questions
6.1 Temporal Trends
Explore the rainfall data to find one station where there appears to be a positive trend in rainfall over time and one station where there appears to be a negative trend. We will discuss more formal ways for thinking about trends in the coming weeks.
Station with apparent positive trend:
Write your answer here. Include the station name/ID and explain what you observe in the data that suggests an increasing trend.
Station with apparent negative trend:
Write your answer here. Include the station name/ID and explain what you observe in the data that suggests a decreasing trend.
6.2 GEV Parameter Experimentation
Starting Parameters vs Final Parameters:
Record your starting parameters and the final parameters you chose after experimentation:
Starting: \mu = 4.0, \sigma = 1.3, \xi = 0.15
Final: \mu = \ldots\sigma = \ldots, \xi = \ldots
Parameter Effects:
What happened when you increased/decreased \mu (location parameter)?
Test values like \mu = 2.0, 4.0, 6.0. Describe how the entire GEV curve shifted and how this affected the fit to your red data points.
What happened when you increased/decreased \sigma (scale parameter)?
Test values like \sigma = 0.8, 1.3, 2.0. Describe how the curve became steeper/flatter and how this affected the spread of return levels.
What happened when you changed \xi (shape parameter)?
Test \xi = -0.1 (light tail), \xi = 0.0 (Gumbel), \xi = 0.15, \xi = 0.3 (heavy tail). Focus on how the high return period end (right side) of the curve changed.
Final Assessment:
How well does your best-fit GEV distribution represent the closest station’s data?
Do the red dots follow your blue curve? Where are the biggest deviations?
Additional observations:
Optional: Add any other interesting patterns you notice in the return period analysis.