using Pkg
lab_dir = dirname(@__FILE__)
Pkg.activate(lab_dir)
# Pkg.instantiate() # uncomment this the first time you run the lab to install packages, then comment it backCEVE 543 Fall 2025 Lab 7: Bias Correction Implementation
Delta method and quantile mapping for temperature bias correction
1 Background and Goals
Climate models have systematic biases that directly affect impact assessments, arising from coarse spatial resolution, parameterization of sub-grid processes, representation of topography, and errors in simulated circulation patterns. This lab implements two widely-used bias correction approaches: the delta method and quantile-quantile (QQ) mapping. The delta method preserves the climate model’s change signal while anchoring absolute values to observations, whereas QQ-mapping corrects the full distribution of values.
Both methods assume stationarity—that the statistical relationship between model and observations remains constant across climate states. This assumption may not hold under significant climate change. We’ll explore the strengths and limitations of each method using temperature data for Boston, providing hands-on experience before PS2 Part 1.
2 Study Location and Data
This lab uses temperature data for Boston Logan International Airport (Station ID: USW00014739, 42.3631^\circN, 71.0064^\circW). Observational data comes from GHCN-Daily (1936-2024) and is provided in the USW00014739.csv file in degrees Celsius. Climate model data comes from the GFDL-ESM4 model’s 3-hourly near-surface air temperature (tas), pre-downloaded from Google Cloud Storage for both historical (1850-2014) and SSP3-7.0 (2015-2100) scenarios. Refer to Lab 6 for details on CMIP6 data structure.
2.1 Data Processing Notes
The GHCN data provides daily average temperature in degrees Celsius. CMIP6 provides 3-hourly instantaneous temperature in Kelvin, which we’ll convert to Celsius and aggregate to daily averages. The pre-downloaded NetCDF files (boston_historical.nc and boston_ssp370.nc) contain 3-hourly surface air temperature for the grid cell nearest to Boston. We aggregate 8 consecutive 3-hour periods to approximate daily averages, ignoring daylight saving time—a simplification typical in many bias correction applications. If you’re interested in applying these methods to a different location, the download_data.jl script demonstrates how to extract CMIP6 data from Google Cloud Storage.
Before starting the lab, uncomment the Pkg.instantiate() line in the first code block and run it to install all required packages. This will take a few minutes the first time. After installation completes, comment the line back out to avoid reinstalling on subsequent runs.
3 Lab Implementation
3.1 Package Setup
using CSV, CairoMakie, DataFrames, Dates, LaTeXStrings, NCDatasets, Statistics, StatsBase, TidierData
ENV["DATAFRAMES_ROWS"] = 6
CairoMakie.activate!()
# Constants for plotting
const MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
const MONTH_LABELS = (1:12, MONTH_NAMES)3.2 Task 1: Load and Process Data
We begin by loading both observational and climate model data. The observational data from GHCN-Daily spans 1936-2024, while the CMIP6 data requires processing from 3-hourly to daily resolution. We’ll use the overlapping historical period (1995-2014) to calibrate bias corrections.
Load the Boston GHCN temperature data from USW00014739.csv and filter to years with at least 80% complete data. Then load the pre-downloaded CMIP6 NetCDF files and aggregate the 3-hourly temperature data to daily values for the historical (1995-2014) and near-term future (2020-2040) periods. The helper functions load_cmip6_data() and aggregate_to_daily() are provided below. Finally, visualize the annual cycle comparing observations vs the historical GCM simulation.
# Load and clean the observational data
data_path = joinpath(lab_dir, "USW00014739.csv")
df = @chain begin
CSV.read(data_path, DataFrame)
# Compute daily average from min and max, preserving missing values
@mutate(
TAVG = ifelse.(ismissing.(TMIN) .| ismissing.(TMAX), missing, (TMIN .+ TMAX) ./ 2),
)
# GHCN data is in tenths of degrees C, so divide by 10
@mutate(TAVG = TAVG / 10.0)
# Rename DATE column to lowercase for consistency
@rename(date = DATE)
# Extract year and month for aggregation
@mutate(year = year(date), month = month(date))
# Keep only the columns we need
@select(date, year, month, TAVG)
end
# Filter to years with at least 80% complete data
# This removes years with too many missing observations
yearly_counts = @chain df begin
@group_by(year)
@summarize(n_obs = sum(!ismissing(TAVG)), n_total = n())
@mutate(frac_complete = n_obs / n_total)
end
good_years_df = @chain yearly_counts begin
@filter(frac_complete >= 0.8)
end
good_years = good_years_df.year
# Keep only complete years and drop remaining missing values
df_clean = @chain df begin
@filter(year in !!good_years)
dropmissing(:TAVG)
end# Helper functions for CMIP6 data
"""
Load CMIP6 temperature data from a local NetCDF file and convert times to DateTime.
This function reads NetCDF files (the standard format for CMIP6 data) and extracts:
- tas_data: temperature values in Kelvin
- time_cf: time coordinates in CF (Climate & Forecast) format
The CF time format is then converted to Julia's DateTime type for easier manipulation.
"""
function load_cmip6_data(file_path::String)
ds = NCDataset(file_path)
tas_data = ds["tas"][:] # Temperature at surface (tas)
time_cf = ds["time"][:] # Time in CF format
close(ds)
# Convert CF time objects to Julia DateTime
# We extract year, month, day, hour, minute, second from each CF time
time_data = [DateTime(
Dates.year(t), Dates.month(t), Dates.day(t),
Dates.hour(t), Dates.minute(t), Dates.second(t)
) for t in time_cf]
return tas_data, time_data
end
"""
Aggregate 3-hourly temperature data to daily averages.
CMIP6 provides 3-hourly instantaneous temperatures (8 values per day).
We compute daily averages by grouping consecutive 3-hour periods.
This ignores daylight saving time - a common simplification in climate studies.
"""
function aggregate_to_daily(tas_3hr, time_3hr)
n_3hr_per_day = 8 # 24 hours / 3 hours per timestep
n_days = div(length(tas_3hr), n_3hr_per_day)
# Preallocate output arrays
daily_temp = Vector{Float64}(undef, n_days)
daily_dates = Vector{Date}(undef, n_days)
for i in 1:n_days
# Get indices for this day's 8 timesteps
idx_start = (i - 1) * n_3hr_per_day + 1
idx_end = i * n_3hr_per_day
# Compute mean, handling any missing values
daily_vals = collect(skipmissing(tas_3hr[idx_start:idx_end]))
daily_temp[i] = isempty(daily_vals) ? NaN : mean(daily_vals)
# Use the first timestep's date as the day's date
daily_dates[i] = Date(time_3hr[idx_start])
end
# Convert from Kelvin to Celsius
daily_temp_c = daily_temp .- 273.15
return daily_temp_c, daily_dates
end
# Load the full CMIP6 files (historical and SSP3-7.0 scenario)
hist_file = joinpath(lab_dir, "boston_historical.nc")
ssp370_file = joinpath(lab_dir, "boston_ssp370.nc")
tas_hist_3hr, time_hist_3hr = load_cmip6_data(hist_file)
tas_ssp370_3hr, time_ssp370_3hr = load_cmip6_data(ssp370_file)
# Extract and process historical period (1995-2014) for bias correction calibration
hist_start = DateTime(1995, 1, 1)
hist_end = DateTime(2014, 12, 31, 23, 59, 59)
hist_idx = (time_hist_3hr .>= hist_start) .& (time_hist_3hr .<= hist_end)
tas_hist_daily, dates_hist_daily = aggregate_to_daily(tas_hist_3hr[hist_idx], time_hist_3hr[hist_idx])
# Extract and process near-term future period (2020-2040) to apply corrections
near_start = DateTime(2020, 1, 1)
near_end = DateTime(2040, 12, 31, 23, 59, 59)
near_idx = (time_ssp370_3hr .>= near_start) .& (time_ssp370_3hr .<= near_end)
tas_ssp370_near_daily, dates_ssp370_near_daily = aggregate_to_daily(tas_ssp370_3hr[near_idx], time_ssp370_3hr[near_idx])
# Create DataFrames for easier manipulation with TidierData
df_gcm_hist = DataFrame(
date=dates_hist_daily, temp=tas_hist_daily,
year=year.(dates_hist_daily), month=month.(dates_hist_daily)
)
df_ssp370_near = DataFrame(
date=dates_ssp370_near_daily, temp=tas_ssp370_near_daily,
year=year.(dates_ssp370_near_daily), month=month.(dates_ssp370_near_daily)
)
# Filter observations to match the GCM historical period
df_obs_hist = @chain df_clean begin
@filter(year >= 1995 && year <= 2014)
end
# Compute monthly climatologies (long-term monthly averages)
obs_monthly = @chain df_obs_hist begin
@group_by(month)
@summarize(mean_temp = mean(TAVG))
end
gcm_monthly = @chain df_gcm_hist begin
@group_by(month)
@summarize(mean_temp = mean(temp))
endlet
fig = Figure()
ax = Axis(fig[1, 1],
xlabel="Month",
ylabel=L"Temperature ($^\circ$C)",
title="Annual Cycle: Observations vs GCM Historical (1995-2014)",
xticks=MONTH_LABELS)
lines!(ax, obs_monthly.month, obs_monthly.mean_temp, linewidth=2, color=:steelblue, label="Observations")
lines!(ax, gcm_monthly.month, gcm_monthly.mean_temp, linewidth=2, color=:coral, label="GCM Historical")
axislegend(ax, position=:lt)
fig
end3.3 Task 2: Implement Delta Method
The delta method corrects the mean bias while preserving the climate model’s projected change signal. We calculate a monthly bias correction based on the historical period (1995-2014) and apply it to future projections.
Implement the additive delta method for temperature bias correction. For each calendar month m, calculate the mean bias: \Delta_m = \bar{T}^\text{GCM}_{\text{hist},m} - \bar{T}^\text{obs}_{\text{hist},m}. Then apply the correction to future values: T^\text{corr}_{\text{fut}}(d, m, y) = T^\text{GCM}_{\text{fut}}(d, m, y) - \Delta_m.
Follow these steps:
- Calculate monthly mean bias: Create a DataFrame called
monthly_biaswith columnsmonth,gcm_mean,obs_mean, andbias. Use@chain,@group_by,@summarize,@left_join, and@mutateto compute the bias for each month. - Visualize the bias: Create a bar plot using
barplot!()showingmonthly_bias.biasby month. Setxticks=MONTH_LABELS, add a horizontal line at zero withhlines!(), and label the y-axis as “Temperature Bias (^\circC)”. - Write the correction function: Implement
apply_delta_method(gcm_temps, gcm_dates, monthly_bias_df)that:- Creates a vector
corrected_tempsto store results - Loops through each temperature value
- Uses
month(date)to extract the month from each date - Looks up the bias for that month in
monthly_bias_df - Subtracts the bias from the temperature
- Creates a vector
- Apply the correction: Add a new column
temp_deltatodf_ssp370_nearby calling your function withdf_ssp370_near.temp,df_ssp370_near.date, andmonthly_bias. - Compute monthly climatologies: Create
near_monthly_rawandnear_monthly_deltaDataFrames by groupingdf_ssp370_nearby month and computing the mean oftempandtemp_deltarespectively. - Create comparison plot: Make a figure showing three lines (use
lines!()for each):obs_monthly(blue, labeled “Historical Obs”)near_monthly_raw(coral/red, labeled “GCM Raw”)near_monthly_delta(green with dashed linestyle, labeled “Delta Corrected”)
xticks=MONTH_LABELS, add a legend withaxislegend(), and title it “Delta Method: Annual Cycle for SSP3-7.0 Near-Term (2020-2040)”.
- Use
Dates.month()to extract the month number from aDateobject - To look up bias for month
m:monthly_bias_df[monthly_bias_df.month .== m, :bias][1] - Use
similar(gcm_temps)to create an output vector of the same size and type - The figure should show that delta correction shifts the GCM output down but preserves the warming signal
# Your code here3.4 Task 3: Implement Quantile-Quantile Mapping
Unlike the delta method, QQ-mapping transforms the entire probability distribution of model output to match observations. For each value in the future model output, we find its percentile in the historical model distribution, then map it to the same percentile in the historical observed distribution.
Implement QQ-mapping to correct the full distribution of temperature values.
Follow these steps:
- Extract temperature vectors: Create
obs_hist_tempsby convertingdf_obs_hist.TAVGtoFloat64, and creategcm_hist_tempsfromdf_gcm_hist.temp. - Fit empirical CDFs: Use
ecdf()from StatsBase.jl to createecdf_obsandecdf_gcmfrom the historical data. - Write the mapping function: Implement
apply_qqmap_empirical(gcm_temps, ecdf_gcm, obs_hist_temps)that:- Creates a vector
corrected_tempsto store results - Loops through each temperature value in
gcm_temps - Finds its percentile
pin the GCM distribution usingecdf_gcm(gcm_temps[i]) - Clamps
pto [0.001, 0.999] usingclamp(p, 0.001, 0.999) - Maps to the same percentile in observations using
quantile(obs_hist_temps, p)
- Creates a vector
- Apply the correction: Add a new column
temp_qqmaptodf_ssp370_nearby calling your function withdf_ssp370_near.temp,ecdf_gcm, andobs_hist_temps. - Create QQ-plot:
- Define
percentiles = 0.01:0.01:0.99 - Compute
obs_quantilesandgcm_hist_quantilesusingquantile()for each dataset - Create a scatter plot with
obs_quantileson the x-axis andgcm_hist_quantileson the y-axis - Add a 1:1 reference line using
lines!(ax, [-20, 40], [-20, 40], ...)with dashed linestyle - Use
aspect=DataAspect()to make the plot square - Label axes as “Observed Quantiles (^\circC)” and “GCM Quantiles (^\circC)”
- Define
Float64.(vector)converts a vector to Float64 typeecdf_gcm(value)returns the percentile (0 to 1) ofvaluein the GCM distributionquantile(obs_hist_temps, p)returns the temperature value at percentilepin the observed distribution- The
clamp(x, low, high)function constrainsxto the range [low, high] - Points on the 1:1 line in a QQ-plot indicate perfect distributional agreement
# Your code here3.5 Task 4: Compare Methods
Now we compare the delta method and QQ-mapping approaches to understand their strengths and limitations.
Create a comprehensive comparison figure showing monthly mean temperature for the near-term period (2020-2040).
Follow these steps:
- Compute QQ-mapped monthly climatology: Create
near_monthly_qqmapby groupingdf_ssp370_nearby month and computing the mean oftemp_qqmap. - Create comparison figure: Make a single figure with four lines using different colors and line styles:
- Historical observations (
obs_monthly.mean_temp): black, solid line, linewidth=2.5, label=“Historical Obs” - Raw GCM near-term (
near_monthly_raw.mean_temp): coral/red, solid line, linewidth=2, label=“Raw GCM” - Delta-corrected (
near_monthly_delta.mean_temp): green, dashed line (linestyle=:dash), linewidth=2, label=“Delta” - QQ-mapped (
near_monthly_qqmap.mean_temp): purple, dotted line (linestyle=:dot), linewidth=2, label=“QQ-map”
- Historical observations (
- Format the plot:
- Set
xticks=MONTH_LABELSto show month names - Title: “Annual Cycle Comparison: All Methods (Near-term 2020-2040)”
- Y-axis label: “Temperature (^\circC)” using LaTeX formatting
- Add legend with
axislegend(ax, position=:lt)
- Set
After creating the plot, examine it carefully and consider:
- How do the two correction methods differ in their treatment of future temperatures?
- What does the position of the QQ-mapped line reveal about this method’s treatment of the warming signal?
- Which method shows temperatures closer to historical observations? Why might this be problematic for climate impact assessment?
- All four DataFrames should have a
monthcolumn and a column with mean temperatures - Use distinct colors and linestyles to make the lines easily distinguishable
- The figure should clearly show that QQ-mapping dampens the warming signal compared to the delta method
# Your code here3.6 Task 5: Reflection
Finally, we reflect on the assumptions, limitations, and appropriate use cases for each bias correction method.
Write brief responses (2-3 sentences each) to the following questions:
Method selection: If you were providing climate data to support a decision about urban heat management in Boston, which bias correction method would you recommend and why?
Appropriateness of QQ-mapping: In the Ines and Hansen (2006) paper we discussed, QQ-mapping was used to correct both the frequency and intensity of rainfall for crop modeling. For temperature data, does it make sense to correct the full distribution? Why or why not? When might the delta method be more appropriate than QQ-mapping?
3.6.1 Method Selection for Urban Heat Management
Your response here
3.6.2 Appropriateness of QQ-Mapping for Temperature
Your response here