using Pkg
lab_dir = dirname(@__FILE__)
Pkg.activate(lab_dir)
# Pkg.instantiate() # uncomment this the first time you run the lab to install packages, then comment it backCEVE 543 Fall 2025 Lab 9: Weather Typing with Hidden Markov Models
Temporal clustering of synoptic patterns using HMMs
1 Background and Reading
This lab extends the weather typing analysis from Lab 8, replacing K-means clustering with Hidden Markov Models (HMMs). While K-means treats each day independently, HMMs explicitly model temporal dependencies between consecutive weather patterns. This is crucial for understanding weather persistence and transitions between regimes.
2 Code Setup
Begin with project management:
Load all required packages:
using YAXArrays
using NetCDF
using CairoMakie
using GeoMakie
using MultivariateStats
using Statistics
using LaTeXStrings
using DimensionalData
using NaturalEarth
using HiddenMarkovModels
using Distributions
using Random
using Dates
using LinearAlgebra: diagIt’s always good practice to set a random seed for reproducibility:
Random.seed!(543)TaskLocalRNG()
Define plotting helpers and colormaps:
cmap_absolute = :plasma
cmap_diverging = :PuOr
function diverging_colormap(data, colormap=:PuOr)
max_abs = maximum(abs.(skipmissing(data)))
return (colormap, (-max_abs, max_abs))
end
function create_geoaxis(fig, position; title="", kwargs...)
return GeoAxis(fig[position...],
title=title,
dest="+proj=latlong +datum=WGS84",
xgridvisible=false,
ygridvisible=false,
xticksvisible=false,
yticksvisible=false,
xticklabelsvisible=false,
yticklabelsvisible=false;
kwargs...)
end
# Load country boundaries for plotting
countries = naturalearth("admin_0_countries", 110)3 Data Loading and PCA
We consolidate the data loading and PCA into a single comprehensive function. This function handles all preprocessing steps from Lab 8:
"""
prepare_pca_data(data_file; n_pcs=4)
Load streamfunction data, calculate anomalies, and perform PCA.
Returns a named tuple with:
- ψ_climatology: Mean streamfunction field
- ψ_anomaly: Anomaly fields (3D YAXArray)
- pc_scores: PC time series (ntimes × n_pcs)
- eof_patterns: Spatial patterns (reshaped for plotting)
- pca_model: Fitted PCA model
- explained_var_ratio: Variance explained by each PC
- coordinates: (lon, lat, time) for plotting
"""
function prepare_pca_data(data_file; n_pcs=4)
# Load dataset
ds = open_dataset(data_file)
ψ = ds[:streamfunction] ./ 1e6
# Convert longitude from 0-360 to -180-180
lon_values = Array(ψ.lon)
lon_converted = ifelse.(lon_values .> 180, lon_values .- 360, lon_values)
ψ = DimensionalData.set(ψ, :lon => lon_converted)
# Load into memory
ψ = readcubedata(ψ)
# Calculate climatology and anomalies
ψ_climatology = mean(ψ, dims=:time)[:, :, 1]
ψ_anomaly = ψ .- ψ_climatology
# Convert to array and reshape for PCA
ψ_array = Array(ψ_anomaly)
nlons, nlats, ntimes = size(ψ_array)
ψ_matrix = reshape(ψ_array, (nlons * nlats, ntimes))
# Fit PCA model
pca_model = fit(PCA, ψ_matrix; maxoutdim=n_pcs)
# Extract PC scores and EOFs
pc_scores = MultivariateStats.transform(pca_model, ψ_matrix)'
eof_patterns_flat = projection(pca_model)
# Calculate explained variance
explained_var = principalvars(pca_model)
total_var = var(pca_model)
explained_var_ratio = explained_var ./ total_var
# Reshape EOFs for plotting
eof_patterns = [reshape(eof_patterns_flat[:, i], (nlons, nlats)) for i in 1:n_pcs]
# Store coordinates
coordinates = (
lon=Array(ψ_climatology.lon),
lat=Array(ψ_climatology.lat),
time=Array(lookup(ψ_anomaly, :time)),
)
return (
ψ_climatology=ψ_climatology,
ψ_anomaly=ψ_anomaly,
pc_scores=pc_scores,
eof_patterns=eof_patterns,
pca_model=pca_model,
explained_var_ratio=explained_var_ratio,
coordinates=coordinates,
)
endLoad and process the data. We use the same NCEP streamfunction data from Lab 8:
streamfunction_file = joinpath(lab_dir, "ncep_streamfunction_regional_1979_2025.nc")
pca_data = prepare_pca_data(streamfunction_file; n_pcs=4)Create helper function for adding boundaries:
lon_lims = extrema(pca_data.coordinates.lon)
lat_lims = extrema(pca_data.coordinates.lat)
function add_boundaries!(ax)
poly!(ax,
GeoMakie.to_multipoly(countries.geometry);
color=:transparent,
strokecolor=:black,
strokewidth=1.5,
)
xlims!(ax, lon_lims...)
ylims!(ax, lat_lims...)
endThus far, everything should look the same as Lab 8. You can visualize the EOF patterns and explained variance as before if you wish.
5 Understanding HMM Components
5.1 Transition Matrix
The transition matrix shows the probability of transitioning from one state to another. Diagonal elements represent persistence (staying in the same state).
let
trans_matrix = transition_matrix(hmm_trained)
fig = Figure(size=(800, 700))
ax = Axis(fig[1, 1],
xlabel="To State",
ylabel="From State",
title="Transition Probability Matrix",
xticks=1:n_states,
yticks=1:n_states,
aspect=DataAspect())
hm = heatmap!(ax, trans_matrix, colormap=:Blues, colorrange=(0, 1))
# Add text annotations
for i in 1:n_states
for j in 1:n_states
text!(ax, i, j,
text=string(round(trans_matrix[i, j], digits=2)),
align=(:center, :center),
color=:black,
fontsize=14)
end
end
Colorbar(fig[1, 2], hm, label="Probability")
fig
end5.2 Emission Distributions vs Composites
This is a key conceptual point: The emission distributions are multivariate normals in PC space, while composites are spatial averages.
First, examine the emission distribution parameters:
let
obs_dists = obs_distributions(hmm_trained)
println("Emission distribution parameters (mean in PC space):")
for i in 1:n_states
μ = mean(obs_dists[i])
σ = sqrt.(diag(cov(obs_dists[i])))
println("\nState $i:")
println(" Mean: ", round.(μ, digits=2))
println(" Std: ", round.(σ, digits=2))
end
endEmission distribution parameters (mean in PC space):
State 1:
Mean: [31.24, 1.69, 0.63, -0.4]
Std: [13.27, 10.18, 7.89, 3.7]
State 2:
Mean: [-3.0, 5.53, 2.68, -2.54]
Std: [12.27, 8.5, 6.44, 2.89]
State 3:
Mean: [0.21, 0.48, 4.86, 5.29]
Std: [14.97, 10.03, 8.3, 3.19]
State 4:
Mean: [4.15, -10.38, -2.2, -1.32]
Std: [13.01, 7.54, 6.8, 3.45]
State 5:
Mean: [-8.5, 3.35, -12.01, 0.81]
Std: [16.41, 11.23, 5.57, 4.24]
State 6:
Mean: [-33.54, -0.05, 3.3, -0.56]
Std: [13.9, 11.0, 8.92, 4.33]
Now create spatial maps in two different ways and compare them. First, reconstruct maps from the emission distribution means (the learned PC weights). Second, create composites by averaging all actual days assigned to each state:
let
# Load full anomaly array
ψ_array = Array(pca_data.ψ_anomaly)
nlons, nlats, ntimes = size(ψ_array)
# Method 1: Reconstruct from emission means (PC weights times EOF patterns)
obs_dists = obs_distributions(hmm_trained)
emission_maps = zeros(nlons, nlats, n_states)
for k in 1:n_states
μ = mean(obs_dists[k]) # Mean PC scores for this state
# Use MultivariateStats.reconstruct to properly inverse transform
ψ_reconstructed = MultivariateStats.reconstruct(pca_data.pca_model, μ)
emission_maps[:, :, k] = reshape(ψ_reconstructed, nlons, nlats)
end
# Method 2: Composite from actual days
# Filter to only NDJF seasons (matching the HMM training data)
times = pca_data.coordinates.time
years = unique(year.(times))
ndjf_mask = falses(length(times))
for y in years
season_mask = ((year.(times) .== y) .& (month.(times) .>= 11)) .|
((year.(times) .== y + 1) .& (month.(times) .<= 2))
ndjf_mask .|= season_mask
end
ψ_ndjf = ψ_array[:, :, ndjf_mask]
composites = zeros(nlons, nlats, n_states)
for k in 1:n_states
cluster_days = findall(all_states .== k)
if length(cluster_days) > 0
cluster_data = ψ_ndjf[:, :, cluster_days]
composites[:, :, k] = mean(cluster_data, dims=3)[:, :, 1]
end
end
# Determine color ranges separately
# Emission means are in PCA-reconstructed space
# Composites are in original anomaly space
crange_emission = maximum(abs.(extrema(emission_maps)))
crange_emission = (-crange_emission, crange_emission)
crange_composite = maximum(abs.(extrema(composites)))
crange_composite = (-crange_composite, crange_composite)
# Grid layout (3 columns, flexible rows)
ncols = 3
nrows = ceil(Int, n_states / ncols)
# Plot 1: Emission mean reconstructions
fig1 = Figure(size=(1200, 400 * nrows))
for k in 1:n_states
row = div(k - 1, ncols) + 1
col = mod(k - 1, ncols) + 1
ax = create_geoaxis(fig1, (row, col); title="State $k: Emission Mean")
heatmap!(ax, pca_data.coordinates.lon, pca_data.coordinates.lat,
emission_maps[:, :, k]; colormap=cmap_diverging, colorrange=crange_emission)
add_boundaries!(ax)
end
Colorbar(fig1[:, ncols+1], limits=crange_emission, colormap=cmap_diverging,
label=L"$\psi$ Anomaly ($10^6$ m$^2$/s)")
fig1
endlet
# Load full anomaly array
ψ_array = Array(pca_data.ψ_anomaly)
nlons, nlats, ntimes = size(ψ_array)
# Filter to NDJF seasons (matching HMM training data)
times = pca_data.coordinates.time
years = unique(year.(times))
ndjf_mask = falses(length(times))
for y in years
season_mask = ((year.(times) .== y) .& (month.(times) .>= 11)) .|
((year.(times) .== y + 1) .& (month.(times) .<= 2))
ndjf_mask .|= season_mask
end
ψ_ndjf = ψ_array[:, :, ndjf_mask]
# Composite from actual days
composites = zeros(nlons, nlats, n_states)
for k in 1:n_states
cluster_days = findall(all_states .== k)
if length(cluster_days) > 0
cluster_data = ψ_ndjf[:, :, cluster_days]
composites[:, :, k] = mean(cluster_data, dims=3)[:, :, 1]
end
end
# Color range based on composites
crange = maximum(abs.(extrema(composites)))
crange = (-crange, crange)
# Grid layout
ncols = 3
nrows = ceil(Int, n_states / ncols)
# Plot 2: Composites
fig2 = Figure(size=(1200, 400 * nrows))
for k in 1:n_states
row = div(k - 1, ncols) + 1
col = mod(k - 1, ncols) + 1
ax = create_geoaxis(fig2, (row, col);
title="State $k: Composite (n=$(sum(all_states .== k)))")
heatmap!(ax, pca_data.coordinates.lon, pca_data.coordinates.lat,
composites[:, :, k]; colormap=cmap_diverging, colorrange=crange)
add_boundaries!(ax)
end
Colorbar(fig2[:, ncols+1], limits=crange, colormap=cmap_diverging,
label=L"$\psi$ Anomaly ($10^6$ m$^2$/s)")
fig2
endThe emission distributions define probability densities in 4D PC space (multivariate normals). The composite maps show the average circulation pattern for all days assigned to each state.
While related, these are different representations:
- Emission distributions tell us the typical PC values for each state
- Composites show the typical spatial pattern in the original data space
- Days assigned to a state have PC values drawn from the emission distribution
- The composite is the average of projecting those PC values back to physical space
6 Analysis Questions
K-means assigns each day to its nearest cluster center in PC space, treating all days independently. HMMs learn both where states live in PC space (emission distributions) AND how they transition over time (transition matrix). Even with random initialization, Baum-Welch discovers temporal structure if it exists in the data. The HMM might assign a day to a “suboptimal” state (in terms of PC distance alone) if doing so makes the temporal sequence more probable overall. K-means cannot do this - it fundamentally ignores temporal order.
Fit HMMs with 3, 4, 5, and 7 states. How does the transition matrix structure change? Is temporal persistence stronger with fewer or more states?
Run the same HMM configuration (6 states, 4 PCs) with 3 different random seeds. How different are the resulting state patterns and transition matrices?
Compare results to the K-means clustering from lab 8. Where do they agree and disagree? Why?
Write code to generate synthetic weather sequences from your fitted HMM using the
rand()function. Create a 120-day sequence with corresponding PC values and spatial maps for a few days. Compare state frequencies, transition frequencies, and PC distributions between real and synthetic data.Describe how you would extend this to predict rainfall. What emission distribution would you use? How would you validate whether HMM states relate to rainfall patterns? What are the fundamental limitations for extreme rainfall?