CEVE 543 Fall 2025 Lab 9: Weather Typing with Hidden Markov Models

Temporal clustering of synoptic patterns using HMMs

Author

CEVE 543 Fall 2025

Published

Fri., Nov. 7

1 Background and Reading

This lab extends the weather typing analysis from Lab 8, replacing K-means clustering with Hidden Markov Models (HMMs). While K-means treats each day independently, HMMs explicitly model temporal dependencies between consecutive weather patterns. This is crucial for understanding weather persistence and transitions between regimes.

1.1 Hidden Markov Models

An HMM consists of:

  • Hidden states: Unobserved weather regimes (what we want to identify)
  • Observations: PC scores we actually measure
  • Emission distributions: How each state generates observations (multivariate normal in PC space)
  • Transition matrix: Probabilities of switching between states from one day to the next
  • Initial distribution: Probability of starting in each state
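
Putting these pieces together: for hidden states $z_{1:T}$ and observations $x_{1:T}$, the joint probability factorizes as

$$
p(x_{1:T}, z_{1:T}) = p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(x_t \mid z_t),
$$

where $p(z_1)$ is the initial distribution, $p(z_t \mid z_{t-1})$ is an entry of the transition matrix, and $p(x_t \mid z_t)$ is the emission density of state $z_t$. Training maximizes the likelihood of the observations after summing over all possible hidden state sequences.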

2 Code Setup

Begin with project management:

using Pkg
lab_dir = dirname(@__FILE__)
Pkg.activate(lab_dir)
# Pkg.instantiate() # uncomment this the first time you run the lab to install packages, then comment it back

Load all required packages:

using YAXArrays
using NetCDF
using CairoMakie
using GeoMakie
using MultivariateStats
using Statistics
using LaTeXStrings
using DimensionalData
using NaturalEarth
using HiddenMarkovModels
using Distributions
using Random
using Dates
using LinearAlgebra: diag, Diagonal

It’s always good practice to set a random seed for reproducibility:

Random.seed!(543)
TaskLocalRNG()

Define plotting helpers and colormaps:

cmap_absolute = :plasma
cmap_diverging = :PuOr

function diverging_colormap(data, colormap=:PuOr)
    max_abs = maximum(abs.(skipmissing(data)))
    return (colormap, (-max_abs, max_abs))
end

function create_geoaxis(fig, position; title="", kwargs...)
    return GeoAxis(fig[position...],
        title=title,
        dest="+proj=latlong +datum=WGS84",
        xgridvisible=false,
        ygridvisible=false,
        xticksvisible=false,
        yticksvisible=false,
        xticklabelsvisible=false,
        yticklabelsvisible=false;
        kwargs...)
end

# Load country boundaries for plotting
countries = naturalearth("admin_0_countries", 110)

3 Data Loading and PCA

We consolidate the data loading and PCA into a single comprehensive function. This function handles all preprocessing steps from Lab 8:

"""
    prepare_pca_data(data_file; n_pcs=4)

Load streamfunction data, calculate anomalies, and perform PCA.

Returns a named tuple with:
- ψ_climatology: Mean streamfunction field
- ψ_anomaly: Anomaly fields (3D YAXArray)
- pc_scores: PC time series (ntimes × n_pcs)
- eof_patterns: Spatial patterns (reshaped for plotting)
- pca_model: Fitted PCA model
- explained_var_ratio: Variance explained by each PC
- coordinates: (lon, lat, time) for plotting
"""
function prepare_pca_data(data_file; n_pcs=4)
    # Load dataset
    ds = open_dataset(data_file)
    ψ = ds[:streamfunction] ./ 1e6

    # Convert longitude from 0-360 to -180-180
    lon_values = Array(ψ.lon)
    lon_converted = ifelse.(lon_values .> 180, lon_values .- 360, lon_values)
    ψ = DimensionalData.set(ψ, :lon => lon_converted)

    # Load into memory
    ψ = readcubedata(ψ)

    # Calculate climatology and anomalies
    ψ_climatology = mean(ψ, dims=:time)[:, :, 1]
    ψ_anomaly = ψ .- ψ_climatology

    # Convert to array and reshape for PCA
    ψ_array = Array(ψ_anomaly)
    nlons, nlats, ntimes = size(ψ_array)
    ψ_matrix = reshape(ψ_array, (nlons * nlats, ntimes))

    # Fit PCA model
    pca_model = fit(PCA, ψ_matrix; maxoutdim=n_pcs)

    # Extract PC scores and EOFs
    pc_scores = MultivariateStats.transform(pca_model, ψ_matrix)'
    eof_patterns_flat = projection(pca_model)

    # Calculate explained variance
    explained_var = principalvars(pca_model)
    total_var = var(pca_model)
    explained_var_ratio = explained_var ./ total_var

    # Reshape EOFs for plotting
    eof_patterns = [reshape(eof_patterns_flat[:, i], (nlons, nlats)) for i in 1:n_pcs]

    # Store coordinates
    coordinates = (
        lon=Array(ψ_climatology.lon),
        lat=Array(ψ_climatology.lat),
        time=Array(lookup(ψ_anomaly, :time)),
    )

    return (
        ψ_climatology=ψ_climatology,
        ψ_anomaly=ψ_anomaly,
        pc_scores=pc_scores,
        eof_patterns=eof_patterns,
        pca_model=pca_model,
        explained_var_ratio=explained_var_ratio,
        coordinates=coordinates,
    )
end

Load and process the data. We use the same NCEP streamfunction data from Lab 8:

streamfunction_file = joinpath(lab_dir, "ncep_streamfunction_regional_1979_2025.nc")
pca_data = prepare_pca_data(streamfunction_file; n_pcs=4)

Create helper function for adding boundaries:

lon_lims = extrema(pca_data.coordinates.lon)
lat_lims = extrema(pca_data.coordinates.lat)

function add_boundaries!(ax)
    poly!(ax,
        GeoMakie.to_multipoly(countries.geometry);
        color=:transparent,
        strokecolor=:black,
        strokewidth=1.5,
    )
    xlims!(ax, lon_lims...)
    ylims!(ax, lat_lims...)
end

Thus far, everything should look the same as Lab 8. You can visualize the EOF patterns and explained variance as before if you wish.
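
If you do want a quick look, here is a minimal sketch (not required) that reuses the helpers defined above, including diverging_colormap, to map the four EOF patterns with their explained variance in the panel titles:

let
    fig = Figure(size=(1200, 800))
    for i in 1:4
        row = div(i - 1, 2) + 1
        col = mod(i - 1, 2) + 1
        pct = round(100 * pca_data.explained_var_ratio[i], digits=1)
        ax = create_geoaxis(fig, (row, col); title="EOF $i ($(pct)% of variance)")
        # Symmetric color range centered on zero for each EOF
        cmap, crange = diverging_colormap(pca_data.eof_patterns[i])
        heatmap!(ax, pca_data.coordinates.lon, pca_data.coordinates.lat,
            pca_data.eof_patterns[i]; colormap=cmap, colorrange=crange)
        add_boundaries!(ax)
    end
    fig
end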

4 Hidden Markov Model Implementation

HMMs differ fundamentally from K-means by modeling temporal dependencies. While K-means asks “what spatial patterns exist?”, HMMs ask “how do weather patterns evolve in time?” The Baum-Welch algorithm learns both the emission distributions (where states live in PC space) and the transition probabilities (how states evolve day-to-day). This means HMM state assignments consider temporal context, not just instantaneous PC values.

4.1 Prepare Data as Multiple Sequences

HMMs operate on sequences of observations. We split the data into separate NDJF (November-February) seasons and treat each season as an independent sequence. For details on fitting multiple sequences, see the HiddenMarkovModels.jl documentation.

function split_into_seasons(pc_scores, times)
    sequences = Vector{Vector{Vector{Float64}}}()

    # Get unique years
    years = unique(year.(times))

    for y in years
        # NDJF spans two calendar years (Nov-Dec of y, Jan-Feb of y+1)
        season_mask = ((year.(times) .== y) .& (month.(times) .>= 11)) .|
                      ((year.(times) .== y + 1) .& (month.(times) .<= 2))

        if sum(season_mask) > 0
            season_pcs = pc_scores[season_mask, :]
            # Convert each row to a vector and collect into a vector of vectors
            season_obs = [vec(season_pcs[i, :]) for i in 1:size(season_pcs, 1)]
            push!(sequences, season_obs)
        end
    end

    # Concatenate all sequences
    obs_concat = reduce(vcat, sequences)

    # Calculate sequence end indices
    seq_ends = cumsum(length.(sequences))

    return obs_concat, seq_ends
end

obs_concat, seq_ends = split_into_seasons(pca_data.pc_scores, pca_data.coordinates.time)
println("Number of NDJF seasons: $(length(seq_ends))")
println("Total observations: $(length(obs_concat))")
println("Example season length: $(seq_ends[1]) days")
Number of NDJF seasons: 46
Total observations: 5532
Example season length: 121 days
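
Since seq_ends stores cumulative end indices, the length of each season can be recovered by differencing (a quick sketch; the final season may be shorter if the record ends mid-season):

# Recover per-season lengths from the cumulative end indices
season_lengths = diff(vcat(0, seq_ends))
println("Season lengths (min, max): ", extrema(season_lengths))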

4.2 Initialize and Train HMM

Before we can train an HMM, we need to specify its structure. An HMM has three components:

  1. the initial state distribution (what state does a sequence start in?)
  2. the transition matrix (how do states evolve from one day to the next?), and
  3. the emission distributions (what observations does each state generate?).

The Baum-Welch algorithm will learn all of these from data, but we need to provide starting values. A key modeling choice is what distribution to use for emissions. We’ll use multivariate normal distributions in PC space. The multivariate normal has nice properties: it’s fully characterized by its mean vector and covariance matrix, and it has closed-form update equations in the Baum-Welch algorithm. The mean vector defines the “center” of each weather regime in PC space, while the covariance captures how much variability there is around that center.

For the covariance structure, we have a choice between full covariance matrices (allowing correlations between different PCs within a state) and diagonal covariance matrices (assuming PCs vary independently within each state). We’ll use diagonal covariances for several reasons. First, PCs are orthogonal by construction across the entire dataset, so there’s little reason to expect strong correlations within individual states. Second, diagonal covariances reduce the number of parameters per state from $\mathcal{O}(n_{\mathrm{PCs}}^2)$ to $\mathcal{O}(n_{\mathrm{PCs}})$, making estimation more stable when data is limited. Third, they’re easier to interpret: each PC dimension has its own variance, telling us which modes are more or less variable within each regime. We could try full covariance matrices later as an experiment (a sketch of that variation appears below, just before training), but diagonal is a sensible default.

For initialization, we’ll use random starting values rather than trying to be clever. We draw initial state probabilities uniformly at random and normalize them to sum to one. We do the same for each row of the transition matrix, ensuring each row sums to one (these are conditional probabilities). For the emission distributions, we start with mean vectors drawn from $\mathcal{N}(0, 1)$ and diagonal covariances with all variances set to 1.0. These choices are arbitrary but reasonable: the Baum-Welch algorithm will find better values during training.

One caution: Baum-Welch is an expectation-maximization (EM) algorithm, which means it’s sensitive to initialization. Different random seeds can lead to different local optima. More sophisticated initialization strategies exist (like using K-means cluster centers as initial emission means), but for now we’ll keep it simple and just note that you might want to try multiple random seeds to check sensitivity.

function initialize_hmm(n_states, n_dims; seed=42)
    Random.seed!(seed)

    # Random initial state distribution (normalized)
    init_probs = rand(n_states)
    init_probs ./= sum(init_probs)

    # Random transition matrix (each row sums to 1)
    trans_probs = rand(n_states, n_states)
    trans_probs ./= sum(trans_probs, dims=2)

    # Random emission distributions (MV Normal with diagonal covariance)
    # Mean: random N(0,1), Variance: 1.0 for each dimension
    emissions = [MvNormal(randn(n_dims), Diagonal(ones(n_dims))) for _ in 1:n_states]

    return HMM(init_probs, trans_probs, emissions)
end

n_states = 6  # Following Doss-Gollin et al. (2018)
n_pcs = size(pca_data.pc_scores, 2)

hmm_init = initialize_hmm(n_states, n_pcs)
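
If you later want to experiment with the full-covariance variant mentioned above, a minimal sketch is below. The helper name initialize_hmm_fullcov is ours, not part of any package; it differs from initialize_hmm only in the emission line, where a dense identity covariance lets Baum-Welch estimate all covariance entries for each state.

using LinearAlgebra: Diagonal  # already imported above; repeated so this sketch is self-contained

function initialize_hmm_fullcov(n_states, n_dims; seed=42)
    Random.seed!(seed)

    # Same random initial distribution and transition matrix as before
    init_probs = rand(n_states)
    init_probs ./= sum(init_probs)
    trans_probs = rand(n_states, n_states)
    trans_probs ./= sum(trans_probs, dims=2)

    # Dense identity covariance: Baum-Welch then estimates full covariance matrices
    emissions = [MvNormal(randn(n_dims), Matrix(Diagonal(ones(n_dims)))) for _ in 1:n_states]

    return HMM(init_probs, trans_probs, emissions)
end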

Train the HMM using Baum-Welch algorithm:

# Train HMM on all sequences
hmm_trained, loglikelihood_evolution = baum_welch(hmm_init, obs_concat; seq_ends=seq_ends, max_iterations=100, atol=1e-3)

println("Training complete!")
println("Final log-likelihood: $(round(loglikelihood_evolution[end], digits=2))")
Training complete!
Final log-likelihood: -79392.96

Plot the training convergence (we skip the first iteration, whose log-likelihood is so low that it would distort the axis):

let
    fig = Figure(size=(800, 500))
    ax = Axis(fig[1, 1],
        xlabel="Iteration",
        ylabel="Log-Likelihood",
        title="HMM Training Convergence",
    )
    lines!(ax, 2:length(loglikelihood_evolution), loglikelihood_evolution[2:end],
        color=:steelblue, linewidth=2,
    )
    fig
end
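
Because Baum-Welch only finds a local optimum, it is worth checking how sensitive the fit is to the random initialization. The sketch below simply compares final log-likelihoods across a few seeds; Question 2 asks you to go further and compare the resulting states and transition matrices.

# Sketch: sensitivity of the fit to the random initialization.
# A higher final log-likelihood indicates a better local optimum.
for seed in (1, 2, 3)
    hmm_seed = initialize_hmm(n_states, n_pcs; seed=seed)
    _, ll_evolution = baum_welch(hmm_seed, obs_concat;
        seq_ends=seq_ends, max_iterations=100, atol=1e-3)
    println("Seed $seed: final log-likelihood = $(round(ll_evolution[end], digits=2))")
end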

4.3 Decode States with Viterbi Algorithm

Use the Viterbi algorithm to find the most likely state sequence:

# Decode all sequences at once
all_states, log_likelihoods = viterbi(hmm_trained, obs_concat; seq_ends=seq_ends)

println("State assignments for first 20 days:")
println(all_states[1:20])
println("\nNumber of sequences decoded: $(length(log_likelihoods))")
State assignments for first 20 days:
[2, 2, 2, 2, 2, 4, 4, 5, 5, 5, 6, 6, 3, 3, 1, 1, 4, 5, 5, 2]

Number of sequences decoded: 46
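
Before plotting, it is useful to tally how many days land in each state (a quick sketch):

# Count the NDJF days assigned to each state by Viterbi decoding
for k in 1:n_states
    n_k = sum(all_states .== k)
    println("State $k: $n_k days ($(round(100 * n_k / length(all_states), digits=1))%)")
end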

Visualize state assignments for one season:

let
    # Helper function to get sequence limits
    function seq_limits(seq_ends, idx)
        start_idx = idx == 1 ? 1 : seq_ends[idx-1] + 1
        end_idx = seq_ends[idx]
        return start_idx, end_idx
    end

    # Use first complete season (index 1)
    season_idx = 1
    start_idx, end_idx = seq_limits(seq_ends, season_idx)
    states = all_states[start_idx:end_idx]
    n_days = length(states)

    fig = Figure(size=(1200, 400))
    ax = Axis(fig[1, 1],
        xlabel="Day of Season",
        ylabel="Weather State",
        title="State Assignments for NDJF Season $(season_idx)",
        yticks=1:n_states)

    # Plot as step function
    stairs!(ax, 1:n_days, states, color=:steelblue, linewidth=2)

    fig
end

5 Understanding HMM Components

5.1 Transition Matrix

The transition matrix shows the probability of transitioning from one state to another. Diagonal elements represent persistence (staying in the same state).

let
    trans_matrix = transition_matrix(hmm_trained)

    fig = Figure(size=(800, 700))
    ax = Axis(fig[1, 1],
        xlabel="To State",
        ylabel="From State",
        title="Transition Probability Matrix",
        xticks=1:n_states,
        yticks=1:n_states,
        aspect=DataAspect())

    hm = heatmap!(ax, trans_matrix, colormap=:Blues, colorrange=(0, 1))

    # Add text annotations
    for i in 1:n_states
        for j in 1:n_states
            text!(ax, i, j,
                text=string(round(trans_matrix[i, j], digits=2)),
                align=(:center, :center),
                color=:black,
                fontsize=14)
        end
    end

    Colorbar(fig[1, 2], hm, label="Probability")
    fig
end
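
The diagonal translates directly into persistence: for a first-order Markov chain, the expected run length of state $i$ is $1 / (1 - p_{ii})$. A small sketch:

# Expected residence time in each state, from the diagonal of the transition matrix
let
    p_stay = diag(transition_matrix(hmm_trained))
    for (k, p) in enumerate(p_stay)
        println("State $k: P(stay) = $(round(p, digits=2)), ",
            "expected duration ≈ $(round(1 / (1 - p), digits=1)) days")
    end
end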

5.2 Emission Distributions vs Composites

This is a key conceptual point: The emission distributions are multivariate normals in PC space, while composites are spatial averages.

First, examine the emission distribution parameters:

let
    obs_dists = obs_distributions(hmm_trained)
    println("Emission distribution parameters (mean in PC space):")
    for i in 1:n_states
        μ = mean(obs_dists[i])
        σ = sqrt.(diag(cov(obs_dists[i])))
        println("\nState $i:")
        println("  Mean: ", round.(μ, digits=2))
        println("  Std:  ", round.(σ, digits=2))
    end
end
Emission distribution parameters (mean in PC space):

State 1:
  Mean: [31.24, 1.69, 0.63, -0.4]
  Std:  [13.27, 10.18, 7.89, 3.7]

State 2:
  Mean: [-3.0, 5.53, 2.68, -2.54]
  Std:  [12.27, 8.5, 6.44, 2.89]

State 3:
  Mean: [0.21, 0.48, 4.86, 5.29]
  Std:  [14.97, 10.03, 8.3, 3.19]

State 4:
  Mean: [4.15, -10.38, -2.2, -1.32]
  Std:  [13.01, 7.54, 6.8, 3.45]

State 5:
  Mean: [-8.5, 3.35, -12.01, 0.81]
  Std:  [16.41, 11.23, 5.57, 4.24]

State 6:
  Mean: [-33.54, -0.05, 3.3, -0.56]
  Std:  [13.9, 11.0, 8.92, 4.33]
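
It can also help to see where these means sit relative to one another. The sketch below scatters each state's emission mean in the PC1-PC2 plane (the printed standard deviations above give the spread around each mean):

# Sketch: emission means of the six states in the leading two PC dimensions
let
    obs_dists = obs_distributions(hmm_trained)
    fig = Figure(size=(700, 600))
    ax = Axis(fig[1, 1], xlabel="PC1 score", ylabel="PC2 score",
        title="Emission means in PC space")
    for k in 1:n_states
        μ = mean(obs_dists[k])
        scatter!(ax, [μ[1]], [μ[2]], markersize=16)
        text!(ax, μ[1], μ[2], text="  State $k", align=(:left, :center))
    end
    fig
end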

Now create spatial maps in two different ways and compare them. First, reconstruct maps from the emission distribution means (the learned PC weights). Second, create composites by averaging all actual days assigned to each state:

let
    # Load full anomaly array
    ψ_array = Array(pca_data.ψ_anomaly)
    nlons, nlats, ntimes = size(ψ_array)

    # Method 1: Reconstruct from emission means (PC weights times EOF patterns)
    obs_dists = obs_distributions(hmm_trained)
    emission_maps = zeros(nlons, nlats, n_states)

    for k in 1:n_states
        μ = mean(obs_dists[k])  # Mean PC scores for this state
        # Use MultivariateStats.reconstruct to properly inverse transform
        ψ_reconstructed = MultivariateStats.reconstruct(pca_data.pca_model, μ)
        emission_maps[:, :, k] = reshape(ψ_reconstructed, nlons, nlats)
    end

    # Symmetric color range for the reconstructed emission means
    # (these live in PCA-reconstructed space, not the original anomaly space)
    crange_emission = maximum(abs.(extrema(emission_maps)))
    crange_emission = (-crange_emission, crange_emission)

    # Grid layout (3 columns, flexible rows)
    ncols = 3
    nrows = ceil(Int, n_states / ncols)

    # Plot 1: Emission mean reconstructions
    fig1 = Figure(size=(1200, 400 * nrows))

    for k in 1:n_states
        row = div(k - 1, ncols) + 1
        col = mod(k - 1, ncols) + 1

        ax = create_geoaxis(fig1, (row, col); title="State $k: Emission Mean")
        heatmap!(ax, pca_data.coordinates.lon, pca_data.coordinates.lat,
            emission_maps[:, :, k]; colormap=cmap_diverging, colorrange=crange_emission)
        add_boundaries!(ax)
    end

    Colorbar(fig1[:, ncols+1], limits=crange_emission, colormap=cmap_diverging,
        label=L"$\psi$ Anomaly ($10^6$ m$^2$/s)")

    fig1
end

let
    # Load full anomaly array
    ψ_array = Array(pca_data.ψ_anomaly)
    nlons, nlats, ntimes = size(ψ_array)

    # Filter to NDJF seasons (matching HMM training data)
    times = pca_data.coordinates.time
    years = unique(year.(times))
    ndjf_mask = falses(length(times))

    for y in years
        season_mask = ((year.(times) .== y) .& (month.(times) .>= 11)) .|
                      ((year.(times) .== y + 1) .& (month.(times) .<= 2))
        ndjf_mask .|= season_mask
    end

    ψ_ndjf = ψ_array[:, :, ndjf_mask]

    # Composite from actual days
    composites = zeros(nlons, nlats, n_states)

    for k in 1:n_states
        cluster_days = findall(all_states .== k)
        if length(cluster_days) > 0
            cluster_data = ψ_ndjf[:, :, cluster_days]
            composites[:, :, k] = mean(cluster_data, dims=3)[:, :, 1]
        end
    end

    # Color range based on composites
    crange = maximum(abs.(extrema(composites)))
    crange = (-crange, crange)

    # Grid layout
    ncols = 3
    nrows = ceil(Int, n_states / ncols)

    # Plot 2: Composites
    fig2 = Figure(size=(1200, 400 * nrows))

    for k in 1:n_states
        row = div(k - 1, ncols) + 1
        col = mod(k - 1, ncols) + 1

        ax = create_geoaxis(fig2, (row, col);
            title="State $k: Composite (n=$(sum(all_states .== k)))")
        heatmap!(ax, pca_data.coordinates.lon, pca_data.coordinates.lat,
            composites[:, :, k]; colormap=cmap_diverging, colorrange=crange)
        add_boundaries!(ax)
    end

    Colorbar(fig2[:, ncols+1], limits=crange, colormap=cmap_diverging,
        label=L"$\psi$ Anomaly ($10^6$ m$^2$/s)")

    fig2
end

Note: Key Distinction

The emission distributions define probability densities in 4D PC space (multivariate normals). The composite maps show the average circulation pattern for all days assigned to each state.

While related, these are different representations:

  • Emission distributions tell us the typical PC values for each state
  • Composites show the typical spatial pattern in the original data space
  • Days assigned to a state have PC values drawn from the emission distribution
  • The composite averages the full anomaly fields for those days, so it can also contain variability that the retained PCs do not capture
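
One way to connect the two representations is to compare each state's emission mean with the empirical mean PC score of the days Viterbi assigned to that state. The two need not match exactly, because Viterbi assignments use temporal context, not just instantaneous PC values. A minimal sketch:

# Compare emission means with the empirical mean PC scores of assigned days
let
    obs_dists = obs_distributions(hmm_trained)
    pc_ndjf = reduce(hcat, obs_concat)'  # (n_days × n_pcs) matrix of NDJF PC scores
    for k in 1:n_states
        days_k = findall(all_states .== k)
        empirical_mean = vec(mean(pc_ndjf[days_k, :], dims=1))
        println("State $k emission mean:  ", round.(mean(obs_dists[k]), digits=2))
        println("State $k empirical mean: ", round.(empirical_mean, digits=2))
    end
end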

6 Analysis Questions

K-means assigns each day to its nearest cluster center in PC space, treating all days independently. HMMs learn both where states live in PC space (emission distributions) and how they transition over time (transition matrix). Even with random initialization, Baum-Welch discovers temporal structure if it exists in the data. The HMM might assign a day to a “suboptimal” state (in terms of PC distance alone) if doing so makes the temporal sequence more probable overall. K-means cannot do this: it fundamentally ignores temporal order.

  1. Fit HMMs with 3, 4, 5, and 7 states. How does the transition matrix structure change? Is temporal persistence stronger with fewer or more states?

  2. Run the same HMM configuration (6 states, 4 PCs) with 3 different random seeds. How different are the resulting state patterns and transition matrices?

  3. Compare results to the K-means clustering from Lab 8. Where do they agree and disagree? Why?

  4. Write code to generate synthetic weather sequences from your fitted HMM using the rand() function. Create a 120-day sequence with corresponding PC values and spatial maps for a few days. Compare state frequencies, transition frequencies, and PC distributions between real and synthetic data. (A starter sketch follows these questions.)

  5. Describe how you would extend this to predict rainfall. What emission distribution would you use? How would you validate whether HMM states relate to rainfall patterns? What are the fundamental limitations for extreme rainfall?
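
As a starting point for Question 4, the sketch below simulates one synthetic 120-day season. Per the HiddenMarkovModels.jl documentation, calling rand on an HMM returns the simulated hidden states and observations as a named tuple; check the field names against your installed version. The comparisons of state frequencies, transition frequencies, and PC distributions are left to you.

# Starter sketch for Question 4: simulate one synthetic NDJF-length season
sim = rand(hmm_trained, 120)
synthetic_states = sim.state_seq            # simulated hidden weather states
synthetic_pcs = reduce(hcat, sim.obs_seq)'  # 120 × n_pcs matrix of simulated PC scores
println("Synthetic state counts: ", [sum(synthetic_states .== k) for k in 1:n_states])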