Lab 08: \(K\)-Means Clustering

Module 3
Labs
Published

Fri., Nov. 17

1 K-Means Clustering

\(K\)-Means Clustering is a widely used unsupervised machine learning algorithm for partitioning datasets into distinct, non-overlapping groups, or ‘clusters’. We’ve seen it in the context of regional frequency analysis.

1.1 Algorithm

  • Inputs: \(K\) (number of clusters), \(\vb{x} = \{x_1, x_2, \ldots, x_n\}\) (data points)
  • Outputs: \(\vb{c} = \{c_1, c_2, \ldots, c_n\}\) (cluster assignments), \(\vb{\mu} = \{\mu_1, \mu_2, \ldots, \mu_K \}\) (cluster centroids)
  • Steps:
    1. Randomly initialize \(K\) cluster centers: \(\vb{\mu}^{(0)} = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_K^{(0)}\}\), with each \(\mu_k^{(0)} \in \mathbb{R}^d\)
    2. Iterate until convergence:
      1. Assign each observation \(x_i\) to the closest (in Euclidean distance) centroid from the previous iteration: \[c_i^{(j)} = \arg\min_{k=1, \ldots, K} \|x_i - \mu_k^{(j-1)} \|\]
      2. Recompute each \(\mu_k^{(j)}\) as the mean of all points assigned to cluster \(k\)
      3. Terminate when the total change of the cluster centroids satisfies \[ \sum_{k=1}^K \| \mu_k^{(j)} - \mu_k^{(j-1)} \| < \tau\]
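
A check_convergence function implementing a stopping rule like this is already included in kmeans.jl (you will see it echoed after the include below). Purely as a sketch, here is one way the criterion above could translate into code, assuming μ_history is a vector of the \(K \times d\) centroid matrices from successive iterations:

function check_convergence(μ_history, τ)
    length(μ_history) < 2 && return false   # need at least two iterates to compare
    μ_new, μ_old = μ_history[end], μ_history[end-1]
    # total displacement of the centroids between consecutive iterations
    total_shift = sum(sqrt(sum((μ_new[k, :] .- μ_old[k, :]) .^ 2)) for k in 1:size(μ_new, 1))
    return total_shift < τ
end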

2 Instructions

using CSV
using DataFrames
using Plots
using StatsBase: mean, std

We will work in an external script. Open the file kmeans.jl and edit the functions provided. It’s a Julia file, so you can run it line by line and work in the REPL.

To make the functions you create available here, run the following command:

include("kmeans.jl")
check_convergence (generic function with 1 method)

2.1 Initialize Centroids

First, edit the init_centroids function. It takes in a matrix \(X_{n \times d}\), indexed by \(n\) observations and \(d\) features, and returns a matrix with \(k\) rows (one for each centroid) and \(d\) columns (one for each feature). The code provided initializes each centroid to a random value.

You can change this to whatever you like, but be sure to explain your reasoning. One common approach, sketched below, is to choose \(k\) random observations from the dataset as your initial centroids. Make sure that your centroids are distinct!
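
For example, a minimal sketch of that strategy, assuming the init_centroids(X, k) interface described above (this replaces the random-value initialization, and is only one reasonable choice):

using Random: randperm

# Pick k distinct rows of X at random to use as the initial centroids,
# so no two initial centroids coincide.
function init_centroids(X::AbstractMatrix, k::Int)
    n = size(X, 1)
    idx = randperm(n)[1:k]   # k distinct row indices
    return X[idx, :]         # k × d matrix of initial centroids
end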

2.2 Euclidean Distance

In order to assign observations to clusters, we need to be able to compute the distance between an observation and a centroid. We will use the Euclidean distance, which appears in the assignment step above. This function should take in two generic vectors and return a scalar.
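
A one-liner is enough. For instance (the name euclidean_distance is illustrative; match whatever the stub in kmeans.jl is called):

# Euclidean distance between two vectors of equal length.
euclidean_distance(a::AbstractVector, b::AbstractVector) = sqrt(sum((a .- b) .^ 2))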

2.3 Assign Clusters

There is just one line of code to edit here.

The argmin function may be your friend.
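
As a sketch, the finished function might look like the following (the stub in kmeans.jl supplies the surrounding structure, so only the argmin line is yours to fill in; the euclidean_distance helper name is assumed from the previous section):

# For each observation, compute its distance to every centroid and keep the
# index of the nearest one.
function assign_clusters(X::AbstractMatrix, μ::AbstractMatrix)
    return [argmin([euclidean_distance(X[i, :], μ[j, :]) for j in 1:size(μ, 1)]) for i in 1:size(X, 1)]
end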

2.4 Update Centroids

As you loop through the algorithm, you will need to update the centroids. This function takes in the data matrix \(X\), the cluster assignments \(\vb{c}\) (a vector of integers), and the number of clusters \(k\). It returns a matrix with \(k\) rows (one for each centroid) and \(d\) columns (one for each feature).
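
A sketch under that interface (note that it does not handle the edge case of a cluster with no assigned points):

using StatsBase: mean

# Recompute each centroid as the feature-wise mean of the observations assigned
# to it. c is a length-n vector of labels in 1:k; the result is k × d.
function update_centroids(X::AbstractMatrix, c::AbstractVector{<:Integer}, k::Int)
    d = size(X, 2)
    μ = zeros(k, d)
    for j in 1:k
        members = X[c .== j, :]               # observations in cluster j
        μ[j, :] = vec(mean(members; dims=1))  # their feature-wise mean
    end
    return μ
end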

2.5 \(K\)-means algorithm

This function is provided for you. You do not need to edit it. You simply need to define all the functions it calls.

function kmeans(X::AbstractMatrix, k::Int; τ=1e-5, maxiter=500)
    n, d = size(X) # get the number of observations and features

    # initialize the cluster centroids (μ)
    μ = init_centroids(X, k)
    μ_history = [μ]

    is_converged = false # initialize the flag
    j = 1 # initialize the counter

    # go through the loop until convergence is reached
    while !is_converged
        cluster_assignments = assign_clusters(X, μ)
        μ = update_centroids(X, cluster_assignments, k) # update the centroids

        # add the current centroids to the history
        push!(μ_history, μ)

        # check for convergence
        is_converged = check_convergence(μ_history, τ)

        # if convergence seems unlikely, stop
        if j > maxiter
            @warn "Failed to converge after $j iterations"
            return cluster_assignments, μ_history
        end

        # increase the counter
        j += 1
    end

    cluster_assignments = assign_clusters(X, μ)

    return cluster_assignments, μ_history
end

3 Application

Our input data for this clustering analysis will be stations from the GHCND dataset (original here). We will subset only stations in Texas, and we will cluster on their longitude and latitude.

# Define a function to parse each line
function parse_line(line)
    station = strip(line[1:11])                # Station ID
    latitude = parse(Float64, strip(line[13:20]))  # Latitude
    longitude = parse(Float64, strip(line[22:30])) # Longitude
    elevation = parse(Float64, strip(line[32:37])) # Elevation
    state = strip(line[39:40])                 # State Abbreviation (if present)
    name = strip(line[41:end])                 # Station Name
    return (station, latitude, longitude, elevation, state, name)
end

# Read the file and process each line
function read_file(filename)
    data = []
    open(filename) do file
        for line in eachline(file)
            push!(data, parse_line(line))
        end
    end
    return DataFrame(data, [:Station, :Latitude, :Longitude, :Elevation, :State, :Name])
end

# Usage
filename = "data/ghcnd_stations.txt"
stations = read_file(filename)
stations = stations[stations[!, "State"].=="TX", :]

describe(stations)
6×7 DataFrame
 Row │ variable   mean      min          median   max             nmissing  eltype
     │ Symbol     Union…    Any          Union…   Any             Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ Station              US1TXAC0002           USW00093987            0  SubString{String}
   2 │ Latitude   30.8044   25.8523      30.4333  36.473                 0  Float64
   3 │ Longitude  -98.2393  -106.627     -97.9    -93.5653               0  Float64
   4 │ Elevation  307.449   -999.9       214.6    2363.7                 0  Float64
   5 │ State                TX                    TX                     0  SubString{String}
   6 │ Name                 ABERNATHY             ZEPHYR 2.2 ESE         0  SubString{String}

Now we can run the clustering algorithm we’ve implemented:

X = Matrix(stations[!, [:Latitude, :Longitude]])
K = 10 # choose your own!
cluster_assignments, μ_history = kmeans(X, K)
┌ Warning: Failed to converge after 501 iterations
└ @ Main In[4]:24
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [[0.026492892069803764 0.6719695313458847; 0.6125376941428193 0.42399726096521995; … ; 0.8032262774466873 0.0011165559861527896; 0.9702552148264244 0.4853642473133476], …])

4 Analysis

Once your code appears to be working:

  1. Plot your cluster assignments on a map (a plotting sketch follows this list). Do they look logical? What does / does not make sense to you?
  2. Check for consistency by re-running your code and comparing the plots. Do they look the same? Why or why not?
  3. Plot the (two-dimensional) cluster centroids as a function of the number of iterations.
  4. Try different values of \(k\). How does the clustering change? What is the best value of \(k\) in your opinion? How could you determine this?
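
For the first task, a minimal sketch using the Plots.jl scatter function (assuming the stations DataFrame and cluster_assignments computed above; the group keyword colors the points by cluster):

scatter(stations.Longitude, stations.Latitude;
    group=cluster_assignments, markersize=3, legend=false,
    xlabel="Longitude", ylabel="Latitude",
    title="GHCND stations in Texas, colored by cluster")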

If you have extra time, alter other parts of the code (e.g., use elevation to cluster, or use a different distance metric). How does this change your results?
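
For instance, a Manhattan (L1) distance you could substitute for the Euclidean one (a hypothetical variant, not part of the provided code):

# Sum of absolute coordinate differences rather than the Euclidean norm.
manhattan_distance(a::AbstractVector, b::AbstractVector) = sum(abs.(a .- b))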