using Dates
using MultivariateStats
using Plots
using NCDatasets
using StatsBase
using Unitful
Plots.default(; margin=4Plots.mm, size=(700, 400), linewidth=2)
Lab 06
\(K\)NN and PCA
1 Intro
Today we are going to build a simple model to simulate precipitation, conditional on temperature fields.
1.1 Why might this be useful?
Precipitation is notoriously difficult to forecast at high spatial and temporal resolution. Temperature, by contrast, is much smoother and often easier to forecast. We might imagine seeking to generate high-resolution precipitation fields from climate models that have substantial biases in representing precipitation.
1.2 Algorithm
For your reference, here is the full algorithm we will implement today. We’ll break this into bite-sized pieces for you, and then at the end we’ll implement the whole thing as a function in Julia.
- Inputs
- Past temperature fields \(T_{\text{past}}\)
- Current temperature field \(T_{\text{current}}\) for which we are making predictions
- Number of nearest neighbors \(K\)
- Number of principal components \(n_{\text{PC}}\)
- Output
- Sampled precipitation field \(P\)
- Steps
- Preprocessing:
- Preprocess the temperature fields \(T_{\text{past}}\) and \(T_{\text{current}}\) as deemed necessary.
- Document and explain the preprocessing steps and assumptions made.
- Dimensionality Reduction using PCA:
- Apply Principal Component Analysis (PCA) to the past temperature fields \(T_{\text{past}}\).
- Retain \(n_{\text{PC}}\) principal components. The choice of \(n_{\text{PC}}\) should be made and defended by the student.
- Find \(K\) Nearest Neighbors:
- Using the Euclidean distance metric, find the \(K\) nearest neighbors to the projected current temperature field from step 2.
- Sample Precipitation Field:
- For each of the \(K\) nearest neighbors, compute the inverse distance as \(w_i = \frac{1}{\text{distance}_i}\) where \(\text{distance}_i\) is the distance to the \(i^{th}\) neighbor.
- This is sometimes called the “kernel” and there are other valid choices (from simple – like distance squared – to more complex)
- Normalize the weights such that they sum to 1: \(w_i = \frac{w_i}{\sum_{j=1}^{K} w_j}\).
- Sample a precipitation field \(P\) from the \(K\) nearest neighbors with probability proportional to the normalized inverse distances \(w_i\).
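As a sanity check on the weighting step, here is a minimal Julia sketch with made-up distances (the values are for illustration only):

```julia
# Made-up distances to K = 3 nearest neighbors, for illustration only
distances = [0.4, 0.6, 1.4]

w = 1 ./ distances   # inverse-distance "kernel"
w = w ./ sum(w)      # normalize so the weights sum to 1

round.(w; digits=2)  # [0.51, 0.34, 0.15]
```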
1.3 Data
We will use the ERA5 reanalysis dataset, a global dataset of atmospheric variables, for temperature. We use the same precipitation dataset as Lab 05. This data is available to enrolled students on Canvas as precip.nc and temp.nc.
1.4 Setup
Remember to git clone the repository to your machine, then activate and instantiate the environment, as we have done in previous labs. Do not use additional packages for this assignment, though you are welcome to look at the documentation or implementation of other packages for nearest-neighbors methods.
2 Data
2.1 Precipitation
We start by loading the precipitation data. Create variables called precip_time, precip_lon, precip_lat, and precip. precip should be a 3D array with appropriate units added. Remember to close the file when you have read the data.
Review the lab 05 solutions posted on Canvas to make sure you have the right syntax. Additionally, if the latitudes are flipped, you need to reverse them (also in the lab 05 solutions!)
precip_ds = NCDataset("data/precip.nc")
# ...
2.2 Temperature
Next we load the temperature data. Create variables called temp_time, temp_lon, temp_lat, and temp. temp should have appropriate units.
Make sure that temp_time and precip_time are the same! If so, you can create a variable called time that is equal to temp_time.
2.3 Split
We will split the data into training and testing sets (we will cover this idea in more detail later). The idea is to use the training set to build the model, and the testing set to evaluate the model.
We will use the last 10 years of data as the testing set, and the rest as the training set. Create variables called precip_train, precip_test, temp_train, temp_test, time_train, and time_test.
2.3.1 Preprocess
Review the slides on PCA and decide whether you would like to preprocess the temperature data before applying PCA. Explain what you have done and why.
Next, we need to convert the temperature data to a matrix. You can use the reshape function to do this.1 Call your matrix temp_mat.
Put all these steps into a function called preprocess that takes in:
- the raw field of temperature data (indexed lon, lat, time)
- the raw field of temperature data (indexed lon, lat, time) corresponding to the “reference period” over which climatology is returned
- any other inputs you would like (e.g., a vector of times for the reference and target periods)
and returns the matrix of preprocessed temperature data.
function preprocess(temp::Array{T, 3}, temp_reference::Array{T, 3})::AbstractMatrix where T
# ...
end
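For concreteness, here is one possible way to fill in the template above. This is an assumption, not the required answer: it removes each grid cell's climatology computed over the reference period and flattens each field into a column. Whether you standardize further, or preprocess differently, is your design choice to defend.

```julia
using Statistics

# One possible preprocess (a sketch, not the required answer):
# subtract each grid cell's time-mean over the reference period,
# then reshape so each column is one flattened (lon × lat) field.
function preprocess(temp::Array{T,3}, temp_reference::Array{T,3})::AbstractMatrix where {T}
    n_lon, n_lat, n_time = size(temp)
    climatology = mean(temp_reference; dims=3)  # (n_lon, n_lat, 1)
    anomaly = temp .- climatology               # broadcasts over time
    return reshape(anomaly, n_lon * n_lat, n_time)
end
```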
3 Principal components
We will use the implementation in MultivariateStats.jl to perform PCA. The key functions to know are:
- fit: use the syntax M = fit(PCA, X_train). The model will automatically choose how many principal components to use, but you can modify this with the pratio and maxoutdim keyword arguments.
  - To make plots of variance explained, the principalvars(M), var(M), and cumsum functions may be helpful 😉
  - Note: the documentation says: let (d, n) = size(X) be respectively the input dimension and the number of observations. This is the opposite of what we have been doing, so you will need to transpose the data.
- predict: use the syntax X_pca = predict(M, X). This will project the data X onto the PCA basis.
- projection: returns a matrix where each column is a principal component loading vector and each row is a grid cell. Use the syntax P = projection(M).

See the documentation here for details.
3.1 Fitting
Following the instructions provided in the documentation, fit the PCA model to the training data. Choose how many dimensions you are using and explain why you chose that number. Create any plots to help inform this choice.
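One way to inform this choice, sketched on synthetic data (in the lab, X_train would be your preprocessed temperature matrix, transposed so that columns are observations; the sizes here are made up):

```julia
using MultivariateStats

# Synthetic stand-in: d = 12 grid cells, n = 200 observations (d × n)
X_train = randn(12, 200)

M = fit(PCA, X_train; maxoutdim=5)

# Fraction of total variance captured by each retained component,
# and the cumulative fraction (useful for a variance-explained plot)
explained = principalvars(M) ./ var(M)
cum_explained = cumsum(explained)
```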
3.2 Visualizing
- Plot the spatial patterns associated with the first two principal components. You’ll need to reshape the data back to its original shape.
- You do not need to add map features. The best package for mapping is GeoMakie, which has a bit of a new plotting syntax and is still a little rough around the edges, though improving rapidly and great in many ways.
- Plot the time series of the first two principal components. (We have the actual times, called time, so the \(x\) axis should show the actual time.)
- Make a scatter plot of the first two principal components, where each point is a day. Color the points by the precipitation value at that time.
  - Define “the precipitation value” however you like – it may be the precipitation at a particular point or a spatial mean.
  - Use the syntax scatter(x, y, zcolor=z) to plot the points, where x and y are the first two principal components and z is the precipitation value.
4 \(K\)-NN
\(K\) nearest neighbors is a simple prediction algorithm. We will use a resampling-based version of KNN.
We have described the algorithm above; Here’s an example:
- Our dataset is \(X = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\).
- Our new value is \(X_i = 5.4\).
- The 3 nearest neighbors to \(X_i\) are \([5, 6, 4]\).
- The corresponding distances are \([0.4, 0.6, 1.4]\).
- Thus, we return a sample from \([5, 6, 4]\) with associated probabilities proportional to \([\frac{1}{0.4}, \frac{1}{0.6}, \frac{1}{1.4}]\). This works out to probabilities of approximately \([0.51, 0.34, 0.15]\).
- Instead of returning \(5\), \(6\), or \(4\), it is often advantageous to return the corresponding indices: 7, 8, or 6.
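The worked example above can be reproduced in a few lines of Base Julia:

```julia
X = collect(-1:10)   # the dataset
x_new = 5.4          # the new value

dists = abs.(X .- x_new)
neighbors = sortperm(dists)[1:3]   # indices of the 3 nearest neighbors
w = 1 ./ dists[neighbors]          # inverse-distance weights
p = w ./ sum(w)                    # normalized sampling probabilities

neighbors             # [7, 8, 6], i.e. the values X[neighbors] are [5, 6, 4]
round.(p; digits=2)   # [0.51, 0.34, 0.15]
```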
Write a function in Julia to perform generic \(K\) nearest neighbor prediction. As the function signature implies, the first argument should be the training data, the second should be the new point to predict, and the third should be the number of neighbors to use. The function should return a tuple containing the indices of the neighbors and the associated probabilities.
function knn(X::AbstractMatrix, X_i::AbstractVector, K::Int)::Tuple{AbstractVector{Int},AbstractVector}
# ...
end
In our example, X would be the principal component loading matrix for the training data, X_i would be the principal component loading vector for the point we are trying to predict, and K would be the number of neighbors to use.
You may find the sample function from StatsBase to be helpful. You can define weights and use it for weighted sampling:
w = Weights(rand(10))
sample(1:10, w)
5 Combining
If we put these two pieces together, we have a simple yet powerful simulation tool. Now we will implement the full algorithm as a function.
"""
KNN resampling algorithm
temp_train: the training temperature data. This should be the raw field, not the prinipal components. Inside the function, convert to principal components using `n_pca` principal components.
temp_test: the temperature data to predict. This should be the raw field, not the prinipal components. Inside the function, convert to principal components using `n_pca` principal components.
precip_train: the training precipitation data.
"""
function predict_knn(temp_train, temp_test, precip_train; n_pca::Int)
    temp_train = preprocess(temp_train) # you have defined this function above
    temp_test = preprocess(temp_test)   # adjust arguments as needed

    # fit the PCA model to the training data
    # project the test data onto the PCA basis
    # use the `knn` function for each point in the test data
    # return a matrix of predictions
end
Now, you can run this function on several days of data and plot the results. How can you visualize your results?
Footnotes
Access documentation by typing ?reshape in the REPL.↩︎