Hyperparameter tuning


Lecture

2023-10-20

Theory

Today

  1. Theory

  2. Practice

  3. Wrapup

Parametric function approximation

  1. Today, we continue our “regression” example, which is a supervised learning problem.
  2. Recall: we model an unknown function \(f\) using some parameters \(\theta\). Thus, finding \(\hat{f}\) is equivalent to choosing appropriate \(\theta\) (see the example below this list).
  3. Ideally, we want to maximize performance on new data.
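
For example, in least-squares regression we choose

\(\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \big( y_i - f(x_i; \theta) \big)^2\)

and then set \(\hat{f}(x) = f(x; \hat{\theta})\). Note that this objective only measures fit to the training data, which is why performance on new data can tell a different story.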

Overfitting

[Figure: overfitting illustration, from MATLAB/Simulink documentation]

Hyperparameters

Many ML models (like a Random Forest) have higher-level settings, usually called “hyperparameters,” that sit above the parameters learned during training (see the sketch after this list)

  • When we “fit” a model, we are finding the best parameters
    • where to partition the region
    • i.e., the best tree
  • But, we also have to choose the number of trees
    • this is a hyperparameter
  • Hyperparameters are not optimized during model training
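
A minimal sketch of the distinction, assuming the MLJ interface we use in the practice section below (X and y are placeholders for a feature table and target vector):

using MLJ  # assumes MLJ and MLJDecisionTreeInterface are installed

RandomForestRegressor = @load RandomForestRegressor pkg = DecisionTree

model = RandomForestRegressor(n_trees=100)  # hyperparameter: chosen by us before training
mach = machine(model, X, y)                 # X, y: placeholder feature table and target
fit!(mach)                                  # training finds the parameters (the trees)
fitted_params(mach)                         # inspect the learned parameters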

Cross-validation

Key idea: we want to evaluate the model on data that was not used to fit the model (out of sample)

  1. Split the data into \(K\) folds
  2. For \(k=1, \ldots, K\)
    1. Fit the model on all folds except the \(k\)th
    2. Evaluate the model on the \(k\)th fold
  3. Average the performance across all folds

Cross-validation helps to reduce the variance of estimated model performance. However, cross-validated performance estimates are still biased: essentially, you can overfit the hyperparameters.
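
A minimal sketch of this recipe in plain Julia; my_fit and my_score are hypothetical stand-ins for your training and scoring functions (in the practice section, MLJ's CV resampling automates all of this):

using Random, Statistics

function kfold_cv(my_fit, my_score, X, y; K=5, rng=Random.default_rng())
    idx = shuffle(rng, 1:length(y))            # shuffle row indices
    folds = [idx[k:K:end] for k in 1:K]        # K roughly equal folds
    scores = map(1:K) do k
        test = folds[k]
        train = setdiff(idx, test)
        model = my_fit(X[train, :], y[train])  # fit on all folds except the kth
        my_score(model, X[test, :], y[test])   # evaluate on the kth fold
    end
    return mean(scores)                        # average performance across folds
end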

Train-test split

We want to know how well the model performs on new data!

  1. Split the data into a training set and a test set
    1. Often 80-20 or 70-30
    2. For spatially or temporally structured data, structured splits are essential (see the sketch after this list)
  2. Fit the model on the training set
    1. Including any cross-validation
  3. Evaluate the model on the test set as a final step
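
For temporally structured data, for example, a simple chronological split (a sketch, assuming rows are already sorted by time) avoids leaking future information into the training set:

# assumes rows of X and y are ordered by time; do NOT shuffle
n = length(y)
cut = floor(Int, 0.7 * n)          # 70-30 split
train, test = 1:cut, (cut + 1):n   # train on the past, test on the future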

Practice

Today

  1. Theory

  2. Practice

  3. Wrapup

Data

# packages used throughout the practice section
using RDatasets, DataFrames, MLJ, Plots

data = dataset("ISLR", "Hitters")
dropmissing!(data, :Salary)
numerical_cols = [col for col in names(data) if eltype(data[!, col]) <: Number]
data = data[:, numerical_cols]
describe(data)
17×7 DataFrame
 Row │ variable  mean        min   median  max      nmissing  eltype
     │ Symbol    Float64     Real  Float64 Real     Int64     DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ AtBat      403.643    19     413.0    687           0  Int32
   2 │ Hits       107.829     1     103.0    238           0  Int32
   3 │ HmRun       11.6198    0       9.0     40           0  Int32
   4 │ Runs        54.7452    0      52.0    130           0  Int32
   5 │ RBI         51.4867    0      47.0    121           0  Int32
   6 │ Walks       41.1141    0      37.0    105           0  Int32
   7 │ Years        7.31179   1       6.0     24           0  Int32
   8 │ CAtBat    2657.54     19    1931.0  14053           0  Int32
   9 │ CHits      722.186     4     516.0   4256           0  Int32
  10 │ CHmRun      69.2395    0      40.0    548           0  Int32
  11 │ CRuns      361.221     2     250.0   2165           0  Int32
  12 │ CRBI       330.418     3     230.0   1659           0  Int32
  13 │ CWalks     260.266     1     174.0   1566           0  Int32
  14 │ PutOuts    290.711     0     224.0   1377           0  Int32
  15 │ Assists    118.76      0      45.0    492           0  Int32
  16 │ Errors       8.59316   0       7.0     32           0  Int32
  17 │ Salary     535.926    67.5   425.0   2460.0         0  Float64

Partition data

Salary is the target, everything else is a feature

y, X = unpack(data, ==(:Salary); rng=123)
  1. The rng keyword shuffles the rows; the 70% training / 30% testing split happens below via partition

Load the model

RandomForestRegressor = @load RandomForestRegressor pkg = DecisionTree
model = RandomForestRegressor()
import MLJDecisionTreeInterface ✔
RandomForestRegressor(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = -1, 
  n_trees = 100, 
  sampling_fraction = 0.7, 
  feature_importance = :impurity, 
  rng = Random._GLOBAL_RNG())

Train the model

mach = machine(model, X, y)
train, test = partition(eachindex(y), 0.7; shuffle=true, rng=123)
fit!(mach; rows=train)
trained Machine; caches model-specific representations of data
  model: RandomForestRegressor(max_depth = -1, …)
  args: 
    1:  Source @204 ⏎ Table{AbstractVector{Count}}
    2:  Source @693 ⏎ AbstractVector{Continuous}

Predict on the test set

y_pred_test = predict(mach, X[test, :])
rms_test = root_mean_squared_error(y_pred_test, y[test])

y_pred_train = predict(mach, X[train, :])
rms_train = root_mean_squared_error(y_pred_train, y[train])

rms_train, rms_test
(160.31411979262353, 188.18990826667437)

Visualize

ps = map(zip([train, test], ["Train", "Test"])) do (idx, name)
    scatter(
        y[idx],
        predict(mach, X[idx, :]);
        xlabel="Actual",
        ylabel="Predicted",
        label="Model",
        title=name,
        legend=:bottomright
    )
    Plots.abline!(1, 0; label="1:1 line")
end
plot(ps...; link=:both, size=(1000, 500))

Tuning

n_trees_range = range(model, :n_trees; lower=10, upper=150, scale=:log10)
n_subfeatures_range = range(model, :n_subfeatures; lower=1, upper=size(X, 2))

tuning = TunedModel(;
    model=model,
    tuning=Grid(; goal=25),  # Using a grid search with 25 points in total
    resampling=CV(; nfolds=5, rng=123),  # 5-fold cross-validation
    measure=root_mean_squared_error,  # Evaluation metric
    ranges=[n_trees_range, n_subfeatures_range]
)

Fit and ID best model

tuned_mach = machine(tuning, X, y)
fit!(tuned_mach; rows=train)
best_model = fitted_params(tuned_mach).best_model

println("Best n_trees: ", best_model.n_trees)
println("Best n_subfeatures: ", best_model.n_subfeatures)
Best n_trees: 10
Best n_subfeatures: 12
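
To see what the grid search actually tried, we can inspect the tuning report (a sketch; report is part of the MLJ API, though the exact field names of the tuning report can vary across MLJTuning versions):

r = report(tuned_mach)
r.best_history_entry   # cross-validated measurement for the best model
length(r.history)      # 25: one entry per grid point evaluated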

Fit the best model and make predictions

best_mach = machine(best_model, X, y)
fit!(best_mach; rows=train)

y_pred_test2 = predict(best_mach, X[test, :])
rms_test2 = root_mean_squared_error(y_pred_test2, y[test])

y_pred_train2 = predict(best_mach, X[train, :])
rms_train2 = root_mean_squared_error(y_pred_train2, y[train])

rms_train2, rms_test2
(186.9403933790913, 193.79639432645558)

Important

We have achieved a much smaller train-test gap (roughly 7 vs. 28 RMSE units), but the tuned model performs worse on the training set and slightly worse on the test set! Cross-validated tuning can reduce overfitting, but it does not guarantee better out-of-sample performance.

More resources

  • MLJ Docs – maintained by the JuliaAI organization / Alan Turing Institute
  • ScikitLearn.jl Docs – a Julia package that wraps Scikit-Learn (Python) models behind the familiar Scikit-Learn interface

Wrapup

Today

  1. Theory

  2. Practice

  3. Wrapup

Project

Posted on Canvas.

If you’re not sure what problem to tackle, some ideas are:

  • “Downscaling”: Map coarse resolution data to fine resolution
  • “Forecasting”: Predict future values of a variable

Questions?

Exams

Exam 1: I’m working through your corrections

Exam 2: next Friday!

Means with missing values

Let’s say you have an array with some missing values, and you want to take the mean across a particular dimension. How can you do this?

using Random, Statistics  # for shuffle and mean

X = convert(Array{Union{Float64,Missing},3}, rand(11, 10, 9))

K = 10
X[shuffle(1:length(X))[1:K]] .= missing
10-element view(reshape(::Array{Union{Missing, Float64}, 3}, 990), [12, 668, 692, 312, 248, 801, 427, 782, 247, 346]) with eltype Union{Missing, Float64}:
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing

We’d like to do this, but we’ll get missing values wherever a slice along the third dimension contains any missing entry:

X_mean = mean(X; dims=3)
sum(ismissing.(X_mean))
9

Instead, we can do:

X_mean = [mean(skipmissing(X[i, j, :])) for i in axes(X, 1), j in axes(X, 2)]
@assert size(X_mean) == (11, 10)
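
One caveat (a sketch, assuming some (i, j) slice could be entirely missing): skipmissing over an all-missing slice yields an empty collection, and mean of an empty collection throws an error, so a guard may be warranted:

# guard against slices that are entirely missing
X_mean = [
    begin
        v = collect(skipmissing(X[i, j, :]))
        isempty(v) ? missing : mean(v)   # fall back to missing for empty slices
    end
    for i in axes(X, 1), j in axes(X, 2)
]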