Hyperparameter tuning


Lecture

2023-10-20

Theory

Today

  1. Theory

  2. Practice

  3. Wrapup

Parametric function approximation

  1. Today, we continue our “regression” example, which is a supervised learning problem.
  2. Recall: we model an unknown function \(f\) using some parameters \(\theta\). Thus, finding \(\hat{f}\) is equivalent to choosing appropriate \(\theta\) (see the example below this list).
  3. Ideally, we want to maximize performance on new data.
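
For example, in least-squares regression we choose

\(\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \big( y_i - f(x_i; \theta) \big)^2\)

and then set \(\hat{f}(x) = f(x; \hat{\theta})\). Note that this objective only measures fit to the training data, which is why performance on new data can tell a different story.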

Overfitting

[Figure: overfitting illustration, from MATLAB/Simulink documentation]

Hyperparameters

Many ML models (like a Random Forest) have higher-level settings, usually called “hyperparameters,” that sit above the parameters learned during training (see the sketch after this list)

  • When we “fit” a model, we are finding the best parameters
    • where to partition the region
    • i.e., the best tree
  • But, we also have to choose the number of trees
    • this is a hyperparameter
  • Hyperparameters are not optimized during model training
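
A minimal sketch of the distinction, assuming the MLJ interface we use in the practice section below (X and y are placeholders for a feature table and target vector):

using MLJ  # assumes MLJ and MLJDecisionTreeInterface are installed

RandomForestRegressor = @load RandomForestRegressor pkg = DecisionTree

model = RandomForestRegressor(n_trees=100)  # hyperparameter: chosen by us before training
mach = machine(model, X, y)                 # X, y: placeholder feature table and target
fit!(mach)                                  # training finds the parameters (the trees)
fitted_params(mach)                         # inspect the learned parameters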

Cross-validation

Key idea: we want to evaluate the model on data that was not used to fit the model (out of sample)

  1. Split the data into \(K\) folds
  2. For \(k=1, \ldots, K\)
    1. Fit the model on all folds except the \(k\)th
    2. Evaluate the model on the \(k\)th fold
  3. Average the performance across all folds

Cross-validation helps to reduce the variance of estimated model performance. However, cross-validated performance estimates are still biased: essentially, you can overfit the hyperparameters.
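
A minimal sketch of this recipe in plain Julia; my_fit and my_score are hypothetical stand-ins for your training and scoring functions (in the practice section, MLJ's CV resampling automates all of this):

using Random, Statistics

function kfold_cv(my_fit, my_score, X, y; K=5, rng=Random.default_rng())
    idx = shuffle(rng, 1:length(y))            # shuffle row indices
    folds = [idx[k:K:end] for k in 1:K]        # K roughly equal folds
    scores = map(1:K) do k
        test = folds[k]
        train = setdiff(idx, test)
        model = my_fit(X[train, :], y[train])  # fit on all folds except the kth
        my_score(model, X[test, :], y[test])   # evaluate on the kth fold
    end
    return mean(scores)                        # average performance across folds
end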

Train-test split

We want to know how well the model performs on new data!

  1. Split the data into a training set and a test set
    1. Often 80-20 or 70-30
    2. For spatially or temporally structured data, structured splits are essential (see the sketch after this list)
  2. Fit the model on the training set
    1. Including any cross-validation
  3. Evaluate the model on the test set as a final step
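
For temporally structured data, for example, a simple chronological split (a sketch, assuming rows are already sorted by time) avoids leaking future information into the training set:

# assumes rows of X and y are ordered by time; do NOT shuffle
n = length(y)
cut = floor(Int, 0.7 * n)          # 70-30 split
train, test = 1:cut, (cut + 1):n   # train on the past, test on the future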

Practice

Today

  1. Theory

  2. Practice

  3. Wrapup

Data

# packages used throughout the practice section
using RDatasets, DataFrames, MLJ, Plots

data = dataset("ISLR", "Hitters")
dropmissing!(data, :Salary)
numerical_cols = [col for col in names(data) if eltype(data[!, col]) <: Number]
data = data[:, numerical_cols]
describe(data)
17×7 DataFrame
 Row │ variable  mean        min   median  max      nmissing  eltype
     │ Symbol    Float64     Real  Float64 Real     Int64     DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ AtBat      403.643    19     413.0    687           0  Int32
   2 │ Hits       107.829     1     103.0    238           0  Int32
   3 │ HmRun       11.6198    0       9.0     40           0  Int32
   4 │ Runs        54.7452    0      52.0    130           0  Int32
   5 │ RBI         51.4867    0      47.0    121           0  Int32
   6 │ Walks       41.1141    0      37.0    105           0  Int32
   7 │ Years        7.31179   1       6.0     24           0  Int32
   8 │ CAtBat    2657.54     19    1931.0  14053           0  Int32
   9 │ CHits      722.186     4     516.0   4256           0  Int32
  10 │ CHmRun      69.2395    0      40.0    548           0  Int32
  11 │ CRuns      361.221     2     250.0   2165           0  Int32
  12 │ CRBI       330.418     3     230.0   1659           0  Int32
  13 │ CWalks     260.266     1     174.0   1566           0  Int32
  14 │ PutOuts    290.711     0     224.0   1377           0  Int32
  15 │ Assists    118.76      0      45.0    492           0  Int32
  16 │ Errors       8.59316   0       7.0     32           0  Int32
  17 │ Salary     535.926    67.5   425.0   2460.0         0  Float64

Partition data

Salary is the target, everything else is a feature

y, X = unpack(data, ==(:Salary); rng=123)
  1. The rng keyword shuffles the rows; the 70% training / 30% testing split happens below via partition

Load the model

RandomForestRegressor = @load RandomForestRegressor pkg = DecisionTree
model = RandomForestRegressor()
import MLJDecisionTreeInterface ✔
RandomForestRegressor(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = -1, 
  n_trees = 100, 
  sampling_fraction = 0.7, 
  feature_importance = :impurity, 
  rng = Random._GLOBAL_RNG())

Train the model

mach = machine(model, X, y)
train, test = partition(eachindex(y), 0.7; shuffle=true, rng=123)
fit!(mach; rows=train)
trained Machine; caches model-specific representations of data
  model: RandomForestRegressor(max_depth = -1, …)
  args: 
    1:  Source @204 ⏎ Table{AbstractVector{Count}}
    2:  Source @693 ⏎ AbstractVector{Continuous}

Predict on the test set

y_pred_test = predict(mach, X[test, :])
rms_test = root_mean_squared_error(y_pred_test, y[test])

y_pred_train = predict(mach, X[train, :])
rms_train = root_mean_squared_error(y_pred_train, y[train])

rms_train, rms_test
(160.31411979262353, 188.18990826667437)

Visualize

ps = map(zip([train, test], ["Train", "Test"])) do (idx, name)
    scatter(
        y[idx],
        predict(mach, X[idx, :]);
        xlabel="Actual",
        ylabel="Predicted",
        label="Model",
        title=name,
        legend=:bottomright
    )
    Plots.abline!(1, 0; label="1:1 line")
end
plot(ps...; link=:both, size=(1000, 500))

Tuning

n_trees_range = range(model, :n_trees; lower=10, upper=150, scale=:log10)
n_subfeatures_range = range(model, :n_subfeatures; lower=1, upper=size(X, 2))

tuning = TunedModel(;
    model=model,
    tuning=Grid(; goal=25),  # Using a grid search with 25 points in total
    resampling=CV(; nfolds=5, rng=123),  # 5-fold cross-validation
    measure=root_mean_squared_error,  # Evaluation metric
    ranges=[n_trees_range, n_subfeatures_range]
)

Fit and ID best model

tuned_mach = machine(tuning, X, y)
fit!(tuned_mach; rows=train)
best_model = fitted_params(tuned_mach).best_model

println("Best n_trees: ", best_model.n_trees)
println("Best n_subfeatures: ", best_model.n_subfeatures)
Best n_trees: 10
Best n_subfeatures: 12
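
To see what the grid search actually tried, we can inspect the tuning report (a sketch; report is part of the MLJ API, though the exact field names of the tuning report can vary across MLJTuning versions):

r = report(tuned_mach)
r.best_history_entry   # cross-validated measurement for the best model
length(r.history)      # 25: one entry per grid point evaluated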

Fit the best model and make predictions

best_mach = machine(best_model, X, y)
fit!(best_mach; rows=train)

y_pred_test2 = predict(best_mach, X[test, :])
rms_test2 = root_mean_squared_error(y_pred_test2, y[test])

y_pred_train2 = predict(best_mach, X[train, :])
rms_train2 = root_mean_squared_error(y_pred_train2, y[train])

rms_train2, rms_test2
(186.9403933790913, 193.79639432645558)

Important

We have achieved a much smaller train-test gap (roughly 7 vs. 28 RMSE units), but the tuned model performs worse on the training set and slightly worse on the test set! Cross-validated tuning can reduce overfitting, but it does not guarantee better out-of-sample performance.

More resources

  • MLJ Docs – maintained by the JuliaAI organization / Alan Turing Institute
  • ScikitLearn.jl Docs – a Julia package that wraps Scikit-Learn (Python) models behind the familiar Scikit-Learn interface

Wrapup

Today

  1. Theory

  2. Practice

  3. Wrapup

Project

Posted on Canvas.

If you’re not sure what problem to tackle, some ideas are:

  • “Downscaling”: Map coarse resolution data to fine resolution
  • “Forecasting”: Predict future values of a variable

Questions?

Exams

Exam 1: I’m working through your corrections

Exam 2: next Friday!

Means with missing values

Let’s say you have an array with some missing values, and you want to take the mean across a particular dimension. How can you do this?

using Random, Statistics  # for shuffle and mean

X = convert(Array{Union{Float64,Missing},3}, rand(11, 10, 9))

K = 10
X[shuffle(1:length(X))[1:K]] .= missing
10-element view(reshape(::Array{Union{Missing, Float64}, 3}, 990), [12, 668, 692, 312, 248, 801, 427, 782, 247, 346]) with eltype Union{Missing, Float64}:
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing

We’d like to do this, but we’ll get missing values wherever a slice along the third dimension contains any missing entry:

X_mean = mean(X; dims=3)
sum(ismissing.(X_mean))
9

Instead, we can do:

X_mean = [mean(skipmissing(X[i, j, :])) for i in axes(X, 1), j in axes(X, 2)]
@assert size(X_mean) == (11, 10)
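
One caveat (a sketch, assuming some (i, j) slice could be entirely missing): skipmissing over an all-missing slice yields an empty collection, and mean of an empty collection throws an error, so a guard may be warranted:

# guard against slices that are entirely missing
X_mean = [
    begin
        v = collect(skipmissing(X[i, j, :]))
        isempty(v) ? missing : mean(v)   # fall back to missing for empty slices
    end
    for i in axes(X, 1), j in axes(X, 2)
]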