Random Forest Models


Lecture

2023-10-18

Reading

I am drawing heavily from Chapter 8 of James et al. (2013). For more depth, see Friedman et al. (2001).

Motivation

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Situating ourselves

  • Given: \(\{(X_i, y_i) \mid i = 1, 2, \ldots, n\}\) i.e. paired predictors and targets
    • Supervised learning
  • Goal: approximate a function \(f\)
    • “Regression”
    • Ideally: good predictions on new data

Example dataset

Predict a baseball player’s Salary (thousands of dollars) based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape.

# Packages used in this lecture's examples
using DataFrames, DecisionTree, Plots, RDatasets, Statistics

hitters = dataset("ISLR", "Hitters")
first(hitters, 5)
5×20 DataFrame
Row AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Cat… Cat… Int32 Int32 Int32 Float64? Cat…
1 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 missing A
2 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
3 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
5 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N

Nonlinear Relationships

Code
# Separate the data into two groups: with and without missing Salary values
complete_data = dropmissing(hitters, :Salary)
missing_data = hitters[ismissing.(hitters[:, :Salary]), :]

# Plot the points with valid Salary values, colored based on Salary
p1 = scatter(
    complete_data[:, :Years],
    complete_data[:, :Hits];
    zcolor=log.(complete_data[:, :Salary]),
    xlabel="Years",
    ylabel="Hits",
    label="Log of Salary",
)

# Overlay the points with missing Salary as open circles
scatter!(
    p1,
    missing_data[:, :Years],
    missing_data[:, :Hits];
    markercolor=:white,
    label="Missing Salary",
)

Decision Trees

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Partition

One way we can make predictions is to partition the predictor space into \(M\) regions \(R_1, R_2, \ldots, R_M\) and then predict a constant value in each region.

Code
p2 = plot(p1)

# Draw the vertical line for Years < 4.5
plot!(
    p2, [4.5, 4.5], [0, maximum(hitters[:, :Hits])]; line=:dash, color=:black, label=false
)

# Draw the horizontal line for Hits < 117.5 for Years >= 4.5
plot!(
    p2,
    [4.5, maximum(hitters[:, :Years])],
    [117.5, 117.5];
    line=:dash,
    color=:black,
    label=false,
)

# Annotate the regions
annotate!(p2, 2, maximum(hitters[:, :Hits]) - 20, text("R1", 12, :left))
annotate!(p2, 6, 50, text("R2", 12, :left))
annotate!(p2, 6, maximum(hitters[:, :Hits]) - 20, text("R3", 12, :left))

Terminology

  • Decision Node
    • \(\text{Years} < 4.5\)
    • \(\text{Hits} < 117.5\)
    • Hierarchical structure
  • Leaf Node (aka: terminal node, leaf)
    • \(R_1, R_2, R_3\)

Implementation

Code
# Define the Node structure
abstract type AbstractNode end

struct DecisionNode <: AbstractNode
    feature::Symbol
    threshold::Float64
    left::AbstractNode
    right::AbstractNode
end

struct LeafNode <: AbstractNode
    value::Float64
end

# Define the Partition structure
struct Partition
    feature::Symbol
    threshold::Float64
    left::Union{Partition,Nothing}
    right::Union{Partition,Nothing}
end

# Define the DecisionTree structure
struct MyDecisionTree
    root::AbstractNode
end

# Constructor for DecisionTree from DataFrame and partition
function MyDecisionTree(df::DataFrame, partition::Partition, y::Symbol)
    # Recursive function to build the tree
    function build_tree(partition, subset)
        if partition === nothing
            return LeafNode(mean(skipmissing(subset[:, y])))
        end

        left_subset = subset[subset[!, partition.feature] .<= partition.threshold, :]
        right_subset = subset[subset[!, partition.feature] .> partition.threshold, :]

        left = build_tree(partition.left, left_subset)
        right = build_tree(partition.right, right_subset)

        return DecisionNode(partition.feature, partition.threshold, left, right)
    end

    root = build_tree(partition, df)
    return MyDecisionTree(root)
end

function predict(tree::MyDecisionTree, row::DataFrameRow)
    node = tree.root
    while !isa(node, LeafNode)
        if row[node.feature] <= node.threshold
            node = node.left
        else
            node = node.right
        end
    end
    return node.value
end

Our model

partition = Partition(:Years, 4.5, nothing, Partition(:Hits, 117.5, nothing, nothing))
tree = MyDecisionTree(hitters, partition, :Salary)
predictions = [predict(tree, row) for row in eachrow(hitters)]
Code
p3 = scatter(
    hitters[:, :Years],
    hitters[:, :Hits];
    zcolor=log.(predictions),
    xlabel="Years",
    ylabel="Hits",
    label="Predicted",
    title="Partion Model",
)
plot(plot(p1; title="Obs"), p3; layout=(1, 2), size=(1250, 500), link=:both)

More formally

We are making predictions based on stratification of the feature space

  1. Divide the predictor space \(X\) into \(J\) distinct regions \(R_1, R_2, \ldots, R_J\)
  2. For every observation in \(R_j\), make the same prediction
    • \(\hat{y}_j = \frac{1}{N_j} \sum_{i \in R_j} y_i\)
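
As a small illustration of this prediction rule (the region labels and numbers below are made up for the example, not taken from the Hitters data), each region's prediction is just the mean response of the training observations that land in it:

using DataFrames, Statistics

# Toy data: a region label and a response for each observation
df = DataFrame(region=[1, 1, 2, 2, 2], y=[2.0, 4.0, 10.0, 11.0, 12.0])

# Predicted value in each region = mean of y within that region
combine(groupby(df, :region), :y => mean => :ŷ)   # ŷ = 3.0 in R1, 11.0 in R2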

Choosing partitions

How do we choose the regions \(R_1, R_2, \ldots, R_J\)?

  • We could choose anything!
  • High-dimensional “boxes” are simple
  • Find boxes \(R_1, R_2, \ldots, R_J\) that minimize the residual sum of squares (RSS) \[ \sum_{j=1}^J \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j} \right)^2 \] where \(\hat{y}_{R_j}\) is the mean response within \(R_j\)
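
As a rough sketch of this criterion (the helper partition_rss and the toy numbers are ours, not part of the lecture code), the RSS of a given assignment of observations to regions can be computed directly:

using Statistics

# RSS of a partition: each region predicts the mean of its own observations
function partition_rss(y::AbstractVector, region::AbstractVector)
    total = 0.0
    for r in unique(region)
        y_r = y[region .== r]                    # observations falling in region r
        total += sum((y_r .- mean(y_r)) .^ 2)    # that region's RSS contribution
    end
    return total
end

# Example: observations 1–3 in R1, 4–6 in R2 => RSS = 2 + 2 = 4
partition_rss([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1, 1, 1, 2, 2, 2])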

Optimization

Extremely hard problem:

Consider the space of all possible partitions

  • How I choose \(R_1\) will affect the best \(R_2\)

Feasible problem: recursive binary splitting

  1. Select a predictor \(X_j\) and cutpoint \(s\) so that splitting predictor space into \(\{X | X_j < s \}\) and \(\{X | X_j \geq s \}\) minimizes RSS
    • Consider roughly \(p \times n\) candidate splits (each of the \(p\) predictors paired with each observed cutpoint; a sketch follows below)
  2. Repeat, considering a partition on each of the two resulting regions

Top-down, greedy algorithm
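
Here is a minimal sketch of step 1 of this greedy search (the helpers best_split and split_rss are our own names, not from a library): for each predictor and each observed cutpoint, compute the RSS of the resulting two regions and keep the best.

using DataFrames, Statistics

# RSS if we split the response y according to `mask` (left) vs. its complement (right)
function split_rss(y, mask)
    left, right = y[mask], y[.!mask]
    (isempty(left) || isempty(right)) && return Inf   # reject degenerate splits
    return sum((left .- mean(left)) .^ 2) + sum((right .- mean(right)) .^ 2)
end

# Exhaustive search over predictors X_j and cutpoints s for the split {X | X_j < s}
function best_split(df::DataFrame, predictors::Vector{Symbol}, y::Symbol)
    yv = df[:, y]
    best = (feature=:none, threshold=NaN, rss=Inf)
    for feature in predictors, s in unique(df[!, feature])
        r = split_rss(yv, df[!, feature] .< s)
        r < best.rss && (best = (feature=feature, threshold=float(s), rss=r))
    end
    return best
end

# e.g. best_split(dropmissing(hitters, :Salary), [:Years, :Hits], :Salary)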

Overfitting

  1. A “deep” (many splits) tree will fit our data well
    1. But is likely to overfit
    2. Lower bias, higher variance
    3. We could split until each observation is in its own region!
  2. A “shallow” (few splits) tree will fit our data poorly
    1. But is likely to generalize better
    2. Higher bias, lower variance

Cost complexity penalty

A penalty on the number of terminal nodes \(|T|\): \[ \text{Loss} = \sum_{m=1}^{|T|} \sum_{i: X_i \in R_m} \left(y_i - \hat{y}_{R_m} \right)^2 + \alpha |T| \]
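
To see how the penalty trades off fit against tree size, here is a toy calculation (the subtree sizes and RSS values below are invented purely for illustration): as \(\alpha\) grows, the minimizer of the penalized loss shifts toward smaller trees.

# Hypothetical nested subtrees: their sizes |T| and (made-up) RSS values
sizes = [1, 2, 3, 5, 8]
rss_values = [120.0, 80.0, 62.0, 50.0, 45.0]

for α in (0.0, 5.0, 15.0)
    losses = rss_values .+ α .* sizes          # Loss = RSS + α|T|
    best = argmin(losses)
    println("α = $α: best |T| = $(sizes[best]), loss = $(losses[best])")
end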

Pruning

Empirically, it works well to grow a large tree, then “prune” it back to a smaller tree.

  1. Use recursive binary splitting with MSE loss to grow a large tree
    1. Example stopping rule: all regions have fewer than \(K\) observations
  2. Repeatedly find the “weakest link”: the split whose removal (collapsing it into a single leaf) gives the smallest increase in RSS, and prune it, producing a nested sequence of subtrees (a sketch follows this list)
  3. Among those subtrees, choose the one that minimizes the cost-complexity loss (e.g., selecting \(\alpha\) by cross-validation)
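
A minimal sketch of the weakest-link step, written against the MyDecisionTree types defined earlier (the helpers rss and weakest_links are our own names, not library routines):

using DataFrames, Statistics

# Guarded RSS helper: empty subsets contribute zero
rss(v) = isempty(v) ? 0.0 : sum((v .- mean(v)) .^ 2)

# Return (subtree RSS, [(decision node, RSS increase if collapsed to a leaf), ...])
# using the rows of `df` that reach `node`
function weakest_links(node::AbstractNode, df::DataFrame, y::Symbol)
    yv = collect(skipmissing(df[:, y]))
    node isa LeafNode && return rss(yv), Tuple{DecisionNode,Float64}[]
    left_df = df[df[!, node.feature] .<= node.threshold, :]
    right_df = df[df[!, node.feature] .> node.threshold, :]
    rss_l, links_l = weakest_links(node.left, left_df, y)
    rss_r, links_r = weakest_links(node.right, right_df, y)
    subtree_rss = rss_l + rss_r
    links = vcat(links_l, links_r, [(node, rss(yv) - subtree_rss)])
    return subtree_rss, links
end

# The decision node whose removal increases RSS the least is the weakest link:
# _, links = weakest_links(tree.root, dropmissing(hitters, :Salary), :Salary)
# weakest_node, increase = argmin(last, links)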

Classification trees

We’ve been focusing on regression, but classification is also a common task!

  • Given remote sensing images, classify land uses
  • Given information about a house and flood, predict whether it experienced damage
  • Given some parameters describing population growth rates, climate change, etc., predict whether a community will experience water stress

Same idea, but with a different loss function. For example, cross-entropy loss: \[ D = - \sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk} \] where \(\hat{p}_{mk}\) is the proportion of observations in region \(m\) that are in class \(k\).
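
As a small sketch of this quantity for a single region (the function name and the example labels below are ours), the cross-entropy can be computed from the class proportions of the observations in that region:

# Cross-entropy of one region's class labels: D = -∑_k p̂_mk log(p̂_mk)
function cross_entropy(labels::AbstractVector)
    n = length(labels)
    D = 0.0
    for k in unique(labels)
        p̂ = count(==(k), labels) / n   # proportion of the region's observations in class k
        D -= p̂ * log(p̂)
    end
    return D
end

cross_entropy(["damage", "damage", "no damage", "damage"])   # ≈ 0.562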

Random Forests

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Ensemble methods

  • Combine many “weak” learners into a “strong” learner
  • “Jury”
  • “Wisdom of the crowd”

Key insight

Ensemble methods work better when the weak learners are less correlated

Bagging

Bagging is a general approach for ensemble learning that is especially useful for tree methods.

Problem: decision trees have high variance. If we split our data in half and fit a decision tree separately to each half, the two trees might look very different.

Concept: averaging a set of observations reduces variance. Recall that given \(n\) IID observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of their mean \(\bar{Z}\) is \(\sigma^2 / n\).
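
A quick simulation check of this fact (the numbers here are a made-up toy setting, not the Hitters data): averaging \(n = 25\) draws with variance \(\sigma^2 = 4\) should give a mean with variance of about 0.16.

using Random, Statistics

Random.seed!(1)
σ², n = 4.0, 25
sample_means = [mean(sqrt(σ²) .* randn(n)) for _ in 1:100_000]
var(sample_means)   # ≈ σ²/n = 0.16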

Approach: use the bootstrap to create \(B\) datasets, fit a decision tree to each, and average the predictions: \[ \hat{f}_\text{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}^{*b}(x) \] where \(\hat{f}^{*b}(x)\) is the prediction at \(x\) from the tree trained on the \(b\)th bootstrap sample.
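
Here is a minimal sketch of bagging with DecisionTree.jl, the package used in the Julia example later in the lecture (the function bagged_predict is our own wrapper, not a library routine):

using DecisionTree

# Fit B trees to bootstrap resamples and average their predictions on the full data
function bagged_predict(features::Matrix, labels::Vector, B::Int)
    n = size(features, 1)
    preds = zeros(n)
    for b in 1:B
        idx = rand(1:n, n)                               # bootstrap sample (with replacement)
        model = DecisionTreeRegressor()
        fit!(model, features[idx, :], labels[idx])       # f̂*ᵇ trained on the b-th resample
        preds .+= DecisionTree.predict(model, features)  # evaluate f̂*ᵇ(x) on all observations
    end
    return preds ./ B                                    # average over the B trees
end

# e.g. bagged_predict(features, labels, 100) with the features/labels built in the Julia example below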

Random Forests

Problem: the trees in a bagged ensemble are highly correlated. Averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities.

Solution: at each split in the tree, consider only a random subset of the predictors (do not allow the model to split on the rest)

Rationale: suppose there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then most or all of the bagged trees will use this strong predictor in the top split, so their predictions will be highly correlated.

Implementation: at each split, randomly select \(m\) predictors out of the \(p\) possible predictors. Typically, we choose \(m \approx \sqrt{p}\). (If \(m=p\) then we are back to regular bagging.)

Boosting

Like bagging, boosting is a general approach that is commonly used in tree methods.

Idea: instead of training each “tree” in the “forest” on a bootstrapped sample of the data, train each tree on a modified version of the data set. Specifically, fit a tree using the current residuals, rather than the outcome, as the response.

Algorithm:

  1. Initialize the prediction \(\hat{f}(x) = 0\) and the residuals \(r_i = y_i\) for all \(i\)
  2. For \(b = 1, 2, \ldots, B\):
    1. Fit a tree \(\hat{f}^b\) with \(d\) splits to the training data \((X, r)\)
    2. Update the prediction: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\)
    3. Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)
  3. Output the boosted model: \(\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)\)

Key parameters: number of trees \(B\), shrinkage rate \(\lambda\), number of splits per tree \(d\)
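
A minimal sketch of this boosting loop, again using DecisionTree.jl regression trees as the weak learners (boosted_fit and boosted_predict are our own names; max_depth=d is used as a stand-in for “\(d\) splits”):

using DecisionTree

function boosted_fit(features::Matrix, labels::Vector; B=100, λ=0.01, d=3)
    r = copy(labels)                                      # residuals start as the outcome itself
    trees = DecisionTreeRegressor[]
    for b in 1:B
        tree = DecisionTreeRegressor(; max_depth=d)       # a small tree fit to the residuals
        fit!(tree, features, r)
        r .-= λ .* DecisionTree.predict(tree, features)   # update the residuals
        push!(trees, tree)
    end
    return trees
end

# Boosted prediction: sum of λ f̂ᵇ(x) over the B trees (use the same λ as in fitting)
boosted_predict(trees, X; λ=0.01) = λ .* reduce(+, (DecisionTree.predict(t, X) for t in trees))

# e.g. trees = boosted_fit(features, labels); boosted_predict(trees, features)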

Julia example

# Drop rows with missing Salary values
hitters_nm = dropmissing(hitters, :Salary)

# Prepare data for training
numerical_cols = [col for col in names(hitters_nm) if eltype(hitters_nm[!, col]) <: Number]
hitters_nm = hitters_nm[:, numerical_cols]
features = Matrix(hitters_nm[:, Not(:Salary)])
labels = vec(hitters_nm[:, :Salary])

# Train a decision tree regressor
model = DecisionTreeRegressor()
fit!(model, features, labels)

# Get predictions
predictions = DecisionTree.predict(model, features)

# Scatter plot of actual vs predicted values
scatter(
    labels,
    predictions;
    xlabel="Actual Salary",
    ylabel="Predicted Salary",
    label="Data Points",
    legend=:topleft,
)

# Plot a diagonal line for perfect predictions
Plots.abline!(1, 0; color=:black, label="Perfect Predictions")

Adjustments

# Number of predictors to consider at each split: m ≈ √p
m = Int(ceil(sqrt(size(features, 2))))

# Train a random forest regressor
model = RandomForestRegressor(; n_subfeatures=m, n_trees=250)
fit!(model, features, labels)

# Get predictions
predictions = DecisionTree.predict(model, features)

# Scatter plot of actual vs predicted values
scatter(
    labels,
    predictions;
    xlabel="Actual Salary",
    ylabel="Predicted Salary",
    label="Data Points",
    legend=:topleft,
)

# Plot a diagonal line for perfect predictions
Plots.abline!(1, 0; color=:black, label="Perfect Predictions")

Wrapup

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Key things to know

  1. Decision trees
    • Why would we fit them?
    • How do they work?
    • Key trade-offs
  2. Tree ensemble methods
    • How do boosting / bagging / RFs work?
    • Be able to outline the algorithm
    • Explain the logic underpinning these methods

References

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer series in statistics Springer, Berlin.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.