Random Forest Models


Lecture

2023-10-18

Reading

I am drawing heavily from Chapter 8 of James et al. (2013). For more depth, see Friedman et al. (2001).

Motivation

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Situating ourselves

  • Given: \(\{(X_i, y_i) \mid i = 1, 2, \ldots, n\}\) i.e. paired predictors and targets
    • Supervised learning
  • Goal: approximate a function \(f\)
    • “Regression”
    • Ideally: good predictions on new data

Example dataset

Predict a baseball player’s Salary (thousands of dollars) based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape.

# Packages used in this lecture's examples
using DataFrames, DecisionTree, Plots, RDatasets, Statistics

hitters = dataset("ISLR", "Hitters")
first(hitters, 5)
5×20 DataFrame
Row AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Int32 Cat… Cat… Int32 Int32 Int32 Float64? Cat…
1 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 missing A
2 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
3 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
5 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N

Nonlinear Relationships

Code
# Separate the data into two groups: with and without missing Salary values
complete_data = dropmissing(hitters, :Salary)
missing_data = hitters[ismissing.(hitters[:, :Salary]), :]

# Plot the points with valid Salary values, colored based on Salary
p1 = scatter(
    complete_data[:, :Years],
    complete_data[:, :Hits];
    zcolor=log.(complete_data[:, :Salary]),
    xlabel="Years",
    ylabel="Hits",
    label="Log of Salary",
)

# Overlay the points with missing Salary as open circles
scatter!(
    p1,
    missing_data[:, :Years],
    missing_data[:, :Hits];
    markercolor=:white,
    label="Missing Salary",
)

Decision Trees

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Partition

One way we can make predictions is to partition the predictor space into \(M\) regions \(R_1, R_2, \ldots, R_M\) and then predict a constant value in each region.

Code
p2 = plot(p1)

# Draw the vertical line for Years < 4.5
plot!(
    p2, [4.5, 4.5], [0, maximum(hitters[:, :Hits])]; line=:dash, color=:black, label=false
)

# Draw the horizontal line for Hits < 117.5 for Years >= 4.5
plot!(
    p2,
    [4.5, maximum(hitters[:, :Years])],
    [117.5, 117.5];
    line=:dash,
    color=:black,
    label=false,
)

# Annotate the regions
annotate!(p2, 2, maximum(hitters[:, :Hits]) - 20, text("R1", 12, :left))
annotate!(p2, 6, 50, text("R2", 12, :left))
annotate!(p2, 6, maximum(hitters[:, :Hits]) - 20, text("R3", 12, :left))

Terminology

  • Decision Node
    • \(\text{Years} < 4.5\)
    • \(\text{Hits} < 117.5\)
    • Hierarchical structure
  • Leaf Node (aka: terminal node, leaf)
    • \(R_1, R_2, R_3\)

Implementation

Code
# Define the Node structure
abstract type AbstractNode end

struct DecisionNode <: AbstractNode
    feature::Symbol
    threshold::Float64
    left::AbstractNode
    right::AbstractNode
end

struct LeafNode <: AbstractNode
    value::Float64
end

# Define the Partition structure
struct Partition
    feature::Symbol
    threshold::Float64
    left::Union{Partition,Nothing}
    right::Union{Partition,Nothing}
end

# Define the DecisionTree structure
struct MyDecisionTree
    root::AbstractNode
end

# Constructor for DecisionTree from DataFrame and partition
function MyDecisionTree(df::DataFrame, partition::Partition, y::Symbol)
    # Recursive function to build the tree
    function build_tree(partition, subset)
        if partition === nothing
            return LeafNode(mean(skipmissing(subset[:, y])))
        end

        left_subset = subset[subset[!, partition.feature] .<= partition.threshold, :]
        right_subset = subset[subset[!, partition.feature] .> partition.threshold, :]

        left = build_tree(partition.left, left_subset)
        right = build_tree(partition.right, right_subset)

        return DecisionNode(partition.feature, partition.threshold, left, right)
    end

    root = build_tree(partition, df)
    return MyDecisionTree(root)
end

function predict(tree::MyDecisionTree, row::DataFrameRow)
    node = tree.root
    while !isa(node, LeafNode)
        if row[node.feature] <= node.threshold
            node = node.left
        else
            node = node.right
        end
    end
    return node.value
end

Our model

partition = Partition(:Years, 4.5, nothing, Partition(:Hits, 117.5, nothing, nothing))
tree = MyDecisionTree(hitters, partition, :Salary)
predictions = [predict(tree, row) for row in eachrow(hitters)]
Code
p3 = scatter(
    hitters[:, :Years],
    hitters[:, :Hits];
    zcolor=log.(predictions),
    xlabel="Years",
    ylabel="Hits",
    label="Predicted",
    title="Partion Model",
)
plot(plot(p1; title="Obs"), p3; layout=(1, 2), size=(1250, 500), link=:both)

More formally

We are making predictions based on stratification of the feature space

  1. Divide the predictor space \(X\) into \(J\) distinct regions \(R_1, R_2, \ldots, R_J\)
  2. For every observation in \(R_j\), make the same prediction
    • \(\hat{y}_j = \frac{1}{N_j} \sum_{i \in R_j} y_i\)
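
As a small illustration of this prediction rule (the region labels and numbers below are made up for the example, not taken from the Hitters data), each region's prediction is just the mean response of the training observations that land in it:

using DataFrames, Statistics

# Toy data: a region label and a response for each observation
df = DataFrame(region=[1, 1, 2, 2, 2], y=[2.0, 4.0, 10.0, 11.0, 12.0])

# Predicted value in each region = mean of y within that region
combine(groupby(df, :region), :y => mean => :ŷ)   # ŷ = 3.0 in R1, 11.0 in R2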

Choosing partitions

How do we choose the regions \(R_1, R_2, \ldots, R_J\)?

  • We could choose anything!
  • High-dimensional “boxes” are simple
  • Find boxes \(R_1, R_2, \ldots, R_J\) that minimize the residual sum of squares (RSS) \[ \sum_{j=1}^J \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j} \right)^2 \] where \(\hat{y}_{R_j}\) is the mean response within \(R_j\)
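
As a rough sketch of this criterion (the helper partition_rss and the toy numbers are ours, not part of the lecture code), the RSS of a given assignment of observations to regions can be computed directly:

using Statistics

# RSS of a partition: each region predicts the mean of its own observations
function partition_rss(y::AbstractVector, region::AbstractVector)
    total = 0.0
    for r in unique(region)
        y_r = y[region .== r]                    # observations falling in region r
        total += sum((y_r .- mean(y_r)) .^ 2)    # that region's RSS contribution
    end
    return total
end

# Example: observations 1–3 in R1, 4–6 in R2 => RSS = 2 + 2 = 4
partition_rss([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1, 1, 1, 2, 2, 2])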

Optimization

Extremely hard problem:

Consider the space of all possible partitions

  • How I choose \(R_1\) will affect the best \(R_2\)

Feasible problem: recursive binary splitting

  1. Select a predictor \(X_j\) and cutpoint \(s\) so that splitting predictor space into \(\{X | X_j < s \}\) and \(\{X | X_j \geq s \}\) minimizes RSS
    • Consider roughly \(p \times n\) candidate splits (each of the \(p\) predictors paired with each observed cutpoint; a sketch follows below)
  2. Repeat, considering a partition on each of the two resulting regions

Top-down, greedy algorithm
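
Here is a minimal sketch of step 1 of this greedy search (the helpers best_split and split_rss are our own names, not from a library): for each predictor and each observed cutpoint, compute the RSS of the resulting two regions and keep the best.

using DataFrames, Statistics

# RSS if we split the response y according to `mask` (left) vs. its complement (right)
function split_rss(y, mask)
    left, right = y[mask], y[.!mask]
    (isempty(left) || isempty(right)) && return Inf   # reject degenerate splits
    return sum((left .- mean(left)) .^ 2) + sum((right .- mean(right)) .^ 2)
end

# Exhaustive search over predictors X_j and cutpoints s for the split {X | X_j < s}
function best_split(df::DataFrame, predictors::Vector{Symbol}, y::Symbol)
    yv = df[:, y]
    best = (feature=:none, threshold=NaN, rss=Inf)
    for feature in predictors, s in unique(df[!, feature])
        r = split_rss(yv, df[!, feature] .< s)
        r < best.rss && (best = (feature=feature, threshold=float(s), rss=r))
    end
    return best
end

# e.g. best_split(dropmissing(hitters, :Salary), [:Years, :Hits], :Salary)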

Overfitting

  1. A “deep” (many splits) tree will fit our data well
    1. But is likely to overfit
    2. Lower bias, higher variance
    3. We could split until each observation is in its own region!
  2. A “shallow” (few splits) tree will fit our data poorly
    1. But is likely to generalize better
    2. Higher bias, lower variance

Cost complexity penalty

A penalty on the number of terminal nodes \(|T|\): \[ \text{Loss} = \sum_{m=1}^{|T|} \sum_{i: X_i \in R_m} \left(y_i - \hat{y}_{R_m} \right)^2 + \alpha |T| \]
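
To see how the penalty trades off fit against tree size, here is a toy calculation (the subtree sizes and RSS values below are invented purely for illustration): as \(\alpha\) grows, the minimizer of the penalized loss shifts toward smaller trees.

# Hypothetical nested subtrees: their sizes |T| and (made-up) RSS values
sizes = [1, 2, 3, 5, 8]
rss_values = [120.0, 80.0, 62.0, 50.0, 45.0]

for α in (0.0, 5.0, 15.0)
    losses = rss_values .+ α .* sizes          # Loss = RSS + α|T|
    best = argmin(losses)
    println("α = $α: best |T| = $(sizes[best]), loss = $(losses[best])")
end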

Pruning

Empirically, it works well to grow a large tree, then “prune” it back to a smaller tree.

  1. Use recursive binary splitting with MSE loss to grow a large tree
    1. Example stopping rule: all regions have fewer than \(K\) observations
  2. Repeatedly find the “weakest link”: the split whose removal (collapsing it into a single leaf) gives the smallest increase in RSS, and prune it, producing a nested sequence of subtrees (a sketch follows this list)
  3. Among those subtrees, choose the one that minimizes the cost-complexity loss (e.g., selecting \(\alpha\) by cross-validation)
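
A minimal sketch of the weakest-link step, written against the MyDecisionTree types defined earlier (the helpers rss and weakest_links are our own names, not library routines):

using DataFrames, Statistics

# Guarded RSS helper: empty subsets contribute zero
rss(v) = isempty(v) ? 0.0 : sum((v .- mean(v)) .^ 2)

# Return (subtree RSS, [(decision node, RSS increase if collapsed to a leaf), ...])
# using the rows of `df` that reach `node`
function weakest_links(node::AbstractNode, df::DataFrame, y::Symbol)
    yv = collect(skipmissing(df[:, y]))
    node isa LeafNode && return rss(yv), Tuple{DecisionNode,Float64}[]
    left_df = df[df[!, node.feature] .<= node.threshold, :]
    right_df = df[df[!, node.feature] .> node.threshold, :]
    rss_l, links_l = weakest_links(node.left, left_df, y)
    rss_r, links_r = weakest_links(node.right, right_df, y)
    subtree_rss = rss_l + rss_r
    links = vcat(links_l, links_r, [(node, rss(yv) - subtree_rss)])
    return subtree_rss, links
end

# The decision node whose removal increases RSS the least is the weakest link:
# _, links = weakest_links(tree.root, dropmissing(hitters, :Salary), :Salary)
# weakest_node, increase = argmin(last, links)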

Classification trees

We’ve been focusing on regression, but classification is also a common task!

  • Given remote sensing images, classify land uses
  • Given information about a house and flood, predict whether it experienced damage
  • Given some parameters describing population growth rates, climate change, etc., predict whether a community will experience water stress

Same idea, but with a different loss function. For example, cross-entropy loss: \[ D = - \sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk} \] where \(\hat{p}_{mk}\) is the proportion of observations in region \(m\) that are in class \(k\).
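
As a small sketch of this quantity for a single region (the function name and the example labels below are ours), the cross-entropy can be computed from the class proportions of the observations in that region:

# Cross-entropy of one region's class labels: D = -∑_k p̂_mk log(p̂_mk)
function cross_entropy(labels::AbstractVector)
    n = length(labels)
    D = 0.0
    for k in unique(labels)
        p̂ = count(==(k), labels) / n   # proportion of the region's observations in class k
        D -= p̂ * log(p̂)
    end
    return D
end

cross_entropy(["damage", "damage", "no damage", "damage"])   # ≈ 0.562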

Random Forests

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Ensemble methods

  • Combine many “weak” learners into a “strong” learner
  • “Jury”
  • “Wisdom of the crowd”

Key insight

Ensemble methods work better when the weak learners are less correlated

Bagging

Bagging is a general approach for ensemble learning that is especially useful for tree methods.

Problem: decision trees have high variance. If we split our data in half and fit a decision tree separately to each half, the two trees might look very different.

Concept: averaging a set of observations reduces variance. Recall that given \(n\) IID observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of their mean \(\bar{Z}\) is \(\sigma^2 / n\).
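
A quick simulation check of this fact (the numbers here are a made-up toy setting, not the Hitters data): averaging \(n = 25\) draws with variance \(\sigma^2 = 4\) should give a mean with variance of about 0.16.

using Random, Statistics

Random.seed!(1)
σ², n = 4.0, 25
sample_means = [mean(sqrt(σ²) .* randn(n)) for _ in 1:100_000]
var(sample_means)   # ≈ σ²/n = 0.16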

Approach: use the bootstrap to create \(B\) datasets, fit a decision tree to each, and average the predictions: \[ \hat{f}_\text{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}^{*b}(x) \] where \(\hat{f}^{*b}(x)\) is the prediction at \(x\) from the tree trained on the \(b\)th bootstrap sample.
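
Here is a minimal sketch of bagging with DecisionTree.jl, the package used in the Julia example later in the lecture (the function bagged_predict is our own wrapper, not a library routine):

using DecisionTree

# Fit B trees to bootstrap resamples and average their predictions on the full data
function bagged_predict(features::Matrix, labels::Vector, B::Int)
    n = size(features, 1)
    preds = zeros(n)
    for b in 1:B
        idx = rand(1:n, n)                               # bootstrap sample (with replacement)
        model = DecisionTreeRegressor()
        fit!(model, features[idx, :], labels[idx])       # f̂*ᵇ trained on the b-th resample
        preds .+= DecisionTree.predict(model, features)  # evaluate f̂*ᵇ(x) on all observations
    end
    return preds ./ B                                    # average over the B trees
end

# e.g. bagged_predict(features, labels, 100) with the features/labels built in the Julia example below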

Random Forests

Problem: the trees in a bagged ensemble are highly correlated. Averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities.

Solution: at each split in the tree, consider only a random subset of the predictors (do not allow the model to split on the rest)

Rationale: suppose there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then most or all of the bagged trees will use this strong predictor in the top split, so their predictions will be highly correlated.

Implementation: at each split, randomly select \(m\) predictors out of the \(p\) possible predictors. Typically, we choose \(m \approx \sqrt{p}\). (If \(m=p\) then we are back to regular bagging.)

Boosting

Like bagging, boosting is a general approach that is commonly used in tree methods.

Idea: instead of training each “tree” in the “forest” on a bootstrapped sample of the data, train each tree on a modified version of the data set. Specifically, fit a tree using the current residuals, rather than the outcome, as the response.

Algorithm:

  1. Initialize the prediction \(\hat{f}(x) = 0\) and the residuals \(r_i = y_i\) for all \(i\)
  2. For \(b = 1, 2, \ldots, B\):
    1. Fit a tree \(\hat{f}^b\) with \(d\) splits to the training data \((X, r)\)
    2. Update the prediction: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\)
    3. Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)
  3. Output the boosted model: \(\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)\)

Key parameters: number of trees \(B\), shrinkage rate \(\lambda\), number of splits per tree \(d\)
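
A minimal sketch of this boosting loop, again using DecisionTree.jl regression trees as the weak learners (boosted_fit and boosted_predict are our own names; max_depth=d is used as a stand-in for “\(d\) splits”):

using DecisionTree

function boosted_fit(features::Matrix, labels::Vector; B=100, λ=0.01, d=3)
    r = copy(labels)                                      # residuals start as the outcome itself
    trees = DecisionTreeRegressor[]
    for b in 1:B
        tree = DecisionTreeRegressor(; max_depth=d)       # a small tree fit to the residuals
        fit!(tree, features, r)
        r .-= λ .* DecisionTree.predict(tree, features)   # update the residuals
        push!(trees, tree)
    end
    return trees
end

# Boosted prediction: sum of λ f̂ᵇ(x) over the B trees (use the same λ as in fitting)
boosted_predict(trees, X; λ=0.01) = λ .* reduce(+, (DecisionTree.predict(t, X) for t in trees))

# e.g. trees = boosted_fit(features, labels); boosted_predict(trees, features)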

Julia example

# Drop rows with missing Salary values
hitters_nm = dropmissing(hitters, :Salary)

# Prepare data for training
numerical_cols = [col for col in names(hitters_nm) if eltype(hitters_nm[!, col]) <: Number]
hitters_nm = hitters_nm[:, numerical_cols]
features = Matrix(hitters_nm[:, Not(:Salary)])
labels = vec(hitters_nm[:, :Salary])

# Train a decision tree regressor
model = DecisionTreeRegressor()
fit!(model, features, labels)

# Get predictions
predictions = DecisionTree.predict(model, features)

# Scatter plot of actual vs predicted values
scatter(
    labels,
    predictions;
    xlabel="Actual Salary",
    ylabel="Predicted Salary",
    label="Data Points",
    legend=:topleft,
)

# Plot a diagonal line for perfect predictions
Plots.abline!(1, 0; color=:black, label="Perfect Predictions")

Adjustments

# Number of predictors to consider at each split: m ≈ √p
m = Int(ceil(sqrt(size(features, 2))))

# Train a random forest regressor
model = RandomForestRegressor(; n_subfeatures=m, n_trees=250)
fit!(model, features, labels)

# Get predictions
predictions = DecisionTree.predict(model, features)

# Scatter plot of actual vs predicted values
scatter(
    labels,
    predictions;
    xlabel="Actual Salary",
    ylabel="Predicted Salary",
    label="Data Points",
    legend=:topleft,
)

# Plot a diagonal line for perfect predictions
Plots.abline!(1, 0; color=:black, label="Perfect Predictions")

Wrapup

Today

  1. Motivation

  2. Decision Trees

  3. Random Forests

  4. Wrapup

Key things to know

  1. Decision trees
    • Why would we fit them?
    • How do they work?
    • Key trade-offs
  2. Tree ensemble methods
    • How do boosting / bagging / RFs work?
    • Be able to outline the algorithm
    • Explain the logic underpinning these methods

References

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer series in statistics Springer, Berlin.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). New York, NY: Springer New York.