Intro to Machine Learning


Lecture

2023-10-06

Practice Problem

We recorded the number of accidents and average vehicle speed in a city over several months:

Month   Average Speed (km/h)   Number of Accidents
Jan     45                     10
Feb     50                     8
Mar     55                     12
Apr     60                     15
  1. Identify a suitable distribution for modeling the number of accidents.
  2. Write down the likelihood for the chosen distribution.
  3. Suggest a link function to relate the average speed to the number of accidents.
  4. Briefly explain why the chosen link function is appropriate.
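
A hedged sketch of one possible answer (an illustrative addition, not part of the original notes): counts are often modeled with a Poisson distribution, and a log link keeps the rate positive. The model name, priors, and the centering of speed below are assumptions made for the example.

using Statistics, Turing

speed = [45.0, 50.0, 55.0, 60.0]   # average speed (km/h)
accidents = [10, 8, 12, 15]        # accident counts

@model function accident_glm(speed, accidents)
    α ~ Normal(0, 10)
    β ~ Normal(0, 1)
    for i in eachindex(accidents)
        # log link: log λᵢ = α + β (speedᵢ - mean speed), so λᵢ > 0
        λ = exp(α + β * (speed[i] - mean(speed)))
        accidents[i] ~ Poisson(λ)
    end
end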

Logistics

Motivation

Today

  1. Motivation

  2. Supervised learning

  3. Loss Functions

  4. Bias-Variance Trade-Off

  5. Wrapup

Example data

We want to make predictions about the value of \(y\) given some \(x\).

The true function will be \[ f(x) = 2x + x \sin(2 \pi x) \] but let’s assume we don’t know it.

Viz

Code
using Distributions, LaTeXStrings, Plots, Random

f(x) = 2x + x * sin(2pi * x)          # true (unknown) function
N = 100
Random.seed!(1017)
x = rand(Uniform(0, 2), N)            # inputs
X = hcat(ones(N), x)                  # design matrix with intercept column
y = f.(x) .+ rand(Normal(0, 1), N)    # noisy observations
p_base = scatter(x, y; label="Obs", xlabel=L"$x$", ylabel=L"$y$")
p1 = plot(p_base)
plot!(p1, f; label=L"$f(x)$")

Linear regression

\[ y | X \sim \mathcal{N}(X \beta, \sigma^2 I) \]

Notation

We will frequently use this linear algebra notation, which is equivalent to writing \[ y_i | X_i \sim \mathcal{N} \left(\sum_{j=1}^J X_{ij} \beta_j, \sigma^2 \right). \]

Bayesian inference

We’ve seen this before:

@model function lin_reg(X, y)
    N, P = size(X)
    β ~ MvNormal(zeros(P), 10 * ones(P)) # prior: βₚ ~ N(0, 10)
    σ ~ InverseGamma(1, 5)
    return y ~ MvNormal(X * β, σ)
end
lin_reg (generic function with 2 methods)
Code
chn_lin_reg = let
    model = lin_reg(X, y)
    sampler = externalsampler(DynamicHMC.NUTS())
    n_per_chain = 5000
    nchains = 4
    sample(model, sampler, MCMCThreads(), n_per_chain, nchains; drop_warmup=true)
end
Chains MCMC chain (5000×4×4 Array{Float64, 3}):
Iterations        = 1:1:5000
Number of chains  = 4
Samples per chain = 5000
Wall duration     = 7.19 seconds
Compute duration  = 5.81 seconds
parameters        = β[1], β[2], σ
internals         = lp
Summary Statistics
  parameters      mean       std      mcse     ess_bulk     ess_tail      rhat   ⋯
      Symbol   Float64   Float64   Float64      Float64      Float64   Float64   ⋯
        β[1]    0.0015    0.2740    0.0033    6826.1387    8822.3325    1.0005 ⋯
        β[2]    1.7479    0.2267    0.0027    6885.6942    8616.7500    1.0005 ⋯
           σ    1.3674    0.0994    0.0010   10957.0709   11254.4262    1.0005 ⋯
                                                                1 column omitted
Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 
        β[1]   -0.5366   -0.1865    0.0046    0.1854    0.5389
        β[2]    1.3019    1.5954    1.7459    1.9002    2.1950
           σ    1.1908    1.2977    1.3601    1.4306    1.5790

Point estimate

We can calculate a point estimate instead of sampling from the posterior. This might make sense if:

  1. We just need a plausible value of the parameters
  2. We don't need to carefully quantify parametric uncertainty
  3. (Caveat: for linear regression, analytic approximations of the posterior are available)

Computation

We consider two point estimates:

  • Maximum likelihood estimate (MLE): \(\arg \max_\theta p(y | \theta)\)
  • Maximum a posteriori estimate (MAP): \(\arg \max_\theta p(\theta | y)\)
Code
reg_model = lin_reg(X, y)
θ̂_map = optimize(reg_model, MAP()).values.array
θ̂_mle = optimize(reg_model, MLE()).values.array

f̂_mle(x) = θ̂_mle[1] + θ̂_mle[2] * x
f̂_map(x) = θ̂_map[1] + θ̂_map[2] * x

p2 = plot(p_base)
x_plot = range(minimum(x), maximum(x); length=250)
plot!(p2, x_plot, f̂_mle.(x_plot); label="MLE", linewidth=3)
plot!(p2, x_plot, f̂_map.(x_plot); label="MAP", alpha=0.5, linewidth=2)
Turing provides special methods for optimize, so we can plug in our model (with data) together with MAP() or MLE(), and it will figure out the loss function and upper/lower bounds for us.

Critiquing the fit

One way we can tell that our fit is terrible is by plotting the residuals. We have assumed that the residuals are IID. However, we can see that the residuals are correlated with our predictor!

ϵ_regression = y .- f̂_mle.(x)
p3 = scatter(x, ϵ_regression; label="Residuals", xlabel=L"$x$", ylabel=L"$\epsilon$")

Expanding regression

Without changing our data generating process, we can expand our regression model to include more features. For example: \[ \mu_i = \beta_0 + \sum_{p=1}^P \beta_p x_i^p \] but we still have \[ y_i | X_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

Code
function log_lik_poly(β, σ, x, y)
    # polynomial mean: μᵢ = β₁ + β₂ xᵢ + β₃ xᵢ² + …
    μ = [sum([βi * xi^(i - 1) for (i, βi) in enumerate(β)]) for xi in x]
    return sum(logpdf.(Normal.(μ, σ), y))
end
order = 7
# negative log likelihood; θ = (β₀, …, β_order, σ)
loss_poly(θ) = -log_lik_poly(θ[1:(order+1)], θ[order+2], x, y)

lower = -Inf * ones(order + 2)
lower[end] = 0   # σ must be nonnegative
upper = Inf * ones(order + 2)
guess = 0.5 * ones(order + 2)
res = optimize(loss_poly, lower, upper, guess)
θ̂_poly = res.minimizer
f̂_poly(x) = θ̂_poly[1] + sum(θ̂_poly[1+k] * x^k for k in 1:order)

p4 = plot(p_base)
plot!(p4, x_plot, f.(x_plot); label="True Function")
plot!(p4, x_plot, f̂_poly.(x_plot); label="Polynomial Fit K=$order")

What are some possible problems with this approach?

Supervised learning

Today

  1. Motivation

  2. Supervised learning

  3. Loss Functions

  4. Bias-Variance Trade-Off

  5. Wrapup

Overview

Generally, we are interested in estimating some function \(f\) that maps inputs \(X\) to outputs \(y\): \[ y = f(X) + \epsilon \]

Nonparametric methods

\(K\) nearest neighbors (KNN): finds the \(K\) training examples that are closest to a given input and returns the average of their outputs.

Note

Nonparametric does not mean that there are no parameters. For example, \(K\) is a parameter!

Code
using NearestNeighbors, Statistics

# data structure to make nearest neighbor search fast -- don't worry about this
kdtree = KDTree(transpose(x))

function knn_predict(xi, k)
    # find the k nearest neighbors
    idxs, _ = knn(kdtree, [xi], k, true)
    # return the average of the y values
    return mean(y[idxs])
end

ŷ_3 = knn_predict.(x_plot, 3)
ŷ_10 = knn_predict.(x_plot, 10)

p5 = plot(p_base)
plot!(p5, x_plot, ŷ_3; label="K=3")
plot!(p5, x_plot, ŷ_10; label="K=10")

What are some problems with this \(K\)NN model?

Parametric methods

Parametric methods model the function \(f\) using some parameters \(\theta\). Thus, finding \(\hat{f}\) is equivalent to choosing appropriate \(\theta\).

The linear regression example we’ve been working with is a parametric function approximation.

Loss Functions

Today

  1. Motivation

  2. Supervised learning

  3. Loss Functions

  4. Bias-Variance Trade-Off

  5. Wrapup

Loss functions

We need to define what we mean by a “best” approximation.

  • What? Measures difference between predicted and actual values.
  • Why? Guide optimization towards best model parameters.
  • Types: Vary by algorithm and task (e.g., regression vs. classification).
  • Impact: Choice can significantly affect model performance.
  • Key: Align loss function with task objectives for best results.

Common Loss Functions

  1. MSE (Mean Squared Error): \(L(y, \hat{y}) = (y - \hat{y})^2\). Emphasizes larger errors, which makes it sensitive to outliers.
  2. MAE (Mean Absolute Error): \(L(y, \hat{y}) = |y - \hat{y}|\). Less sensitive to outliers, but non-differentiable at zero.
  3. Huber Loss: \[L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}\] Combines MSE and MAE. Requires a threshold parameter \(\delta\).
  4. Quantile Loss: \(L_\tau(y, \hat{y}) = \tau(y - \hat{y})\) if \((y - \hat{y}) > 0\) else \((\tau - 1)(y - \hat{y})\). Tailored to specific quantiles (\(\tau\)). Useful for asymmetric errors.

Visualization

Code
# Sample residuals
residuals = -2:0.01:2

# Loss functions
mse_loss = residuals .^ 2
mae_loss = abs.(residuals)
huber_loss = [abs(r) <= 1 ? 0.5 * r^2 : abs(r) - 0.5 for r in residuals]  # δ = 1
quantile_tau = 0.5  # example quantile value
quantile_loss = [r > 0 ? quantile_tau * r : (quantile_tau - 1) * r for r in residuals]

# Plot
plot(residuals, mse_loss; label="MSE", lw=2)
plot!(residuals, mae_loss; label="MAE", lw=2)
plot!(residuals, huber_loss; label="Huber", lw=2)
plot!(residuals, quantile_loss; label="Quantile (τ=0.5)", lw=2)
xlabel!(L"Residual ($y - \hat{y}$)")
ylabel!("Loss")
title!("Loss Functions Visualization")

Bias-Variance Trade-Off

Today

  1. Motivation

  2. Supervised learning

  3. Loss Functions

  4. Bias-Variance Trade-Off

  5. Wrapup

Motivation

  1. Model Performance: Why does our model make errors? Can we reduce them?
  2. Overfitting vs Underfitting: How do we balance fitting our data well and ensuring our model generalizes to new data?
  3. Model Complexity: As we add more features or increase model complexity, how does it impact our model’s errors?
  4. Optimal Model: How do we find the sweet spot where our model has the best predictive performance?
  5. Interpretability: Understanding bias and variance can help in making informed decisions about model selection and complexity.

Key Insight: Every model error can be decomposed into bias, variance, and irreducible error. Balancing bias and variance is crucial for creating models that perform well on both training and unseen data.

Bias and variance

Figure 1: Bias-variance illustration (Wikipedia). Four panels: high bias/low variance, high bias/high variance, low bias/low variance, low bias/high variance.

Mean Squared Error (MSE)

The expected prediction error for any machine learning algorithm can be broken down as:

\[ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

  1. Bias: How much on average are the predicted values different from the actual values? Represents errors due to overly simplistic assumptions in the learning algorithm. \[ \text{Bias}(\hat{f}(x)) = E[\hat{f}(x) - f(x)] \]

  2. Variance: How much does the prediction for a given point vary between different realizations of the model? Represents errors due to the model’s sensitivity to small fluctuations in the training set. \[ \text{Variance}(\hat{f}(x)) = E[\hat{f}(x)^2] - E[\hat{f}(x)]^2 \]

  3. Irreducible Error: Noise inherent in any real-world data that we cannot remove, no matter how good our model is.
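
To make the decomposition concrete, here is a minimal sketch (an illustrative addition, not from the slides): it refits a cubic polynomial by least squares to many simulated training sets from the data-generating process above and estimates the bias² and variance of its prediction at a single test point. The names xs, ys, x0, and degree are made up for this example.

using Distributions, Random, Statistics

f(x) = 2x + x * sin(2pi * x)   # true function from earlier
σ_noise = 1.0
x0 = 1.0                       # test point
degree = 3                     # illustrative model complexity

Random.seed!(1)
preds = map(1:2000) do _
    xs = rand(Uniform(0, 2), 100)
    ys = f.(xs) .+ rand(Normal(0, σ_noise), 100)
    Xp = reduce(hcat, [xs .^ k for k in 0:degree])   # polynomial design matrix
    β̂ = Xp \ ys                                      # least-squares fit
    sum(β̂[k+1] * x0^k for k in 0:degree)             # prediction at x0
end

bias2 = (mean(preds) - f(x0))^2
variance = var(preds)
println("bias² ≈ $(round(bias2, digits=3)), variance ≈ $(round(variance, digits=3)), irreducible ≈ $(σ_noise^2)")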

Bayesian methods

We can think about Bayesian methods through the lens of the bias-variance trade-off:

  • Priors add bias
  • Priors often reduce variance
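
A minimal sketch of this idea (an addition, not from the slides) using the conjugate normal-mean model: with \(y_i \sim \mathcal{N}(\theta, \sigma^2)\) and prior \(\theta \sim \mathcal{N}(0, \tau^2)\), the posterior mean shrinks the sample mean toward zero. A tighter prior shrinks more, adding bias toward zero but reducing variance across datasets.

# posterior mean of θ given sample mean ȳ, sample size n, sampling variance σ2, prior variance τ2
posterior_mean(ȳ, n, σ2, τ2) = (n / σ2) / (n / σ2 + 1 / τ2) * ȳ

posterior_mean(1.0, 10, 1.0, 100.0)   # weak prior  → ≈ 1.0 (little shrinkage, low bias)
posterior_mean(1.0, 10, 1.0, 0.1)     # tight prior → 0.5 (more bias toward 0, lower variance)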

Ridge Regression

Ridge Regression modifies linear regression to include a regularization term. The regularization term discourages overly complex models, which can overfit the training data.

\[ L(\beta) = \| y - X \beta \|_2^2 + \lambda \| \beta \|_2^2 \] where \(\lambda\) is the regularization parameter and \[ \| x \|_2 = \sqrt{\sum_{i=1}^n x_i^2} \] is the L2 norm.
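
As a quick illustrative sketch (not from the slides), the ridge estimate has a closed form, \(\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y\). The helper below is an assumption made for illustration; for simplicity it penalizes every coefficient, including the intercept.

using LinearAlgebra

# closed-form ridge estimate (penalizes all coefficients, including the intercept)
ridge_fit(X, y, λ) = (X' * X + λ * I) \ (X' * y)

β̂_ols   = ridge_fit(X, y, 0.0)    # λ = 0 recovers ordinary least squares
β̂_ridge = ridge_fit(X, y, 10.0)   # larger λ shrinks the coefficients toward zero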

Lasso Regression

Lasso Regression (Least Absolute Shrinkage and Selection Operator) includes an L1 penalty.

  1. The penalty is stronger near zero (relative to the L2 penalty), so it sets some coefficients to (almost) exactly zero
  2. The penalty is smaller far from zero, so large coefficients are shrunk less aggressively

\[ L(\beta) = \| y - X \beta \|_2^2 + \lambda \| \beta \|_1 \]
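
Unlike ridge, the lasso has no general closed form, but a small sketch (an illustrative addition, assuming an orthonormal design) shows the key behavior: the L1 penalty soft-thresholds the least-squares coefficients, zeroing out small ones and shrinking large ones by a constant.

# soft-thresholding operator: the lasso solution for an orthonormal design
# (up to the exact scaling of λ), applied to each least-squares coefficient
soft_threshold(z, λ) = sign(z) * max(abs(z) - λ, 0)

soft_threshold(0.3, 0.5)   # 0.0 -- small coefficient set exactly to zero
soft_threshold(2.0, 0.5)   # 1.5 -- large coefficient shrunk by λ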

Wrapup

Today

  1. Motivation

  2. Supervised learning

  3. Loss Functions

  4. Bias-Variance Trade-Off

  5. Wrapup

Key ideas

  • Point estimates rather than quantifying posterior
    • Focus on more complex functions
    • Appropriate when “a good model” is more important than “the full distribution of the parameters”
  • Measure quality of predictions using loss functions, which we can optimize
  • Bias-Variance Trade-Off
  • Regularization

Suggested reading

Chapter 2 of Friedman et al. (2001)

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer Series in Statistics. Springer, Berlin.