Lecture
2023-10-06
We recorded the number of accidents and average vehicle speed in a city over several months:
| Month | Average Speed (km/h) | Number of Accidents |
|-------|----------------------|---------------------|
| Jan   | 45                   | 10                  |
| Feb   | 50                   | 8                   |
| Mar   | 55                   | 12                  |
| Apr   | 60                   | 15                  |
Today
Motivation
Supervised learning
Loss Functions
Bias-Variance Trade-Off
Wrapup
We want to make predictions about the value of \(y\) given some \(x\).
The true function will be \[ f(x) = 2x + x \sin(2 \pi x) \] but let’s assume we don’t know it.
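The simulation code isn't shown here; below is a minimal sketch of how the x, y, X, and p_base used later could be generated (the sample size, input range, noise level, and seed are all assumptions):

using Distributions, Plots, Random

Random.seed!(1)                        # assumed seed, for reproducibility
f(x) = 2x + x * sin(2π * x)            # the true function
N = 50                                 # assumed sample size
x = rand(Uniform(0, 3), N)             # assumed input range
y = f.(x) .+ rand(Normal(0, 1), N)     # assumed noise scale
X = hcat(ones(N), x)                   # design matrix: intercept column plus x
p_base = scatter(x, y; label="Observations", xlabel="x", ylabel="y")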
\[ y | X \sim \mathcal{N}(X^T \beta, \sigma^2 I) \]
Notation
We will frequently use this linear algebra notation, which is equivalent to writing \[ y_i | X_i \sim \mathcal{N} \left(\sum_{j=1}^J X_{ij} \beta_j, \sigma^2 \right). \]
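A quick numerical check of this equivalence (a toy sketch; the values of X, β, σ, and y below are made up):

using Distributions, LinearAlgebra

X_toy = [1.0 0.5; 1.0 1.5; 1.0 2.5]                  # hypothetical 3×2 design matrix
β_toy, σ_toy, y_toy = [0.2, 1.8], 1.3, [1.0, 3.0, 4.5]
# joint log-density under the multivariate normal notation...
lp_mv = logpdf(MvNormal(X_toy * β_toy, σ_toy^2 * I), y_toy)
# ...equals the sum of the per-observation normal log-densities
lp_uni = sum(logpdf.(Normal.(X_toy * β_toy, σ_toy), y_toy))
lp_mv ≈ lp_uni   # true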
We’ve seen this before:
@model function lin_reg(X, y)
    N, P = size(X)
    β ~ MvNormal(zeros(P), 10 * ones(P)) # prior: βₚ ~ N(0, 10)
    σ ~ InverseGamma(1, 5)
    return y ~ MvNormal(X * β, σ)
end
lin_reg (generic function with 2 methods)
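The sampling call itself isn't shown on the slide; one way to produce a chain with the dimensions below (4 chains of 5,000 draws; the choice of the NUTS sampler here is an assumption) is:

chain = sample(lin_reg(X, y), NUTS(), MCMCThreads(), 5_000, 4)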
Chains MCMC chain (5000×4×4 Array{Float64, 3}):

Iterations        = 1:1:5000
Number of chains  = 4
Samples per chain = 5000
Wall duration     = 7.19 seconds
Compute duration  = 5.81 seconds
parameters        = β[1], β[2], σ
internals         = lp

Summary Statistics
  parameters      mean       std      mcse     ess_bulk     ess_tail      rhat
        β[1]    0.0015    0.2740    0.0033    6826.1387    8822.3325    1.0005
        β[2]    1.7479    0.2267    0.0027    6885.6942    8616.7500    1.0005
           σ    1.3674    0.0994    0.0010   10957.0709   11254.4262    1.0005
                                                              (1 column omitted)

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
        β[1]   -0.5366   -0.1865    0.0046    0.1854    0.5389
        β[2]    1.3019    1.5954    1.7459    1.9002    2.1950
           σ    1.1908    1.2977    1.3601    1.4306    1.5790
We can calculate a point estimate instead of sampling from the posterior. This might make sense if:
We consider two point estimates, the maximum a posteriori (MAP) estimate and the maximum likelihood estimate (MLE):
reg_model = lin_reg(X, y)
θ̂_map = optimize(reg_model, MAP()).values.array
θ̂_mle = optimize(reg_model, MLE()).values.array
f̂_mle(x) = θ̂_mle[1] + θ̂_mle[2] * x
f̂_map(x) = θ̂_map[1] + θ̂_map[2] * x
p2 = plot(p_base)
x_plot = range(minimum(x), maximum(x); length=250)
plot!(p2, x_plot, f̂_mle.(x_plot); label="MLE", linewidth=3)
plot!(p2, x_plot, f̂_map.(x_plot); label="MAP", alpha=0.5, linewidth=2)
Turing.jl hooks into Optim.jl's optimize, so we can plug in our model (with data) and MAP() or MLE(), and it will figure out the loss function and upper/lower bounds for us.
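Roughly speaking, optimize(reg_model, MLE()) is minimizing the negative log-likelihood of the model. Here is a hand-rolled sketch of that same objective (not Turing's actual internals), using the box-constrained optimize pattern that also appears later for the polynomial model:

using Optim, Distributions

# negative log-likelihood of the linear model: mean β₁ + β₂x, noise std σ = θ[3]
nll(θ) = -sum(logpdf.(Normal.(θ[1] .+ θ[2] .* x, θ[3]), y))
res_manual = optimize(nll, [-Inf, -Inf, 0.0], [Inf, Inf, Inf], [0.0, 1.0, 1.0])
res_manual.minimizer   # should be close to θ̂_mle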
One way we can tell that our fit is terrible is by plotting the residuals. We have assumed that the residuals are IID. However, we can see that the residuals are correlated with our predictor!
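A sketch of that residual plot, using the MLE fit from above (the plotting details are assumptions):

residuals_mle = y .- f̂_mle.(x)
scatter(x, residuals_mle; label="Residuals", xlabel="x", ylabel="y - ŷ")
hline!([0]; label="", linestyle=:dash)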
Without changing our data generating process, we can expand our regression model to include more features. For example: \[ \mu_i = \beta_0 + \sum_{p=1}^P \beta_p x_i^p \] but we still have \[ y_i | X_i \sim \mathcal{N}(\mu_i, \sigma^2) \]
function log_lik_poly(β, σ, x, y)
    μ = [sum([βi * xi^(i - 1) for (i, βi) in enumerate(β)]) for xi in x]
    return sum(logpdf.(Normal.(μ, σ), y))
end
order = 7
loss_poly(θ) = -log_lik_poly(θ[1:(order+1)], θ[order+2], x, y)
lower = -Inf * ones(order + 2)
lower[end] = 0
upper = Inf * ones(order + 2)
guess = 0.5 * ones(order + 2)
res = optimize(loss_poly, lower, upper, guess)
θ̂_poly = res.minimizer
f̂_poly(x) = θ̂_poly[1] + sum(θ̂_poly[1+k] * x^k for k in 1:order)
p4 = plot(p_base)
plot!(p4, x_plot, f.(x_plot); label="True Function")
plot!(p4, x_plot, f̂_poly.(x_plot); label="Polynomial Fit K=$order")
What are some possible problems with this approach?
Today
Motivation
Supervised learning
Loss Functions
Bias-Variance Trade-Off
Wrapup
We are generally interested in estimating some function \(f\) that maps inputs \(X\) to outputs \(y\). \[ y = f(X) + \epsilon \]
\(K\) nearest neighbors (KNN): finds the \(K\) training examples that are closest to a given input and returns the average of their outputs.
Note
Nonparametric does not mean that there are no parameters. For example, \(K\) is a parameter!
# data structure to make nearest neighbor search fast -- don't worry about this
kdtree = KDTree(transpose(x))
function knn_predict(xi, k)
    # find the k nearest neighbors
    idxs, _ = knn(kdtree, [xi], k, true)
    # return the average of the y values
    return mean(y[idxs])
end
ŷ_3 = knn_predict.(x_plot, 3)
ŷ_10 = knn_predict.(x_plot, 10)
p5 = plot(p_base)
plot!(p5, x_plot, ŷ_3; label="K=3")
plot!(p5, x_plot, ŷ_10; label="K=10")
What are some problems with this \(K\)NN model?
Parametric methods model the function \(f\) using some parameters \(\theta\). Thus, finding \(\hat{f}\) is equivalent to choosing appropriate \(\theta\).
The linear regression example we’ve been working with is a parametric function approximation.
Today
Motivation
Supervised learning
Loss Functions
Bias-Variance Trade-Off
Wrapup
We need to define what we mean by a “best” approximation.
# Sample residuals
residuals = -2:0.01:2
# Loss functions
mse_loss = residuals .^ 2
mae_loss = abs.(residuals)
huber_loss = [abs(r) <= 1 ? 0.5 * r^2 : abs(r) - 0.5 for r in residuals]
quantile_tau = 0.5 # example quantile value
quantile_loss = [r >= 0 ? quantile_tau * r : (quantile_tau - 1) * r for r in residuals]  # pinball loss: (τ-1)r keeps the loss nonnegative when r < 0
# Plot
plot(residuals, mse_loss; label="MSE", lw=2)
plot!(residuals, mae_loss; label="MAE", lw=2)
plot!(residuals, huber_loss; label="Huber", lw=2)
plot!(residuals, quantile_loss; label="Quantile (τ=0.5)", lw=2)
xlabel!(L"Residual ($y - \hat{y}$)")
ylabel!("Loss")
title!("Loss Functions Visualization")
Today
Motivation
Supervised learning
Loss Functions
Bias-Variance Trade-Off
Wrapup
Key Insight: Every model error can be decomposed into bias, variance, and irreducible error. Balancing bias and variance is crucial for creating models that perform well on both training and unseen data.
The expected prediction error for any machine learning algorithm can be broken down as:
\[ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]
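To see where this comes from, fix \(x\) and write \(y = f(x) + \epsilon\) with \(E[\epsilon] = 0\), \(\operatorname{Var}(\epsilon) = \sigma^2\), and \(\epsilon\) independent of \(\hat{f}(x)\); the cross terms vanish when we expand the square: \[ \begin{aligned} E\left[\left(y - \hat{f}(x)\right)^2\right] &= E\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \sigma^2 \\ &= \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2 \\ &= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}. \end{aligned} \]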
Bias: How much on average are the predicted values different from the actual values? Represents errors due to overly simplistic assumptions in the learning algorithm. \[ \text{Bias}(\hat{f}(x)) = E[\hat{f}(x) - f(x)] \]
Variance: How much does the prediction for a given point vary between different realizations of the model? Represents errors due to the model’s sensitivity to small fluctuations in the training set. \[ \text{Variance}(\hat{f}(x)) = E[\hat{f}(x)^2] - E[\hat{f}(x)]^2 \]
Irreducible Error: Noise inherent in any real-world data that we cannot remove, no matter how good our model is.
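We can estimate the first two pieces by simulation: refit the same model on many fresh datasets drawn from the data-generating process and look at the spread of predictions at a fixed point. A sketch, reusing the assumed f, input range, and noise level from the simulation sketch earlier (the evaluation point x0 and number of replications are arbitrary):

using Statistics

x0 = 1.5                                   # evaluate bias and variance at this point
n_sims, N_train = 500, 50
preds = map(1:n_sims) do _
    x_s = rand(Uniform(0, 3), N_train)
    y_s = f.(x_s) .+ rand(Normal(0, 1), N_train)
    β_s = hcat(ones(N_train), x_s) \ y_s   # ordinary least squares fit
    β_s[1] + β_s[2] * x0                   # prediction at x0
end
bias_sq = (mean(preds) - f(x0))^2          # squared bias of the linear fit at x0
variance = var(preds)                      # variance of the fit at x0

Under these assumptions, the linear fit's error is mostly bias (it cannot capture the sine term), whereas a very flexible model would shift the balance toward variance.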
We can think about Bayesian methods through the lens of the bias-variance trade-off
Ridge Regression modifies linear regression to include a regularization term. The regularization term discourages overly complex models, which can overfit the training data.
\[ L(\beta) = \| Y - X^T \beta \|_2^2 + \lambda \| \beta \|_2^2 \] where \(\lambda\) is the regularization parameter and where \[ \| x \|_2 = \sqrt{\sum_{i=1}^n x_i^2} \] is the L2 norm.
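For a fixed \(\lambda\), this objective has a closed-form minimizer, \(\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y\) (with rows of \(X\) as observations, as in our code). A minimal sketch; the λ value is arbitrary, and note that this version also penalizes the intercept, which is often left unpenalized:

using LinearAlgebra

ridge_fit(X, y, λ) = (X' * X + λ * I) \ (X' * y)   # (XᵀX + λI)⁻¹ Xᵀy
β̂_ridge = ridge_fit(X, y, 10.0)                    # λ = 10, chosen arbitrarily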
Lasso Regression (Least Absolute Shrinkage and Selection Operator) includes an L1 penalty.
\[ L(\beta) = \| Y - X^T \beta \|_2^2 + \lambda \| \beta \|_1 \]
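The L1 penalty has no closed-form solution, but for a small problem we can minimize the objective numerically. A sketch (λ is again arbitrary; Nelder–Mead is a derivative-free method that tolerates the non-smooth penalty here, though dedicated solvers such as coordinate descent are used in practice):

λ_lasso = 1.0
lasso_loss(β) = sum(abs2, y .- X * β) + λ_lasso * sum(abs, β)
β̂_lasso = optimize(lasso_loss, zeros(size(X, 2)), NelderMead()).minimizer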
Today
Motivation
Supervised learning
Loss Functions
Bias-Variance Trade-Off
Wrapup
Chapter 2 of Friedman et al. (2001)