How can I calculate the bias-variance trade-off for my algorithm on my dataset?

Dr Dilek Celik
Nov 1, 2024
7 min read

"In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias-variance trade-off in mind."
Source: Page 36, An Introduction to Statistical Learning with Applications in R, 2014.

Technically, we cannot perform this calculation. We cannot determine the actual bias and variance for a predictive modeling problem because the true mapping function is unknown. Instead, we utilize the concepts of bias, variance, irreducible error, and the bias-variance trade-off as tools to aid in model selection, configuration, and result interpretation.

The mlxtend library, developed by Sebastian Raschka, offers the bias_variance_decomp() function, which can estimate the bias and variance for a model using multiple bootstrap samples.

To get started, you need to install the mlxtend library. For example:

pip install mlxtend

Here’s an example that demonstrates loading the Boston housing dataset directly from a URL, splitting it into training and test sets, and then estimating the mean squared error (MSE) for a linear regression model, as well as the bias and variance of the model error over 200 bootstrap samples.

# Estimate the bias and variance for a regression model

# Import libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp

# Load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)

# Separate into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Define the model
model = LinearRegression()

# Estimate bias and variance
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test, loss='mse', num_rounds=200, random_seed=1)

# Summarize results
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)

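In this case, the estimated bias term is much larger than the variance, a pattern that is common for a linear regression model. Note that with loss='mse', the value mlxtend reports as bias is the average squared bias measured against the observed test targets, so it also absorbs any irreducible noise in those targets. The estimated bias and variance sum to the model's estimated error, for example, 20.418 (bias) + 1.674 (variance) = 22.092 (total error).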
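The additive relationship between error, bias, and variance can also be checked by hand. The following is a minimal sketch, not mlxtend's actual code: it assumes synthetic data and plain least squares via NumPy instead of the housing data and scikit-learn, and the function name bias_variance_by_hand is hypothetical. As far as I understand, it follows the same general recipe as bias_variance_decomp() for squared-error loss: fit on bootstrap samples, average the predictions, and compare.

```python
import numpy as np

def bias_variance_by_hand(X_train, y_train, X_test, y_test, num_rounds=200, seed=1):
    """Bootstrap estimate of MSE, squared bias, and variance for least squares."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    # Prepend a column of ones so the fit includes an intercept
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    all_pred = np.empty((num_rounds, len(y_test)))
    for r in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        A_boot = np.column_stack([np.ones(n), X_train[idx]])
        coef, *_ = np.linalg.lstsq(A_boot, y_train[idx], rcond=None)
        all_pred[r] = A_test @ coef
    main_pred = all_pred.mean(axis=0)                # average prediction per test point
    mse = np.mean((all_pred - y_test) ** 2)          # average loss over rounds and points
    bias_sq = np.mean((main_pred - y_test) ** 2)     # measured against observed targets
    variance = np.mean((all_pred - main_pred) ** 2)  # spread around the average prediction
    return mse, bias_sq, variance

# Synthetic regression problem (an assumption made only for this illustration)
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + data_rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

mse, bias_sq, variance = bias_variance_by_hand(X_train, y_train, X_test, y_test)
print('MSE: %.3f' % mse)
print('Bias^2: %.3f' % bias_sq)
print('Variance: %.3f' % variance)
print('Bias^2 + Variance: %.3f' % (bias_sq + variance))
```

Because the squared bias here is measured against noisy observed targets rather than the unknown true function, it includes the irreducible error, which is why the two printed components add up to the full MSE.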

Visualisation of Bias and Variance with Bootstrap Sampling

The code in this section estimates the bias and variance for a regression model using bootstrap sampling, repeats the estimate with different random seeds, and plots the results. The bias-variance decomposition helps to understand how well a model generalizes to unseen data by breaking down the prediction error into bias and variance components.

Bootstrap Sampling: Bootstrap is a resampling technique that estimates statistics of a population by repeatedly sampling a dataset with replacement. Each bootstrap sample acts as an alternative training set, so fitting the model on many of them shows how its predictions change from one training set to another, which is what makes it possible to estimate bias and variance.

Bias-Variance Decomposition: Bias measures how far the model's average prediction, taken over different training sets, is from the actual value; it reflects the simplifying assumptions built into the model. Variance measures how much the predictions for a given point vary between different training sets. The expected Mean Squared Error (MSE) is the sum of bias squared and variance, plus irreducible error.
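Written out, the decomposition described above is the standard identity for squared-error loss. The notation is an assumption introduced here for illustration: x_0 is a test input, y_0 = f(x_0) + ε is its observed target, \hat{f} is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \big[\operatorname{Bias}(\hat{f}(x_0))\big]^2
  + \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \operatorname{Var}(\varepsilon),
\qquad
\operatorname{Bias}(\hat{f}(x_0)) = \mathbb{E}\big[\hat{f}(x_0)\big] - f(x_0)
```

The last term is the irreducible error: no choice of model can remove it.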

# Estimate the bias and variance for a regression model
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)

# Separate into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Define the model
model = LinearRegression()

# Number of repeated decompositions; each call below draws 200 bootstrap samples
num_rounds = 100

# Initialize lists to store results
mse_list = []
bias_list = []
variance_list = []

# Repeat the estimate, changing only the random seed on each repetition
for i in range(num_rounds):
    mse, bias, variance = bias_variance_decomp(model, X_train, y_train, X_test, y_test, loss='mse', num_rounds=200, random_seed=i)
    mse_list.append(mse)
    bias_list.append(bias)
    variance_list.append(variance)

# Convert to numpy arrays for easier manipulation
mse_array = np.array(mse_list)
bias_array = np.array(bias_list)
variance_array = np.array(variance_list)

# Plot the results
plt.figure(figsize=(12, 6))

plt.plot(range(num_rounds), mse_array, label='MSE', color='blue', linestyle='--', marker='o')
plt.plot(range(num_rounds), bias_array, label='Bias^2', color='red', linestyle='--', marker='o')
plt.plot(range(num_rounds), variance_array, label='Variance', color='green', linestyle='--', marker='o')

plt.xlabel('Repetition (random seed)')
plt.ylabel('Error')
plt.title('Bias-Variance Decomposition over Repeated Bootstrap Estimates')
plt.legend()
plt.grid(True)
plt.show()

This code estimates and visualizes the bias and variance for a linear regression model using bootstrap sampling. Each of the 100 points is a separate decomposition that draws 200 bootstrap samples with a different random seed, so the plot shows how stable the estimates are rather than how they change with model complexity.

The plot illustrates the bias-variance decomposition over these 100 repeated runs for a linear regression model. Here’s an interpretation of the graph:

1. MSE (Mean Squared Error):

Represented by the blue dots.
It remains fairly consistent across the repeated runs, hovering around a value slightly above 20. This indicates that the estimated overall error of the model does not depend much on which bootstrap samples happen to be drawn.

2. Bias^2:

Represented by the red dots.
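The bias squared is consistently around 20 and accounts for most of the MSE. This suggests that the model may be underfitting, meaning it is too simple to capture the underlying patterns in the data accurately. Part of this term can also be irreducible noise, because the estimate is computed against observed targets rather than the unknown true function.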

3. Variance:

Represented by the green dots.
The variance is consistently low, remaining well below 5. This low variance indicates that the model's predictions are stable and do not vary much with different training data. This is typical for a linear regression model, which tends to have low variance and high bias.

4. Total Error:

The total error (MSE) is the sum of the bias squared and the variance in each run. This is a key aspect of the bias-variance tradeoff: the overall error is a combination of these two components.

Key Takeaways:

High Bias: The consistently high bias squared (~20) suggests that the linear regression model is too simple for the data, leading to underfitting.
Low Variance: The low variance indicates that the model's predictions are not sensitive to changes in the training data.
Consistent Total Error: The total error (MSE) remains stable across the repeated runs, reflecting the combined effects of high bias and low variance.

Conclusion:

For this linear regression model, the high bias and low variance suggest that the model is underfitting the data. To improve the model, you might consider increasing its complexity (e.g., using polynomial regression or a more complex model) to reduce the bias, even if it slightly increases the variance. The goal is not to minimize bias and variance separately, which is generally impossible, but to find the balance between them that gives the lowest total error (MSE).

Bias-Variance Tradeoff and Model Complexity

Below is a Python code sample that generates a diagram illustrating the concepts of bias, variance, underfitting, and overfitting using matplotlib and numpy.

The plot will include:

Training error
Testing error
Bias (squared)
Variance
The regions of underfitting and overfitting

Make sure you have matplotlib and numpy installed (for example, with pip install matplotlib numpy) before running the code:

import numpy as np
import matplotlib.pyplot as plt

# Generate model complexity values
complexity = np.linspace(0, 10, 400)

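# Generate synthetic error curves (illustrative shapes, not estimates from data)
bias_squared = (10 - complexity) ** 2 / 10   # falls as complexity grows
variance = complexity ** 2 / 10              # rises as complexity grows
irreducible_error = 1
# Training error keeps falling with complexity and stays below the testing error
training_error = bias_squared + irreducible_error * (1 - complexity / 10)
testing_error = bias_squared + variance + irreducible_error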

# Find the optimal model complexity (minimum testing error)
optimal_complexity = complexity[np.argmin(testing_error)]

# Plotting the diagram
plt.figure(figsize=(12, 8))
plt.plot(complexity, training_error, label='Training Error', color='blue')
plt.plot(complexity, testing_error, label='Testing Error', color='red')
plt.plot(complexity, bias_squared, label='Bias^2', color='green', linestyle='--')
plt.plot(complexity, variance, label='Variance', color='orange', linestyle='--')

# Highlight underfitting and overfitting regions
plt.axvspan(0, 3, color='red', alpha=0.1, label='Underfitting')
plt.axvspan(7, 10, color='red', alpha=0.1, label='Overfitting')

# Show optimal model complexity
plt.axvline(x=optimal_complexity, color='gray', linestyle='--', label='Optimal Model Complexity')

# Add labels and title
plt.xlabel('Model Complexity')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff and Model Complexity')
plt.legend()
plt.grid(True)
plt.show()

The provided plot illustrates the bias-variance tradeoff and the effect of model complexity on training and testing errors. Here's a detailed explanation of the plot:

Components of the Plot

1. X-Axis (Model Complexity): This axis represents the complexity of the model, which could be, for instance, the degree of a polynomial in polynomial regression. Higher values indicate more complex models.

2. Y-Axis (Error): This axis represents the error, typically measured as Mean Squared Error (MSE) for both training and testing datasets.

3. Curves:

- Training Error (Blue Line): Shows the error on the training data. As model complexity increases, the training error usually decreases because the model fits the training data more closely.

- Testing Error (Red Line): Shows the error on the testing data. Initially, as model complexity increases, the testing error decreases, indicating better generalization. However, after a certain point, the testing error starts to increase, indicating overfitting.

- Bias^2 (Green Dashed Line): Represents the bias error, which decreases as model complexity increases. High bias (simple models) leads to underfitting.

- Variance (Orange Dashed Line): Represents the variance error, which increases as model complexity increases. High variance (complex models) leads to overfitting.

4. Shaded Regions:

- Underfitting (Left Red Shaded Area): This region indicates low model complexity where both training and testing errors are high. The model is too simple to capture the underlying patterns in the data.

- Overfitting (Right Red Shaded Area): This region indicates high model complexity where the training error is low, but the testing error is high. The model captures noise and fluctuations in the training data, reducing its generalization capability.

5. Optimal Model Complexity (Vertical Gray Dashed Line): This line marks the point of optimal model complexity where the testing error is minimized. It represents the best tradeoff between bias and variance.

Key Takeaways

- Bias-Variance Tradeoff: The plot visualizes the tradeoff between bias and variance as model complexity increases. Low complexity models have high bias and low variance (underfitting), while high complexity models have low bias and high variance (overfitting).

- Optimal Complexity: The optimal point where the testing error is minimized shows the best balance between bias and variance. Models at this point are expected to generalize well to new, unseen data.

- Training vs. Testing Error: The training error decreases monotonically with increasing complexity, but the testing error initially decreases and then increases, forming a U-shaped curve. This highlights the importance of selecting an appropriate model complexity to avoid both underfitting and overfitting.

Understanding this plot helps in selecting the right model complexity to achieve the best generalization performance on new data.
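The curves in the diagram are stylized, so it can help to measure training and testing error on actual fits. The following is a minimal sketch under stated assumptions: the data are synthetic (a noisy sine curve), model complexity is the degree of a polynomial fitted with NumPy's polyfit, and the sample sizes and noise level are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy observations of a sine curve (hypothetical ground truth)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

degrees = list(range(10))  # polynomial degree plays the role of model complexity
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    print('degree %d: train MSE %.3f, test MSE %.3f' % (d, train_mse[-1], test_mse[-1]))

# The degree with the lowest test error is the empirical "optimal complexity"
best_degree = degrees[int(np.argmin(test_mse))]
print('Lowest test MSE at degree %d' % best_degree)
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training error cannot go up as the degree increases. The testing error carries no such guarantee, and it typically starts to rise once the extra flexibility is spent on fitting noise.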

AI Consultant, Dr DILEK CELIK, PhD
