Workflow:DistrictDataLabs Yellowbrick Regression Model Evaluation

Knowledge Sources	Yellowbrick, Yellowbrick Docs, Regressor Visualizers
Domains	Machine_Learning, Regression, Model_Evaluation
Last Updated	2026-02-08 12:00 GMT

Overview

End-to-end process for visually evaluating and diagnosing scikit-learn regression models using Yellowbrick's regressor visualizers.

Description

This workflow covers the standard procedure for evaluating regression models through visual diagnostics. It uses Yellowbrick's regressor visualizers to assess model fit quality, detect heteroskedasticity, identify influential outliers, and tune regularization parameters. The process starts with loading data, proceeds through residual analysis and prediction error plotting, and optionally includes alpha selection for regularized models and Cook's distance analysis for outlier detection.

Key outputs:

Residuals plot showing error distribution against predicted values
Prediction error plot comparing actual vs. predicted values
Alpha selection curve for regularization parameter tuning
Cook's distance plot identifying high-influence data points

Usage

Execute this workflow when you have a dataset with a continuous target variable and a scikit-learn regression estimator, and you need to visually diagnose model fit quality. It is particularly useful for detecting residual patterns that indicate model misspecification, identifying outlying data points, and selecting a regularization strength.

Execution Steps

Step 1: Load and Split Data

Load the regression dataset and split it into training and test sets. Yellowbrick expects a feature matrix X and continuous target vector y in the same format as scikit-learn.

Key considerations:

Use Yellowbrick's built-in loaders (e.g., load_bikeshare, load_concrete, load_energy) for experimentation
Use sklearn's train_test_split with an appropriate test size (e.g., 10-20%)
Consider feature scaling if your model family requires it
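A minimal sketch of this step, assuming synthetic data from scikit-learn's make_regression as a stand-in for a real dataset. The sample count, noise level, and random_state values are illustrative choices, not part of the workflow; a Yellowbrick loader such as load_concrete() could be substituted for the synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for a real regression dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (400, 8) (100, 8)
```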

Step 2: Analyze Residuals

Wrap a regression estimator in Yellowbrick's ResidualsPlot visualizer. Fit on training data, score on test data, and render the residuals plot. This shows the distribution of errors (actual minus predicted) against the predicted values, colored by train/test split.

What to look for:

Heteroskedasticity: error magnitude increasing with predicted value
Patterns in residuals indicating missing nonlinear relationships
Differences between train and test residual distributions (overfit/underfit signals)
The histogram of residuals should be roughly normal and centered on zero if the OLS error assumptions hold

Step 3: Evaluate Prediction Error

Use the PredictionError visualizer to plot actual vs. predicted values. The 45-degree identity line represents perfect predictions; deviations show systematic over- or under-prediction.

What to look for:

Points clustering around the identity line indicate good fit
Systematic deviations reveal regions where the model struggles
Density of predictions in specific value ranges, which shows where the model has many observations to learn from and where it has few

Step 4: Tune Regularization (Optional)

For regularized models (Ridge, Lasso, ElasticNet), use the AlphaSelection visualizer to find the optimal regularization parameter. This wraps cross-validated estimators like RidgeCV and visualizes how error changes across alpha values.

Key considerations:

Alpha and model complexity are inversely related: a larger alpha means stronger regularization and a simpler model
The optimal alpha balances bias and variance
Use np.logspace to generate a range of candidate alpha values

Step 5: Detect Influential Outliers (Optional)

Use the CooksDistance visualizer to identify data points with disproportionate influence on the regression model. A high Cook's distance marks an observation whose removal would substantially change the fitted coefficients.

What to look for:

Points exceeding the influence threshold line (Yellowbrick draws it at the 4/n rule of thumb, where n is the number of observations)
Clusters of influential points may indicate data quality issues
Consider investigating or removing high-influence outliers
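To make the quantity behind the plot concrete, the sketch below computes Cook's distance directly with NumPy using the textbook formula for an OLS fit with an intercept. It is not Yellowbrick's internal code, and the cooks_distances helper, the synthetic data, and the injected outlier are all hypothetical illustrations. Rows are flagged with the common 4/n rule of thumb.

```python
import numpy as np


def cooks_distances(X, y):
    """Cook's distance for each row of an OLS fit with an intercept."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add the intercept column
    p = design.shape[1]                        # number of fitted parameters
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # Leverage is the diagonal of the hat matrix, obtained via a thin QR.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q**2, axis=1)
    mse = residuals @ residuals / (n - p)
    return (residuals**2 / (p * mse)) * leverage / (1.0 - leverage) ** 2


# Assumption: a noisy linear relationship with one injected outlier in row 0.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)
x[0], y[0] = 10.0, 2.0 * 10.0 + 1.0 + 15.0  # far from the data and off the line

distances = cooks_distances(x.reshape(-1, 1), y)
threshold = 4.0 / len(y)                     # 4/n rule of thumb
flagged = np.flatnonzero(distances > threshold)
print("rows above threshold:", flagged)
```

Flagged rows are candidates for inspection, not automatic removal; the threshold is a heuristic.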

Step 6: Render and Compare

Render all visualizations and use the combined diagnostic picture to decide on model improvements: feature engineering, model family changes, regularization, or data cleaning.

Key considerations:

Save plots to disk for reporting via show(outpath="filename.png")
Quick methods (residuals_plot, prediction_error, cooks_distance) enable one-liner comparisons
Iterate on the model based on visual diagnostic findings

GitHub URL

Workflow Repository