Principle:DistrictDataLabs Yellowbrick Prediction Error Analysis

Knowledge Sources	Yellowbrick Docs Yellowbrick
Domains	Machine_Learning, Regression, Model_Evaluation
Last Updated	2026-02-08 00:00 GMT

Overview

Prediction error analysis is a diagnostic technique that evaluates regression model accuracy by plotting predicted values against actual observed values and comparing the result to the identity line.

Description

In a prediction error plot, the actual target values $y$ are placed on the horizontal axis and the corresponding model predictions $\hat{y}$ are placed on the vertical axis. Each observation becomes a single point in this scatter plot. If the model were perfect, every point would fall exactly on the 45-degree identity line $\hat{y} = y$. Deviations from this line reveal the nature and magnitude of prediction errors.

By overlaying both the identity line and a best-fit line through the scatter, an analyst can quickly diagnose systematic bias. When the best-fit line closely follows the identity line, the model is well-calibrated. When the two lines diverge, the slope and intercept of the best-fit line show where the model over-predicts and where it under-predicts. For example, a slope less than 1 means the predictions are compressed toward the mean: when the two lines cross inside the data range, the model over-predicts low values and under-predicts high values, a pattern related to regression toward the mean. Conversely, a slope greater than 1 means the predictions are stretched away from the mean, so low values are under-predicted and high values are over-predicted. A slope near 1 with a nonzero intercept indicates a roughly constant offset across the whole range.

The prediction error plot also helps detect heteroscedasticity: if the scatter of points fans out or narrows across the range of actual values, the variance of the model errors is not constant. Clusters or gaps in the plot can reveal regions of the target domain where the model performs well or poorly, guiding targeted model improvement.
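The fan-shaped pattern described above can also be checked numerically. The following is a minimal sketch, not a formal test such as Breusch-Pagan: it sorts observations by actual value, splits them into a lower and an upper half, and compares the standard deviation of the residuals in each half. The function name `residual_spread_by_half` and the toy data are made up for illustration.

```python
from statistics import pstdev


def residual_spread_by_half(y_true, y_pred):
    """Return (low_sd, high_sd): residual spread in the lower and upper
    halves of the data, ordered by actual value.

    A crude heteroscedasticity indicator; very different values suggest
    that the error variance changes across the target range.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) < 2:
        raise ValueError("need at least two observations")

    # Sort (actual, predicted) pairs by the actual value.
    pairs = sorted(zip(y_true, y_pred))
    mid = len(pairs) // 2

    # Residual = predicted - actual, i.e. vertical distance from the identity line.
    low = [pred - actual for actual, pred in pairs[:mid]]
    high = [pred - actual for actual, pred in pairs[mid:]]
    return pstdev(low), pstdev(high)


# Toy data (illustrative only): small errors for small targets,
# errors ten times larger for large targets.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
errors = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
y_pred = [y + e for y, e in zip(y_true, errors)]

low_sd, high_sd = residual_spread_by_half(y_true, y_pred)
print(f"residual spread, lower half: {low_sd:.3f}")
print(f"residual spread, upper half: {high_sd:.3f}")
```

On this toy data the upper half shows a much larger spread than the lower half, which corresponds to points fanning out to the right in the plot. A two-way split is the coarsest possible choice; more bins give a finer picture at the cost of noisier estimates.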

Usage

Prediction error analysis is most useful when:

Assessing whether a regression model is well-calibrated across the full range of the target variable
Diagnosing systematic over-prediction or under-prediction bias
Detecting heteroscedasticity (non-constant error variance) across different regions of the target domain
Comparing multiple models by overlaying their prediction error plots
Communicating model performance to stakeholders who may find residual plots less intuitive

Theoretical Basis

The prediction error plot is based on the relationship between actual and predicted values. For a perfect model:

${\hat{y}}_{i} = y_{i} \quad \forall i$

which corresponds to the identity line with slope 1 and intercept 0.

The best-fit line through the scatter of $(y_{i}, {\hat{y}}_{i})$ pairs is computed via ordinary least squares:

$\hat{y} = \beta_{0} + \beta_{1} y$

where:

$\beta_{1} = \frac{\sum_{i = 1}^{n} (y_{i} - \bar{y}) ({\hat{y}}_{i} - \bar{\hat{y}})}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}$

$\beta_{0} = \bar{\hat{y}} - \beta_{1} \bar{y}$

For an ideal model, $\beta_{1} = 1$ and $\beta_{0} = 0$. Deviations from these values quantify the systematic bias.
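The slope and intercept formulas above translate directly into a few lines of code. The sketch below uses only the Python standard library; the function name `best_fit_line` and the toy data are assumptions made for illustration and are not taken from any library.

```python
from statistics import fmean


def best_fit_line(y_true, y_pred):
    """Return (beta0, beta1) of the OLS line y_pred ~ beta0 + beta1 * y_true.

    Actual values play the role of the regressor (horizontal axis) and
    predictions the role of the response (vertical axis).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    y_bar = fmean(y_true)      # mean of actual values
    yhat_bar = fmean(y_pred)   # mean of predictions

    # Numerator and denominator of the slope formula.
    s_xy = sum((y - y_bar) * (p - yhat_bar) for y, p in zip(y_true, y_pred))
    s_xx = sum((y - y_bar) ** 2 for y in y_true)
    if s_xx == 0:
        raise ValueError("y_true has zero variance; the slope is undefined")

    beta1 = s_xy / s_xx
    beta0 = yhat_bar - beta1 * y_bar
    return beta0, beta1


# Toy data (illustrative only): predictions shrunk halfway toward 5.0.
y_true = [1.0, 3.0, 5.0, 7.0, 9.0]
y_pred = [5.0 + 0.5 * (y - 5.0) for y in y_true]   # 3, 4, 5, 6, 7

beta0, beta1 = best_fit_line(y_true, y_pred)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")
```

For this toy data the slope is 0.5 and the intercept is 2.5: the best-fit line crosses the identity line at 5.0, with low values over-predicted and high values under-predicted.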

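The goodness-of-fit is summarized by the coefficient of determination, the $R^{2}$ score, which is computed from the model's own predictions rather than from the best-fit line: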

$R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}$
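As a worked instance of this formula, the sketch below computes $R^{2}$ with the standard library only. The function name `r_squared` and the toy data are illustrative assumptions; in practice a library routine would normally be used instead.

```python
from statistics import fmean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    y_bar = fmean(y_true)
    # Squared vertical distances from the identity line.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the actual values around their mean.
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy data (illustrative only): predictions shrunk halfway toward 5.0.
y_true = [1.0, 3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 4.0, 5.0, 6.0, 7.0]

score = r_squared(y_true, y_pred)
print(f"R^2 = {score:.2f}")
```

Here the residual sum of squares is 10 and the total sum of squares is 40, so $R^{2} = 0.75$. Every point in this toy data lies exactly on a straight line, yet the score is below 1 because the errors are measured against the actual values, so the best-fit line and $R^{2}$ answer different questions.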

Using shared axis limits (so both axes span the same range) creates a square plot where the identity line is a true 45-degree diagonal, making it visually straightforward to assess the magnitude and direction of errors.
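One way to obtain shared limits is to take the overall minimum and maximum across both actual and predicted values and apply them to both axes. The sketch below is an assumption about how this could be done by hand, not a description of any particular library's implementation. The helper names `shared_limits` and `square_axes` are hypothetical; `square_axes` expects an object with the Matplotlib `Axes` methods `set_xlim`, `set_ylim`, `set_aspect`, and `plot`.

```python
def shared_limits(y_true, y_pred):
    """Return one (low, high) range that covers both actuals and predictions."""
    low = min(min(y_true), min(y_pred))
    high = max(max(y_true), max(y_pred))
    return low, high


def square_axes(ax, y_true, y_pred):
    """Apply the same limits to both axes and draw the identity line.

    `ax` is expected to behave like a matplotlib Axes object.
    """
    low, high = shared_limits(y_true, y_pred)
    ax.set_xlim(low, high)
    ax.set_ylim(low, high)
    # Equal data scaling makes the identity line a true 45-degree diagonal.
    ax.set_aspect("equal", adjustable="box")
    ax.plot([low, high], [low, high], linestyle="--", label="identity")
    return low, high


# Toy data (illustrative only).
limits = shared_limits([1.0, 3.0, 5.0], [2.0, 2.5, 6.5])
print(limits)
```

With Matplotlib installed, `square_axes` could be called on an axes object returned by `plt.subplots()` after the scatter of actual versus predicted values has been drawn.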

Related Pages

Implemented By

Implementation:DistrictDataLabs_Yellowbrick_PredictionError_Visualizer
