Principle:DistrictDataLabs Yellowbrick Prediction Error Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Regression, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Prediction error analysis is a diagnostic technique that evaluates regression model accuracy by plotting predicted values against actual observed values and comparing the result to the identity line.
Description
In a prediction error plot, the actual target values are placed on the horizontal axis and the corresponding model predictions are placed on the vertical axis. Each observation becomes a single point in this scatter plot. If the model were perfect, every point would fall exactly on the 45-degree identity line $\hat{y} = y$. Deviations from this line reveal the nature and magnitude of prediction errors.
By overlaying both the identity line and a best-fit line through the scatter, an analyst can quickly diagnose systematic bias. When the best-fit line closely follows the identity line, the model is well-calibrated. When the best-fit line diverges, the slope and intercept reveal whether the model is systematically over-predicting or under-predicting. For example, a best-fit line with a slope less than 1 indicates that the model under-predicts high values and over-predicts low values, a phenomenon known as regression toward the mean. Conversely, a slope greater than 1 suggests the opposite pattern.
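The slope-and-intercept diagnosis described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Yellowbrick's implementation: the helper name `best_fit_line` and the sample data are hypothetical, and the predictions are deliberately constructed to be compressed toward the mean so that the fitted slope comes out below 1.

```python
def best_fit_line(actual, predicted):
    """Least-squares fit of predicted on actual: predicted ~ intercept + slope * actual.

    Hypothetical helper for illustration; returns (intercept, slope).
    """
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_y = sum(actual) / n
    mean_p = sum(predicted) / n
    # Sum of squared deviations of the actual values (horizontal axis).
    sxx = sum((y - mean_y) ** 2 for y in actual)
    if sxx == 0:
        raise ValueError("actual values are constant; slope is undefined")
    # Sum of cross-deviations between actual and predicted values.
    sxy = sum((y - mean_y) * (p - mean_p) for y, p in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_y
    return intercept, slope


# Made-up data: predictions shrink toward the mean of the actual values (10.0).
actual = [0.0, 5.0, 10.0, 15.0, 20.0]
predicted = [5.0 + 0.5 * y for y in actual]

intercept, slope = best_fit_line(actual, predicted)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With this data the lowest actual value is over-predicted and the highest is under-predicted, which is the pattern a slope below 1 signals when the best-fit line crosses the identity line inside the data range.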
The prediction error plot also helps detect heteroscedasticity: if the scatter of points fans out or narrows across the range of actual values, the variance of the model errors is not constant. Clusters or gaps in the plot can reveal regions of the target domain where the model performs well or poorly, guiding targeted model improvement.
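A rough numeric counterpart to the visual "fanning out" check is to compare the spread of the errors in different regions of the target range. The sketch below assumes one simple approach (sorting by actual value and splitting into equal-sized groups); the function name `residual_spread_by_bin`, the number of bins, and the data are all hypothetical choices made for illustration.

```python
from statistics import pstdev


def residual_spread_by_bin(actual, predicted, n_bins=3):
    """Return the standard deviation of (predicted - actual) per group.

    Observations are sorted by actual value and split into n_bins
    nearly equal-sized groups, from the lowest actuals to the highest.
    """
    pairs = sorted(zip(actual, predicted))
    n = len(pairs)
    if n_bins < 1 or n < n_bins:
        raise ValueError("need at least one observation per bin")
    spreads = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        residuals = [p - y for y, p in pairs[lo:hi]]
        spreads.append(pstdev(residuals))
    return spreads


# Made-up data: the error magnitude grows in proportion to the actual value.
actual = [float(v) for v in range(1, 13)]
predicted = [y + (0.1 * y if i % 2 == 0 else -0.1 * y) for i, y in enumerate(actual)]

spreads = residual_spread_by_bin(actual, predicted, n_bins=3)
print(spreads)
```

A spread that increases steadily from the first group to the last, as it does for this constructed data, corresponds to points fanning out from left to right in the plot. Roughly equal spreads would be consistent with constant error variance.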
Usage
Prediction error analysis is most useful when:
- Assessing whether a regression model is well-calibrated across the full range of the target variable
- Diagnosing systematic over-prediction or under-prediction bias
- Detecting heteroscedasticity (non-constant error variance) across different regions of the target domain
- Comparing multiple models by overlaying their prediction error plots
- Communicating model performance to stakeholders who may find residual plots less intuitive
Theoretical Basis
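The prediction error plot is based on the relationship between the actual values $y_i$ and the predicted values $\hat{y}_i$. For a perfect model:
$$\hat{y}_i = y_i \quad \text{for all } i$$
which corresponds to the identity line with slope 1 and intercept 0.
The best-fit line through the scatter of $(y_i, \hat{y}_i)$ pairs is computed via ordinary least squares:
$$\hat{y} = \beta_0 + \beta_1 y$$
where:
$$\beta_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \beta_0 = \bar{\hat{y}} - \beta_1 \bar{y}$$
Here $\bar{y}$ and $\bar{\hat{y}}$ are the means of the actual and predicted values over the $n$ observations.
For an ideal model, $\beta_1 = 1$ and $\beta_0 = 0$. Deviations from these values quantify the systematic bias.
The goodness-of-fit is summarized by the $R^2$ score:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$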
Using shared axis limits (so both axes span the same range) creates a square plot where the identity line is a true 45-degree diagonal, making it visually straightforward to assess the magnitude and direction of errors.
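The goodness-of-fit score and the shared axis range can both be computed directly from the actual and predicted values. The sketch below assumes the score is the usual coefficient of determination; the helper names `r2_score` and `shared_limits` and the sample data are hypothetical, and the code illustrates the idea rather than reproducing any library's implementation.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_y = sum(actual) / n
    # Residual sum of squares: squared distances from the identity line.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def shared_limits(actual, predicted):
    """Single (low, high) range covering both axes, giving a square plot."""
    low = min(min(actual), min(predicted))
    high = max(max(actual), max(predicted))
    return low, high


# Made-up data for illustration.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 6.0]

r2 = r2_score(actual, predicted)
limits = shared_limits(actual, predicted)
print(f"R^2 = {r2:.2f}, axis limits = {limits}")
```

Applying the same `limits` tuple to both the horizontal and vertical axes of a plotting library is what makes the identity line run corner to corner at 45 degrees.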