Principle: DistrictDataLabs Yellowbrick Residual Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Regression, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Residual analysis is a diagnostic technique that evaluates the adequacy of a regression model by examining the differences between observed and predicted values.
Description
In regression modeling, the residual for a given observation is defined as the difference between the predicted value and the actual observed value. Formally, for observation $i$, the residual is $r_i = \hat{y}_i - y_i$. Residual analysis involves plotting these residuals against predicted values (or other variables) to detect systematic patterns that would indicate a violation of regression assumptions.
A well-fitted regression model produces residuals that are randomly scattered around the horizontal axis (zero line) with no discernible pattern. When the residuals display a non-random structure, such as a curved pattern, a funnel shape, or clustering, the model is likely misspecified. For instance, a U-shaped or parabolic pattern in the residuals suggests that a linear model is inadequate and that a polynomial or non-linear model may be more appropriate. A funnel-shaped spread indicates heteroscedasticity, meaning the variance of errors is not constant across the range of predicted values.
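The U-shaped pattern described above can be demonstrated numerically. The following sketch (a hypothetical example using NumPy and scikit-learn, not part of the Yellowbrick API) fits a straight line to deliberately quadratic data; the resulting residuals are strongly correlated with $x^2$, exactly the structured pattern a residual plot would expose:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.shape)  # truly quadratic data

# Fit a (misspecified) straight line and compute residuals r = y_hat - y
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = model.predict(x.reshape(-1, 1)) - y

# A linear fit to quadratic data leaves a U-shaped residual pattern:
# the residuals track -x**2, so they correlate strongly (negatively)
# with x**2 even though the fit minimized squared error.
corr_with_x2 = np.corrcoef(residuals, x**2)[0, 1]
print(f"correlation of residuals with x^2: {corr_with_x2:.2f}")
```

Plotting `residuals` against the predictions would show the parabolic shape directly; the correlation is just a compact numerical summary of it.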
Beyond the scatter plot of residuals versus predicted values, residual analysis can be augmented with a histogram of residuals to inspect their distribution. Ideally, residuals should follow a normal distribution centered at zero. Alternatively, a Q-Q (quantile-quantile) plot can compare the residual quantiles against those of a standard normal distribution. Departures from the diagonal line in a Q-Q plot signal non-normality in the error terms, which can affect the validity of confidence intervals and hypothesis tests derived from the model.
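The histogram and Q-Q checks can be sketched with SciPy's `probplot`, which computes the quantile-quantile comparison against a normal distribution. In this illustrative example the "residuals" are simulated draws standing in for a fitted model's residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=500)  # stand-in for model residuals

# Histogram view: bin the residuals; a well-behaved model yields counts
# roughly symmetric around zero.
counts, edges = np.histogram(residuals, bins=20)

# Q-Q view: probplot compares residual quantiles against a standard
# normal and fits a line; an r value near 1 indicates near-normality.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q correlation r = {r:.3f}")
```

Marked departures of `r` from 1, or visible curvature of the Q-Q points away from the fitted line, signal the non-normality discussed above.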
Usage
Residual analysis should be employed after fitting any regression model as a standard diagnostic step. It is particularly important when:
- Verifying that the linearity assumption holds for the chosen model
- Checking for heteroscedasticity (non-constant variance) in the error terms
- Identifying outliers or influential data points that disproportionately affect the model
- Comparing the adequacy of training versus test data fits by overlaying residuals from both splits
- Deciding whether to upgrade from a linear model to a more flexible non-linear approach
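Yellowbrick's `ResidualsPlot` visualizer automates the train/test comparison in the fourth bullet. The sketch below reproduces the underlying computation with scikit-learn alone (synthetic data and all names here are illustrative, not from the original text):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real regression problem.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Residuals (predicted - observed) for each split; a residual plot
# overlays these two scatters against the predicted values.
res_train = model.predict(X_train) - y_train
res_test = model.predict(X_test) - y_test

# Comparable spread on both splits suggests the model is not overfitting.
print(f"train residual std: {res_train.std():.1f}")
print(f"test residual std:  {res_test.std():.1f}")
```

A markedly larger test-split spread would indicate overfitting; structure in either scatter points back to the misspecification patterns described earlier.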
Theoretical Basis
The residual for observation $i$ is defined as:

$$r_i = \hat{y}_i - y_i$$

where $\hat{y}_i$ is the predicted value from the regression model and $y_i$ is the observed value.
Under the standard linear regression assumptions (the Gauss-Markov conditions), the error terms are assumed to satisfy:
- Zero mean: $\mathbb{E}[\varepsilon_i] = 0$
- Constant variance (homoscedasticity): $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$
- Independence: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
- Normality: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$
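The zero-mean property has a finite-sample counterpart that is easy to verify: when the design matrix includes an intercept column, ordinary least squares residuals sum exactly to zero. A minimal sketch using only NumPy (the data here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)

# Ordinary least squares with an intercept via least squares on [1, x].
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = A @ beta - y  # predicted - observed, matching the text

# With an intercept term, OLS residuals sum to zero by construction,
# a finite-sample analogue of the zero-mean assumption E[e_i] = 0.
print(f"mean residual: {residuals.mean():.2e}")
```

The remaining assumptions (constant variance, independence, normality) have no such algebraic guarantee, which is precisely why they must be checked visually with residual plots.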
The goodness-of-fit is commonly summarized by the coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
An $R^2$ value close to 1 indicates that the model explains most of the variance in the target variable. However, $R^2$ alone is insufficient; residual plots can reveal model inadequacies that a single summary statistic cannot capture. Anscombe's quartet famously demonstrated that very different datasets can share nearly identical summary statistics, making visual inspection of residuals essential.
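The $R^2$ formula can be checked against scikit-learn's `r2_score` on a small made-up example (the numbers below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 11.8])

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"R^2 = {r2_manual:.3f}")
```

The high $R^2$ here says nothing about *where* the prediction errors occur, which is the gap residual plots fill.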