
Principle:DistrictDataLabs Yellowbrick Residual Analysis

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Regression, Model_Evaluation
Last Updated 2026-02-08 00:00 GMT

Overview

Residual analysis is a diagnostic technique that evaluates the adequacy of a regression model by examining the differences between observed and predicted values.

Description

In regression modeling, the residual for a given observation is defined as the difference between the predicted value and the actual observed value. Formally, for observation i, the residual is e_i = \hat{y}_i - y_i. Residual analysis involves plotting these residuals against predicted values (or other variables) to detect systematic patterns that would indicate a violation of regression assumptions.
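Under this sign convention (predicted minus observed), residuals can be computed directly with NumPy; a minimal sketch, using small illustrative values rather than any real dataset:

```python
import numpy as np

# Illustrative observed and predicted values (hypothetical data)
y = np.array([3.0, 5.0, 7.1, 9.2])
y_hat = np.array([3.1, 4.8, 7.0, 9.5])

# Residuals under this page's convention: e_i = y_hat_i - y_i
residuals = y_hat - y

# A well-fitted model scatters these around zero with no pattern;
# a near-zero mean residual is a necessary (not sufficient) check.
print(residuals)
print(residuals.mean())
```

In practice the same residual vector is what gets plotted against y_hat to look for structure.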

A well-fitted regression model produces residuals that are randomly scattered around the horizontal axis (zero line) with no discernible pattern. When the residuals display a non-random structure, such as a curved pattern, a funnel shape, or clustering, the model is likely misspecified. For instance, a U-shaped or parabolic pattern in the residuals suggests that a linear model is inadequate and that a polynomial or non-linear model may be more appropriate. A funnel-shaped spread indicates heteroscedasticity, meaning the variance of errors is not constant across the range of predicted values.
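The funnel shape described above can be checked numerically as well as visually. A sketch on synthetic data whose noise grows with x: after fitting a line, the residual spread in the upper half of the range should exceed that in the lower half.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 500)
# Heteroscedastic noise: standard deviation proportional to x
y = 2.0 * x + rng.normal(scale=0.5 * x)

# Fit a simple linear model and compute residuals (predicted minus observed)
slope, intercept = np.polyfit(x, y, 1)
residuals = (slope * x + intercept) - y

# A widening "funnel" means residual spread grows with the predictor:
low_spread = residuals[: len(x) // 2].std()
high_spread = residuals[len(x) // 2 :].std()
print(low_spread < high_spread)
```

Binning residuals this way is a crude stand-in for formal tests of heteroscedasticity, but it mirrors what the eye does with the funnel pattern.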

Beyond the scatter plot of residuals versus predicted values, residual analysis can be augmented with a histogram of residuals to inspect their distribution. Ideally, residuals should follow a normal distribution centered at zero. Alternatively, a Q-Q (quantile-quantile) plot can compare the residual quantiles against those of a standard normal distribution. Departures from the diagonal line in a Q-Q plot signal non-normality in the error terms, which can affect the validity of confidence intervals and hypothesis tests derived from the model.
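The Q-Q comparison can be sketched numerically with SciPy's probplot, which pairs the ordered residuals with theoretical normal quantiles. With synthetic normal residuals, as assumed here, the points should track the diagonal closely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=1000)

# probplot pairs sorted residuals with standard-normal quantiles;
# slope/intercept/r summarize the best-fit line through those points.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# For normal residuals, r should be very close to 1 (points near the diagonal)
print(round(r, 3))
```

Heavy-tailed or skewed residuals would pull r away from 1 and bend the point cloud away from the diagonal at the extremes.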

Usage

Residual analysis should be employed after fitting any regression model as a standard diagnostic step. It is particularly important when:

  • Verifying that the linearity assumption holds for the chosen model
  • Checking for heteroscedasticity (non-constant variance) in the error terms
  • Identifying outliers or influential data points that disproportionately affect the model
  • Comparing the adequacy of training versus test data fits by overlaying residuals from both splits
  • Deciding whether to upgrade from a linear model to a more flexible non-linear approach

Theoretical Basis

The residual for observation i is defined as:

e_i = \hat{y}_i - y_i

where \hat{y}_i is the predicted value from the regression model and y_i is the observed value.

Under the standard linear regression assumptions, the error terms \epsilon_i are assumed to satisfy the following (the first three are the Gauss-Markov conditions; normality is an additional assumption used for inference):

  • Zero mean: E[\epsilon_i] = 0
  • Constant variance (homoscedasticity): Var(\epsilon_i) = \sigma^2 for all i
  • Independence: Cov(\epsilon_i, \epsilon_j) = 0 for i \neq j
  • Normality: \epsilon_i \sim N(0, \sigma^2)

The goodness-of-fit is commonly summarized by the coefficient of determination:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

An R^2 value close to 1 indicates that the model explains most of the variance in the target variable. However, R^2 alone is insufficient; residual plots can reveal model inadequacies that a single summary statistic cannot capture. Anscombe's quartet famously demonstrated that very different datasets can share nearly identical summary statistics, making visual inspection of residuals essential.
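The coefficient of determination can be computed directly from its definition, one minus the ratio of residual to total sum of squares; a NumPy sketch with illustrative values:

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.1, 3.9, 6.2, 7.8])

# R^2 = 1 - SS_res / SS_tot, per the formula above
ss_res = np.sum((y - y_hat) ** 2)   # sum of squared residuals
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 4))  # → 0.995
```

A high value like this says nothing about residual structure, which is exactly why the plots above remain necessary.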

Related Pages

Implemented By
