Principle: DistrictDataLabs Yellowbrick Residual Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Regression, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Residual analysis is a diagnostic technique that evaluates the adequacy of a regression model by examining the differences between observed and predicted values.
Description
In regression modeling, the residual for a given observation is defined as the difference between the predicted value and the actual observed value. Formally, for observation $i$, the residual is $r_i = \hat{y}_i - y_i$. Residual analysis involves plotting these residuals against predicted values (or other variables) to detect systematic patterns that would indicate a violation of regression assumptions.
A well-fitted regression model produces residuals that are randomly scattered around the horizontal axis (zero line) with no discernible pattern. When the residuals display a non-random structure, such as a curved pattern, a funnel shape, or clustering, the model is likely misspecified. For instance, a U-shaped or parabolic pattern in the residuals suggests that a linear model is inadequate and that a polynomial or non-linear model may be more appropriate. A funnel-shaped spread indicates heteroscedasticity, meaning the variance of errors is not constant across the range of predicted values.
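The U-shaped pattern described above can be demonstrated numerically. The following sketch (a hypothetical example using NumPy and scikit-learn, not part of the Yellowbrick API) fits a straight line to deliberately quadratic data; the resulting residuals are strongly correlated with $x^2$, exactly the structured pattern a residual plot would expose:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.shape)  # truly quadratic data

# Fit a (misspecified) straight line and compute residuals r = y_hat - y
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = model.predict(x.reshape(-1, 1)) - y

# A linear fit to quadratic data leaves a U-shaped residual pattern:
# the residuals track -x**2, so they correlate strongly (negatively)
# with x**2 even though the fit minimized squared error.
corr_with_x2 = np.corrcoef(residuals, x**2)[0, 1]
print(f"correlation of residuals with x^2: {corr_with_x2:.2f}")
```

Plotting `residuals` against the predictions would show the parabolic shape directly; the correlation is just a compact numerical summary of it.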
Beyond the scatter plot of residuals versus predicted values, residual analysis can be augmented with a histogram of residuals to inspect their distribution. Ideally, residuals should follow a normal distribution centered at zero. Alternatively, a Q-Q (quantile-quantile) plot can compare the residual quantiles against those of a standard normal distribution. Departures from the diagonal line in a Q-Q plot signal non-normality in the error terms, which can affect the validity of confidence intervals and hypothesis tests derived from the model.
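The histogram and Q-Q checks can be sketched with SciPy's `probplot`, which computes the quantile-quantile comparison against a normal distribution. In this illustrative example the "residuals" are simulated draws standing in for a fitted model's residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=500)  # stand-in for model residuals

# Histogram view: bin the residuals; a well-behaved model yields counts
# roughly symmetric around zero.
counts, edges = np.histogram(residuals, bins=20)

# Q-Q view: probplot compares residual quantiles against a standard
# normal and fits a line; an r value near 1 indicates near-normality.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q correlation r = {r:.3f}")
```

Marked departures of `r` from 1, or visible curvature of the Q-Q points away from the fitted line, signal the non-normality discussed above.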
Usage
Residual analysis should be employed after fitting any regression model as a standard diagnostic step. It is particularly important when:
- Verifying that the linearity assumption holds for the chosen model
- Checking for heteroscedasticity (non-constant variance) in the error terms
- Identifying outliers or influential data points that disproportionately affect the model
- Comparing the adequacy of training versus test data fits by overlaying residuals from both splits
- Deciding whether to upgrade from a linear model to a more flexible non-linear approach
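Yellowbrick's `ResidualsPlot` visualizer automates the train/test comparison in the fourth bullet. The sketch below reproduces the underlying computation with scikit-learn alone (synthetic data and all names here are illustrative, not from the original text):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real regression problem.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Residuals (predicted - observed) for each split; a residual plot
# overlays these two scatters against the predicted values.
res_train = model.predict(X_train) - y_train
res_test = model.predict(X_test) - y_test

# Comparable spread on both splits suggests the model is not overfitting.
print(f"train residual std: {res_train.std():.1f}")
print(f"test residual std:  {res_test.std():.1f}")
```

A markedly larger test-split spread would indicate overfitting; structure in either scatter points back to the misspecification patterns described earlier.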
Theoretical Basis
The residual for observation $i$ is defined as:

$$r_i = \hat{y}_i - y_i$$

where $\hat{y}_i$ is the predicted value from the regression model and $y_i$ is the observed value.
Under the standard linear regression assumptions (the Gauss-Markov conditions), the error terms are assumed to satisfy:
- Zero mean: $\mathbb{E}[\varepsilon_i] = 0$
- Constant variance (homoscedasticity): $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$
- Independence: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
- Normality: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$
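The zero-mean property has a finite-sample counterpart that is easy to verify: when the design matrix includes an intercept column, ordinary least squares residuals sum exactly to zero. A minimal sketch using only NumPy (the data here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)

# Ordinary least squares with an intercept via least squares on [1, x].
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = A @ beta - y  # predicted - observed, matching the text

# With an intercept term, OLS residuals sum to zero by construction,
# a finite-sample analogue of the zero-mean assumption E[e_i] = 0.
print(f"mean residual: {residuals.mean():.2e}")
```

The remaining assumptions (constant variance, independence, normality) have no such algebraic guarantee, which is precisely why they must be checked visually with residual plots.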
The goodness-of-fit is commonly summarized by the coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
An $R^2$ value close to 1 indicates that the model explains most of the variance in the target variable. However, $R^2$ alone is insufficient; residual plots can reveal model inadequacies that a single summary statistic cannot capture. Anscombe's quartet famously demonstrated that very different datasets can share nearly identical summary statistics, making visual inspection of residuals essential.
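The $R^2$ formula can be checked against scikit-learn's `r2_score` on a small made-up example (the numbers below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 11.8])

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"R^2 = {r2_manual:.3f}")
```

The high $R^2$ here says nothing about *where* the prediction errors occur, which is the gap residual plots fill.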