Principle:DistrictDataLabs Yellowbrick Influential Outlier Detection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Regression, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Influential outlier detection identifies individual data points that disproportionately affect the estimated parameters of a regression model, using Cook's Distance as the primary diagnostic measure.
Description
In regression analysis, not all observations contribute equally to the fitted model. Some data points, by virtue of their position in the feature space or their extreme target values, can have an outsized influence on the estimated regression coefficients. Removing or including such points can substantially change the model's predictions. Identifying these influential observations is critical for building robust and reliable regression models.
Cook's Distance is the most widely used measure of influence. It quantifies, for each observation, how much the entire set of fitted values would change if that observation were removed from the dataset. A large Cook's Distance indicates that the observation is highly influential. The measure combines two concepts: the leverage of the observation (how far it is from the center of the feature space) and the magnitude of its residual (how far the actual value is from the predicted value). A point with high leverage and a large residual will have a large Cook's Distance, making it a strong candidate for investigation.
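The leave-one-out interpretation above can be demonstrated directly. The sketch below (synthetic data; all names are illustrative) refits an ordinary least-squares model once per deleted observation and measures how much the full vector of fitted values moves, which is exactly what Cook's Distance quantifies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression with one planted influential point (hypothetical data).
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
X[-1, 1], y[-1] = 5.0, -10.0  # high leverage combined with a large residual

p = X.shape[1]                            # number of parameters
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef                          # fitted values from the full model
mse = np.sum((y - y_hat) ** 2) / (n - p)  # s^2, the model's mean squared error

# Cook's Distance by definition: delete observation i, refit, and measure
# how far all n fitted values shift, scaled by p * MSE.
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    y_hat_i = X @ coef_i                  # fitted values for ALL points, reduced fit
    D[i] = np.sum((y_hat - y_hat_i) ** 2) / (p * mse)

print(D.argmax())  # index of the most influential observation
```

The planted point dominates because it has both high leverage (far from the center of the feature space) and a large residual, matching the two ingredients described above.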
The standard rule of thumb for flagging influential observations is $D_i > 4/n$, where $n$ is the number of observations. Points exceeding this threshold warrant closer inspection. They may represent data entry errors, measurement anomalies, or genuinely unusual cases that the model should perhaps account for differently. It is important to note that influential points are not necessarily "bad" data; they may represent real phenomena that deserve special attention.
Usage
Influential outlier detection via Cook's Distance should be applied when:
- Performing initial exploratory analysis on regression data to identify potentially problematic observations
- Diagnosing unexpected model behavior or poor generalization performance
- Cleaning data before training a production regression model
- Assessing the stability of regression coefficients by understanding which points drive the estimates
- Comparing the robustness of different regression specifications
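A minimal data-cleaning workflow along these lines might look as follows. This is a sketch using the closed-form expression for Cook's Distance on synthetic data (the planted anomaly at index 10 and all variable names are assumptions, not part of any library API); flagged points should be investigated, not dropped automatically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a planted anomaly (a stand-in for a data-entry error).
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.7 * x + rng.normal(scale=0.4, size=n)
y[10] += 8.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_ii
resid = y - H @ y                        # residuals e_i
s2 = resid @ resid / (n - p)             # mean squared error of the full fit
D = resid**2 * h / (p * s2 * (1 - h) ** 2)   # Cook's Distance, closed form

flagged = np.flatnonzero(D > 4 / n)      # 4/n rule of thumb
print("flagged:", flagged)               # inspect these before removing anything

# Refit only after confirming the flagged points are genuine errors.
keep = D <= 4 / n
coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
```

Dropping every flagged point by default would contradict the caution above that influential points are not necessarily bad data; the filter here is only applied after the anomaly is confirmed.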
Theoretical Basis
Cook's Distance for observation $i$ is defined as:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \, s^2}$$

where $\hat{y}_j$ is the $j$-th fitted value from the full model, $\hat{y}_{j(i)}$ is the $j$-th fitted value when observation $i$ is deleted, $p$ is the number of parameters, and $s^2$ is the mean squared error.
This can be equivalently expressed using the leverage and studentized residuals:

$$D_i = \frac{t_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $t_i$ is the internally studentized residual:

$$t_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}$$

and $h_{ii}$ is the leverage, computed as the $i$-th diagonal element of the hat matrix:

$$H = X \left(X^\top X\right)^{-1} X^\top$$
The leverage $h_{ii}$ measures how far observation $i$ is from the centroid of the feature space. Observations with high leverage have more potential to influence the regression fit.
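The equivalence between the studentized-residual form and the leave-one-out definition can be verified numerically. The sketch below (synthetic data, illustrative names) computes both and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic design matrix with an intercept and two features.
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(scale=0.3, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                         # leverages h_ii
e = y - H @ y                          # residuals e_i
s2 = e @ e / (n - p)                   # mean squared error s^2
t = e / np.sqrt(s2 * (1 - h))          # internally studentized residuals
D_closed = (t**2 / p) * h / (1 - h)    # closed form: (t_i^2 / p) * h_ii / (1 - h_ii)

# Leave-one-out definition for comparison.
y_hat = H @ y
D_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    D_loo[i] = np.sum((y_hat - X @ coef) ** 2) / (p * s2)

print(np.allclose(D_closed, D_loo))  # True
```

The closed form is preferable in practice: it needs a single fit plus the hat-matrix diagonal, instead of $n$ separate refits.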
The influence threshold rule of thumb is:

$$D_i > \frac{4}{n}$$

Observations where $D_i > 4/n$ are flagged as potentially influential outliers.
The p-values associated with Cook's Distance are derived from the F-distribution:

$$p_i = 1 - F_{p,\,n-p}(D_i)$$

where $F_{p,\,n-p}$ is the cumulative distribution function of the F-distribution with $p$ and $n - p$ degrees of freedom.
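This p-value computation is straightforward with SciPy's F-distribution survival function, which is exactly $1 - \mathrm{CDF}$. A sketch on synthetic data (illustrative names throughout):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)

# Synthetic simple regression.
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
D = e**2 * h / (p * s2 * (1 - h) ** 2)   # Cook's Distances, closed form

# p_i = 1 - F_{p, n-p}(D_i); sf() is the survival function 1 - CDF.
p_values = f.sf(D, p, n - p)
```

Since the survival function is strictly decreasing, the observation with the largest Cook's Distance always receives the smallest p-value.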