
Principle:DistrictDataLabs Yellowbrick Influential Outlier Detection

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Regression, Model_Evaluation
Last Updated 2026-02-08 00:00 GMT

Overview

Influential outlier detection identifies individual data points that disproportionately affect the estimated parameters of a regression model, using Cook's Distance as the primary diagnostic measure.

Description

In regression analysis, not all observations contribute equally to the fitted model. Some data points, by virtue of their position in the feature space or their extreme target values, can have an outsized influence on the estimated regression coefficients. Removing or including such points can substantially change the model's predictions. Identifying these influential observations is critical for building robust and reliable regression models.

Cook's Distance is the most widely used measure of influence. It quantifies, for each observation, how much the entire set of fitted values would change if that observation were removed from the dataset. A large Cook's Distance indicates that the observation is highly influential. The measure combines two concepts: the leverage of the observation (how far it is from the center of the feature space) and the magnitude of its residual (how far the actual value is from the predicted value). A point with high leverage and a large residual will have a large Cook's Distance, making it a strong candidate for investigation.

The standard rule of thumb for flagging influential observations is $D_i > 4/n$, where $n$ is the number of observations. Points exceeding this threshold warrant closer inspection. They may represent data entry errors, measurement anomalies, or genuinely unusual cases that the model should perhaps account for differently. It is important to note that influential points are not necessarily "bad" data; they may represent real phenomena that deserve special attention.
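The $4/n$ rule can be sketched in plain NumPy (this is an illustrative sketch on synthetic data, not the Yellowbrick library's own implementation; all variable names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic simple-regression data with one planted influential point.
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, n)
x[0], y[0] = 10.0, 40.0          # far from the trend y ≈ 2x + 1

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
p = X.shape[1]                         # number of parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # residuals
mse = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage h_ii

# Cook's Distance via leverage and residuals, then the 4/n rule of thumb.
D = e**2 * h / (p * mse * (1 - h) ** 2)
flagged = np.flatnonzero(D > 4 / n)
print(flagged)   # the planted point (index 0) is among those flagged
```

Flagged indices are candidates for inspection, not automatic removal, for the reasons described above.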

Usage

Influential outlier detection via Cook's Distance should be applied when:

  • Performing initial exploratory analysis on regression data to identify potentially problematic observations
  • Diagnosing unexpected model behavior or poor generalization performance
  • Cleaning data before training a production regression model
  • Assessing the stability of regression coefficients by understanding which points drive the estimates
  • Comparing the robustness of different regression specifications

Theoretical Basis

Cook's Distance for observation i is defined as:

$$D_i = \frac{(\hat{y} - \hat{y}_{(i)})^\top (\hat{y} - \hat{y}_{(i)})}{p \,\mathrm{MSE}}$$

where $\hat{y}$ is the vector of fitted values from the full model, $\hat{y}_{(i)}$ is the vector of fitted values when observation $i$ is deleted, $p$ is the number of parameters, and $\mathrm{MSE}$ is the mean squared error.
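The definition can be computed directly by refitting with each observation deleted. A tiny worked example (hypothetical data chosen so one point dominates):

```python
import numpy as np

# Five points: four lie on y = x, the fifth (10, 20) is far off that line.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 10.0])])
y = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
mse = ((y - yhat) ** 2).sum() / (n - p)

# D_i from the definition: refit without observation i and compare
# the two fitted-value vectors over all n points.
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    diff = yhat - X @ beta_i
    D[i] = diff @ diff / (p * mse)

print(D.argmax())   # the (10, 20) point has by far the largest D_i
```

Deleting the off-trend point snaps the fit back to the line $y = x$, so its $D_i$ dwarfs the others.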

This can be equivalently expressed using the leverage and studentized residuals:

$$D_i = \frac{e_i^{*2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $e_i^*$ is the internally studentized residual:

$$e_i^* = \frac{e_i}{\sqrt{\mathrm{MSE}\,(1 - h_{ii})}}$$

and $h_{ii}$ is the leverage, computed as the $i$-th diagonal element of the hat matrix:

$$H = X (X^\top X)^{-1} X^\top$$

The leverage $h_{ii}$ measures how far observation $i$ lies from the centroid of the feature space. Observations with high leverage have more potential to influence the regression fit.
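The equivalence of the two forms can be checked numerically: the studentized-residual formula reproduces the deletion definition exactly, with no refitting required (an illustrative NumPy sketch; the data and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = X.shape[1]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
mse = e @ e / (n - p)

# Leverage: diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Studentized-residual form of Cook's Distance.
e_star = e / np.sqrt(mse * (1 - h))
D_formula = e_star**2 / p * h / (1 - h)

# Cross-check against the deletion definition (refit without each point).
D_deletion = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    diff = X @ beta - X @ beta_i
    D_deletion[i] = diff @ diff / (p * mse)

print(np.allclose(D_formula, D_deletion))   # → True
print(np.isclose(h.sum(), p))               # trace of H equals p → True
```

The second check reflects a standard property of the hat matrix: its trace equals the number of parameters, so the average leverage is $p/n$.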

The influence threshold rule of thumb is:

$$I_t = \frac{4}{n}$$

Observations where $D_i > I_t$ are flagged as potentially influential outliers.

The p-values associated with Cook's Distance are derived from the F-distribution:

$$p_i = 1 - F_{\mathrm{cdf}}(D_i;\, p,\, n - p)$$

where $F_{\mathrm{cdf}}$ is the cumulative distribution function of the $F$-distribution with $p$ and $n - p$ degrees of freedom.
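These p-values map directly onto SciPy's F-distribution, where the survival function `sf` computes $1 - F_{\mathrm{cdf}}$ (a sketch on synthetic data, assuming SciPy is available; names are illustrative):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = X.shape[1]
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
mse = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
D = e**2 * h / (p * mse * (1 - h) ** 2)

# p_i = 1 - F_cdf(D_i; p, n - p); small p_i marks influential points.
p_values = f.sf(D, p, n - p)
print(p_values.min(), p_values.max())
```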

Related Pages

Implemented By
