Principle:Scikit learn Scikit learn Robust Regression
| Knowledge Sources | |
|---|---|
| Domains | Supervised Learning, Regression |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Robust regression comprises methods that are resistant to the influence of outliers and violations of standard distributional assumptions in the data.
Description
Standard least squares regression is highly sensitive to outliers because the squared error loss amplifies the influence of extreme observations. Robust regression methods mitigate this by either down-weighting or excluding outliers during model fitting. These techniques solve the problem of obtaining reliable regression estimates when data contains contaminated observations, leverage points, or heavy-tailed error distributions. They occupy a critical niche between ordinary linear models and fully non-parametric approaches.
Usage
Use robust regression when the dataset is suspected to contain outliers that would distort ordinary least squares estimates. RANSAC is preferred when a large fraction of data may be outliers (it explicitly separates inliers from outliers). Huber regression is appropriate when outlier contamination is moderate and you want a smooth transition between squared loss and absolute loss. Theil-Sen is useful for datasets with up to approximately 29.3% contamination and is especially effective in low-dimensional settings due to its high breakdown point.
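As a rough illustration of these choices, the following sketch fits ordinary least squares and the three robust estimators to the same contaminated data. The synthetic data, the 10% contamination level, and the random seeds are assumptions made for the example, and all estimators use their default settings, which are not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=100)

# Corrupt the 10 points with the largest x so the OLS slope is pulled down.
outlier_idx = np.argsort(x)[-10:]
y[outlier_idx] -= 30.0
X = x.reshape(-1, 1)

models = {
    "OLS": LinearRegression(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Huber": HuberRegressor(),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}

slopes = {}
for name, model in models.items():
    model.fit(X, y)
    # RANSACRegressor keeps its final fitted model in `estimator_`.
    fitted = model.estimator_ if name == "RANSAC" else model
    slopes[name] = float(fitted.coef_[0])
    print(f"{name:>9}: slope = {slopes[name]:.3f} (true slope = 2.0)")
```

On data like this, the robust estimators would be expected to land closer to the true slope than ordinary least squares. The size of the gap depends on where the outliers sit and how large they are.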
Theoretical Basis
RANSAC (RANdom SAmple Consensus) is an iterative algorithm:
- Randomly select a minimal subset of data points.
- Fit a model to this subset.
- Count the data points whose residuals fall within a given threshold (the inliers).
- Repeat for a fixed number of iterations and keep the model with the most inliers.
- Refit the model using all identified inliers.
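The steps above can be sketched for a one-feature line fit using NumPy alone. This is a minimal illustration, not scikit-learn's implementation: the function name `ransac_line`, the fixed iteration count, and the demo data are all hypothetical.

```python
import numpy as np


def ransac_line(x, y, threshold, n_iter=200, seed=0):
    """Fit y = slope * x + intercept with a bare-bones RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_mask = None
    for _ in range(n_iter):
        # Step 1: draw a minimal subset (two points determine a line).
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical pair, no finite slope
        # Step 2: fit a model to the subset.
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Step 3: count the points within the residual threshold.
        mask = np.abs(y - (slope * x + intercept)) < threshold
        # Step 4: keep the candidate with the most inliers.
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
    if best_mask is None:
        raise ValueError("no valid sample was drawn")
    # Step 5: refit by least squares on all identified inliers.
    slope, intercept = np.polyfit(x[best_mask], y[best_mask], deg=1)
    return slope, intercept, best_mask


# Demo data (assumed): an exact line with every fifth point shifted upward.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x - 1.0
y[::5] += 40.0

slope, intercept, inliers = ransac_line(x, y, threshold=1.0)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, inliers={inliers.sum()}")
```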
The probability that at least one of $N$ iterations draws an outlier-free sample is:
$$P = 1 - \left(1 - (1 - e)^{s}\right)^{N}$$
where $e$ is the outlier ratio and $s$ is the sample size.
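The relationship between outlier ratio, sample size, and iteration count can be checked numerically. The sketch below assumes the standard expression 1 - (1 - (1 - e)^s)^N, with e the outlier ratio, s the sample size, and N the number of iterations. Both function names are hypothetical helpers, not part of any library.

```python
import math


def prob_clean_sample(outlier_ratio, sample_size, n_iter):
    """Probability that at least one of n_iter samples contains no outlier."""
    p_clean = (1.0 - outlier_ratio) ** sample_size
    return 1.0 - (1.0 - p_clean) ** n_iter


def iterations_needed(outlier_ratio, sample_size, confidence=0.99):
    """Smallest iteration count reaching the requested confidence (0 < confidence < 1)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p_clean = (1.0 - outlier_ratio) ** sample_size
    if p_clean <= 0.0:
        raise ValueError("no outlier-free sample is possible")
    if p_clean >= 1.0:
        return 1  # no outliers: the first sample is already clean
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))


# Example: half the data are outliers and each sample holds two points.
n = iterations_needed(outlier_ratio=0.5, sample_size=2, confidence=0.99)
print(n, prob_clean_sample(0.5, 2, n))
```

Because the chance of a clean sample shrinks geometrically with the sample size, the required number of iterations grows quickly for models that need larger minimal subsets.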
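Huber Regression uses the Huber loss function, written here for a residual $r$:
$$L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left(|r| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
This loss is quadratic for small residuals and linear for large residuals, providing a smooth compromise between $L_2$ and $L_1$ loss. The parameter $\delta$ controls the threshold between the two regimes.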
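The piecewise behaviour can be seen in a small NumPy sketch. It assumes the common textbook parameterization with a threshold called `delta`, and the function name `huber_loss` is hypothetical. scikit-learn's `HuberRegressor` exposes its threshold as `epsilon` (default 1.35) and also estimates a scale for the residuals, which this sketch leaves out.

```python
import numpy as np


def huber_loss(residual, delta=1.35):
    """Elementwise Huber loss: quadratic up to delta, linear beyond it."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)


residuals = np.array([-5.0, -1.0, 0.0, 0.5, 1.0, 5.0])
print(huber_loss(residuals, delta=1.0))  # Huber penalty per residual
print(0.5 * residuals**2)                # squared-error penalty, for comparison
```

The two branches meet at `|r| = delta` with matching value and slope, which is why the transition between regimes is smooth.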
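Theil-Sen Estimator computes the slope as the median of all pairwise slopes:
$$\hat{\beta} = \operatorname{median}\left\{ \frac{y_j - y_i}{x_j - x_i} : i < j,\ x_i \neq x_j \right\}$$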
In higher dimensions, it generalizes by computing least-squares solutions over subsamples of the data and combining them with a spatial median. In the simple one-feature case, the median operation gives it a breakdown point of approximately 29.3%, meaning it tolerates up to that fraction of arbitrary outliers.
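For the one-feature case, the median-of-slopes idea fits in a few lines of standard-library Python. The function name `theil_sen` is hypothetical, and estimating the intercept as the median of `y - slope * x` is one common convention assumed here, not something the text above specifies.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Median of pairwise slopes, with a median-based intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # skip pairs with no finite slope
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Demo data (assumed): an exact line y = 2x + 1 with two corrupted points.
x = list(range(11))
y = [2 * v + 1 for v in x]
y[3] = 100
y[7] = -50

slope, intercept = theil_sen(x, y)
print(slope, intercept)
```

Enumerating all pairs costs O(n²) slopes, which is one reason practical implementations fall back to random subsamples on larger datasets.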