Principle:Scikit learn Scikit learn Robust Regression

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Regression
Last Updated	2026-02-08 15:00 GMT

Overview

Robust regression comprises methods that are resistant to the influence of outliers and violations of standard distributional assumptions in the data.

Description

Standard least squares regression is highly sensitive to outliers because the squared error loss amplifies the influence of extreme observations. Robust regression methods mitigate this by either down-weighting or excluding outliers during model fitting. These techniques solve the problem of obtaining reliable regression estimates when data contains contaminated observations, leverage points, or heavy-tailed error distributions. They occupy a critical niche between ordinary linear models and fully non-parametric approaches.
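
The sensitivity described above can be illustrated with a small, standard-library-only sketch. The helper name `ols_fit` and the toy data are assumptions made for this illustration and do not come from scikit-learn. A single corrupted observation is enough to move the least squares slope far from its true value.

```python
from statistics import mean

def ols_fit(xs, ys):
    """Closed-form simple least squares; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Ten points lying exactly on y = 2x + 1 (hypothetical toy data).
xs = list(range(10))
clean_ys = [2.0 * x + 1.0 for x in xs]

# Corrupt one observation: the value at x = 9 becomes 100 instead of 19.
contaminated_ys = clean_ys.copy()
contaminated_ys[-1] = 100.0

clean_slope, clean_intercept = ols_fit(xs, clean_ys)
bad_slope, bad_intercept = ols_fit(xs, contaminated_ys)
print(f"clean slope: {clean_slope:.2f}, contaminated slope: {bad_slope:.2f}")
```

In this toy setup the outlier sits at the edge of the x range, so it acts as a leverage point, and the fitted slope rises from 2 to roughly 6.4.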

Usage

Use robust regression when the dataset is suspected to contain outliers that would distort ordinary least squares estimates. RANSAC is preferred when a large fraction of the data may be outliers, because it explicitly separates inliers from outliers. Huber regression is appropriate when outlier contamination is moderate and you want a smooth transition between squared loss and absolute loss. Theil-Sen tolerates up to approximately 29.3% contamination in the simple linear regression case and is most effective in low-dimensional settings; its robustness decreases as the number of features grows.

Theoretical Basis

RANSAC (RANdom SAmple Consensus) is an iterative algorithm:

1. Randomly select a minimal subset of data points.
2. Fit a model to this subset.
3. Count the number of data points (inliers) whose residual is within a threshold $ε$.
4. Repeat for a fixed number of iterations and keep the model with the most inliers.
5. Refit the model using all identified inliers.
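
The steps above can be sketched for the special case of fitting a line to 2-D points. This is a minimal illustration using only the Python standard library, not scikit-learn's `RANSACRegressor`. The function names `fit_line` and `ransac_line`, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar

def ransac_line(points, threshold, n_iterations=100, seed=0):
    """Return ((slope, intercept), inliers) for the best consensus set."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iterations):
        # Step 1: a line needs two points, so the minimal subset has size 2.
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # vertical pair: slope undefined, draw again
        # Step 2: fit a candidate model to the minimal subset.
        slope, intercept = fit_line(sample)
        # Step 3: collect points whose residual is within the threshold.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        # Step 4: keep the candidate with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus set found")
    # Step 5: refit on all inliers of the best candidate.
    return fit_line(best_inliers), best_inliers

# Hypothetical data: 20 points on y = 2x + 1, five of them grossly corrupted.
points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
for index, offset in zip((3, 7, 11, 15, 19), (30.0, -40.0, 55.0, -25.0, 60.0)):
    x, y = points[index]
    points[index] = (x, y + offset)

(slope, intercept), inliers = ransac_line(points, threshold=0.5)
print(slope, intercept, len(inliers))
```

The sketch assumes the inlier set contains at least two distinct x values. A production implementation would also need stopping criteria and validation of degenerate samples.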

The probability of drawing at least one outlier-free sample in $k$ iterations is:

$1 - (1 - (1 - e)^{n})^{k}$

where $e$ is the outlier ratio (the fraction of data points that are outliers) and $n$ is the number of points in each minimal sample.
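
Setting this probability to a target confidence $p$ and solving for $k$ gives $k = \log(1 - p) / \log(1 - (1 - e)^{n})$, rounded up. The sketch below evaluates both directions. The function names `success_probability` and `iterations_needed` are hypothetical helpers for this illustration, not part of any library API.

```python
import math

def success_probability(outlier_ratio, sample_size, n_iterations):
    """Probability that at least one of n_iterations samples is outlier-free."""
    p_clean = (1.0 - outlier_ratio) ** sample_size  # one sample is all inliers
    return 1.0 - (1.0 - p_clean) ** n_iterations

def iterations_needed(outlier_ratio, sample_size, confidence=0.99):
    """Smallest k for which success_probability reaches the confidence level."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p_clean = (1.0 - outlier_ratio) ** sample_size
    if p_clean >= 1.0:
        return 1  # no outliers: the first sample is already clean
    if p_clean <= 0.0:
        raise ValueError("every sample contains an outlier")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))

# Example: half the data are outliers and each sample has two points.
k = iterations_needed(outlier_ratio=0.5, sample_size=2, confidence=0.99)
print(k, success_probability(0.5, 2, k))
```

With $e = 0.5$ and $n = 2$, each sample is clean with probability 0.25, so 17 iterations are needed to reach 99% confidence. The required count grows quickly as $n$ or $e$ increases.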

Huber Regression uses the Huber loss function:

$L_{δ}(r) = \begin{cases} \frac{1}{2} r^{2} & \text{if } |r| \leq δ \\ δ |r| - \frac{1}{2} δ^{2} & \text{if } |r| > δ \end{cases}$

This loss is quadratic for small residuals and linear for large residuals, providing a smooth compromise between $ℓ_{2}$ and $ℓ_{1}$ loss. The parameter $δ$ controls the threshold between the two regimes.
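
As a small illustration, the piecewise definition translates directly into code. The function name `huber_loss` and the default `delta=1.0` are choices made for this sketch. scikit-learn's `HuberRegressor` exposes its threshold through the `epsilon` parameter and also estimates a scale for the residuals, which this sketch does not attempt to reproduce.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss of a single residual for threshold delta > 0."""
    abs_r = abs(residual)
    if abs_r <= delta:
        # Quadratic regime: identical to half the squared error.
        return 0.5 * residual ** 2
    # Linear regime: grows like the absolute error, offset so that the
    # two pieces meet at |r| = delta.
    return delta * abs_r - 0.5 * delta ** 2

for r in (0.5, 1.0, 3.0, 10.0):
    print(r, huber_loss(r), 0.5 * r ** 2)
```

For a residual of 10 with $δ = 1$, the Huber loss is 9.5 whereas half the squared error is 50, which shows how the linear regime limits the contribution of large residuals.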

Theil-Sen Estimator computes the slope as the median of all pairwise slopes:

$\hat{β} = \operatorname{median} \left\{ \frac{y_{j} - y_{i}}{x_{j} - x_{i}} : i < j,\ x_{i} \neq x_{j} \right\}$

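In higher dimensions, the estimator generalizes by computing least squares solutions on subsamples of the data and combining them with a multivariate (spatial) median. In the simple linear regression case, the median of pairwise slopes gives a breakdown point of approximately 29.3% ($1 - 1/\sqrt{2}$), meaning the estimate tolerates up to that fraction of arbitrary outliers.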
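
The one-dimensional formula can be sketched with the standard library alone. The function name `theil_sen` is hypothetical. The intercept rule used here (the median of $y_{i} - \hat{β} x_{i}$) is a common convention that the text above does not specify, so treat it as an assumption. This is not scikit-learn's `TheilSenRegressor`.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Median-of-pairwise-slopes fit; returns (slope, intercept)."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i, j in combinations(range(len(xs)), 2)  # all pairs with i < j
        if xs[i] != xs[j]                            # skip undefined slopes
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    # Assumed convention: intercept is the median of the offsets y - slope * x.
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Hypothetical data: y = 2x + 1 with both endpoints grossly corrupted.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[0], ys[9] = -50.0, 100.0

slope, intercept = theil_sen(xs, ys)
print(slope, intercept)
```

Here 28 of the 45 pairwise slopes involve only clean points and equal 2, so the median ignores the 17 slopes affected by the two corrupted observations. Enumerating all pairs costs $O(n^{2})$, which is one reason practical implementations subsample on larger datasets.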
