Principle:Scikit learn Scikit learn Robust Regression

From Leeroopedia


Knowledge Sources
Domains Supervised Learning, Regression
Last Updated 2026-02-08 15:00 GMT

Overview

Robust regression comprises methods that are resistant to the influence of outliers and violations of standard distributional assumptions in the data.

Description

Standard least squares regression is highly sensitive to outliers because the squared error loss amplifies the influence of extreme observations. Robust regression methods mitigate this by either down-weighting or excluding outliers during model fitting. These techniques solve the problem of obtaining reliable regression estimates when data contains contaminated observations, leverage points, or heavy-tailed error distributions. They occupy a critical niche between ordinary linear models and fully non-parametric approaches.

Usage

Use robust regression when the dataset is suspected to contain outliers that would distort ordinary least squares estimates. RANSAC is preferred when a large fraction of the data may be outliers, because it explicitly separates inliers from outliers. Huber regression is appropriate when outlier contamination is moderate and you want a smooth transition between squared loss and absolute loss. Theil-Sen tolerates up to approximately 29.3% contamination (its breakdown point in simple linear regression) and is most effective in low-dimensional settings, since its robustness and computational efficiency degrade as the number of features grows.
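As a rough illustration of these trade-offs, the sketch below fits ordinary least squares and the three robust estimators to the same contaminated data using scikit-learn's `LinearRegression`, `RANSACRegressor`, `HuberRegressor`, and `TheilSenRegressor`. The synthetic dataset, the 10% contamination level, and the random seeds are assumptions chosen for illustration, not values from any benchmark.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y = 3x + 1 with small Gaussian noise.
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=200)

# Corrupt the 20 points with the largest x (10% of the data) so that
# they drag the ordinary least squares slope downward.
outlier_idx = np.argsort(X.ravel())[-20:]
y[outlier_idx] = -20.0

estimators = {
    "OLS": LinearRegression(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Huber": HuberRegressor(),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}

slopes = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)
    # RANSACRegressor wraps a base estimator; the final fit lives in `estimator_`.
    fitted = estimator.estimator_ if name == "RANSAC" else estimator
    slopes[name] = float(fitted.coef_[0])

for name, slope in slopes.items():
    print(f"{name:10s} slope = {slope:.3f} (true slope is 3.0)")
```

On data like this, the robust estimators should recover a slope much closer to 3 than ordinary least squares does. The exact values depend on the seed and the library version.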

Theoretical Basis

RANSAC (RANdom SAmple Consensus) is an iterative algorithm:

  1. Randomly select a minimal subset of data points.
  2. Fit a model to this subset.
  3. Count the number of data points (inliers) within a residual threshold ε.
  4. Repeat for a fixed number of iterations and keep the model with the most inliers.
  5. Refit the model using all identified inliers.
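The five steps above can be sketched in plain Python for the special case of fitting a line, where the minimal subset is two points. This is a simplified illustration rather than scikit-learn's implementation; the helper names `fit_line` and `ransac_line`, the fixed iteration count, and the toy data are all hypothetical.

```python
import random


def fit_line(points):
    """Least-squares line y = a*x + b through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def ransac_line(points, threshold, n_iterations=200, seed=0):
    """Return ((slope, intercept), inliers) for the largest consensus set found."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iterations):
        # Step 1: draw a minimal subset (two points define a line).
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # identical x values give no finite slope; skip this draw
        # Step 2: fit a model to the subset.
        a, b = fit_line(sample)
        # Step 3: collect the points whose residual is within the threshold.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        # Step 4: keep the largest consensus set seen so far.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Step 5: refit on all inliers of the best model.
    return fit_line(best_inliers), best_inliers


# Toy data (assumption): 20 points exactly on y = 2x + 1 plus 5 gross outliers.
clean = [(float(x), 2.0 * x + 1.0) for x in range(20)]
outliers = [(3.0, 40.0), (7.0, -30.0), (12.0, 60.0), (15.0, -10.0), (18.0, 90.0)]
(slope, intercept), inliers = ransac_line(clean + outliers, threshold=0.5)
print(slope, intercept, len(inliers))
```

Production implementations typically add refinements this sketch omits, such as early stopping once a sufficiently large consensus set is found.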

The probability of drawing at least one outlier-free sample in k iterations is:

$$1 - \left(1 - (1 - e)^{n}\right)^{k}$$

where e is the outlier ratio (the fraction of data points that are outliers) and n is the number of points drawn in each minimal sample.
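Requiring this probability to be at least a target value p and solving for k gives k ≥ log(1 − p) / log(1 − (1 − e)^n). The sketch below evaluates both the probability and the resulting iteration count; the function names and the default target of 0.99 are illustrative assumptions.

```python
import math


def success_probability(outlier_ratio, sample_size, n_iterations):
    """Probability that at least one of the drawn samples is outlier-free."""
    p_clean_sample = (1.0 - outlier_ratio) ** sample_size
    return 1.0 - (1.0 - p_clean_sample) ** n_iterations


def iterations_needed(outlier_ratio, sample_size, target_probability=0.99):
    """Smallest k for which the success probability reaches the target."""
    p_clean_sample = (1.0 - outlier_ratio) ** sample_size
    if p_clean_sample >= 1.0:
        return 1  # no outliers: a single draw is always clean
    if p_clean_sample <= 0.0:
        raise ValueError("every sample contains an outlier; no k suffices")
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - p_clean_sample)
    )


# Example: half of the data are outliers and each sample has two points.
k = iterations_needed(outlier_ratio=0.5, sample_size=2)
print(k, success_probability(0.5, 2, k))
```

With e = 0.5 and n = 2, each draw is clean with probability 0.25, so 17 iterations are needed to reach a 99% chance of at least one clean sample. The required count grows quickly as n or e increases.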

Huber Regression uses the Huber loss function:

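$$L_\delta(r) = \begin{cases} \tfrac{1}{2} r^{2} & \text{if } |r| \le \delta \\ \delta |r| - \tfrac{1}{2} \delta^{2} & \text{if } |r| > \delta \end{cases}$$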

This loss is quadratic for small residuals and linear for large residuals, providing a smooth compromise between $\ell_2$ and $\ell_1$ loss. The parameter δ controls the threshold between the two regimes.
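The piecewise definition translates directly into code. The following is a minimal sketch of the loss for a single residual r, not scikit-learn's internal implementation; note that scikit-learn's `HuberRegressor` exposes the threshold as `epsilon` (default 1.35) and applies it to residuals divided by an estimated scale.

```python
def huber_loss(residual, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual ** 2
    return delta * abs_r - 0.5 * delta ** 2


# A small residual is penalized quadratically, a large one only linearly.
small = huber_loss(0.5, delta=1.0)   # 0.5 * 0.5**2
large = huber_loss(3.0, delta=1.0)   # 1.0 * 3.0 - 0.5 * 1.0**2
squared = 0.5 * 3.0 ** 2             # what squared loss would charge for r = 3
print(small, large, squared)
```

The two branches agree at |r| = δ, where both equal δ²/2, which is why the loss has no jump at the threshold.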

Theil-Sen Estimator computes the slope as the median of all pairwise slopes:

$$\hat{\beta} = \operatorname{median}\left\{ \frac{y_j - y_i}{x_j - x_i} : i < j,\ x_i \ne x_j \right\}$$

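In higher dimensions, it generalizes by fitting least-squares solutions on subsets of the samples and combining them with a multivariate (spatial) median, as in scikit-learn's implementation. In simple linear regression, the median over pairwise slopes gives the estimator a breakdown point of 1 − 1/√2 ≈ 29.3%, meaning it tolerates up to that fraction of arbitrary outliers.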
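For a single feature, the pairwise-slope definition can be written in a few lines of plain Python. This is a minimal sketch with quadratic cost in the number of points. The intercept rule used here (the median of y − β̂x) is one common convention and an assumption not stated above, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(xs, ys):
    """Median of the slopes over all pairs of points with distinct x values."""
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2)
        if xi != xj
    ]
    return median(slopes)


def theil_sen_intercept(xs, ys, slope):
    """Assumed convention: median of the per-point offsets y - slope * x."""
    return median(y - slope * x for x, y in zip(xs, ys))


# Toy data (assumption): points on y = 2x + 1 with one gross outlier at the end.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[-1] = 100.0

beta = theil_sen_slope(xs, ys)
alpha = theil_sen_intercept(xs, ys, beta)
print(beta, alpha)
```

Only 9 of the 45 pairwise slopes involve the corrupted point, so the median stays at the true slope of 2.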
