Implementation:Scikit learn Scikit learn TheilSenRegressor
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Robust Regression |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool provided by scikit-learn for robust multivariate regression using the Theil-Sen estimator, a generalization of the median-of-pairwise-slopes approach to multiple features.
Description
TheilSenRegressor implements the Theil-Sen estimator, a robust multivariate regression model. The algorithm computes least-squares solutions on subsets of n_subsamples samples, then takes the spatial median (L1 median) of all solutions as the final estimate. This gives the estimator a high breakdown point: about 29.3% for simple linear regression, meaning it can tolerate roughly that fraction of arbitrarily corrupted points. The breakdown point decreases as the number of features grows. Computational cost is controlled by the max_subpopulation parameter, which caps the number of subsets considered: when the number of possible subsets exceeds the cap, a random subpopulation of subsets is used instead.
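The two steps described above (least squares on subsets, then a spatial median of the solutions) can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions, not scikit-learn's implementation: it enumerates every subset instead of sampling, runs serially, and uses a basic Weiszfeld iteration. The helper names `spatial_median` and `theil_sen_sketch` are hypothetical.

```python
from itertools import combinations

import numpy as np


def spatial_median(points, max_iter=500, tol=1e-8):
    """Approximate the L1 (spatial) median of the rows of `points`
    with a plain Weiszfeld iteration (illustrative helper)."""
    median = points.mean(axis=0)
    for _ in range(max_iter):
        distances = np.linalg.norm(points - median, axis=1)
        # Guard against division by zero when the iterate hits a data point.
        weights = 1.0 / np.maximum(distances, 1e-12)
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < tol:
            return new_median
        median = new_median
    return median


def theil_sen_sketch(X, y, n_subsamples):
    """Fit least squares on every subset of size `n_subsamples`, then
    return the spatial median of the solutions as [intercept, coefs...]."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    solutions = []
    for subset in combinations(range(len(X)), n_subsamples):
        rows = list(subset)
        beta, *_ = np.linalg.lstsq(X1[rows], y[rows], rcond=None)
        solutions.append(beta)
    return spatial_median(np.asarray(solutions))


# Toy data: y = 2 + 3x with small noise, last three targets corrupted.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 3.0 * x + rng.normal(scale=0.05, size=x.size)
y[-3:] = -40.0

intercept, slope = theil_sen_sketch(x.reshape(-1, 1), y, n_subsamples=2)
print("intercept:", intercept, "slope:", slope)
```

Enumerating all subsets is only feasible for small problems, which is why a cap such as max_subpopulation matters in practice.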
Usage
Use TheilSenRegressor when you need a regression model that is highly robust to outliers, for example when up to roughly 29% of the data points may be corrupted in a low-dimensional problem. It tolerates a higher fraction of outliers than HuberRegressor, at a higher computational cost. It is commonly used in scientific data analysis where measurement errors or anomalous readings are expected.
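To make the comparison concrete, the following sketch fits ordinary least squares, HuberRegressor, and TheilSenRegressor to the same one-feature dataset in which 20% of the targets are corrupted at the high end of the feature range. The data, the true slope of 3, and the outlier value are assumptions chosen for illustration; results on real data will differ.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, TheilSenRegressor

# Synthetic line y = 3x + 2 with Gaussian noise (illustrative values).
rng = np.random.RandomState(0)
X = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)

# Corrupt the 20 samples with the largest feature values.
y[-20:] = -50.0

models = {
    "OLS": LinearRegression(),
    "Huber": HuberRegressor(),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}

# Fit each model and collect the estimated slope.
slopes = {name: model.fit(X, y).coef_[0] for name, model in models.items()}

for name, slope in slopes.items():
    print(f"{name}: slope = {slope:.2f} (true slope = 3.00)")
```

The slope estimates show how far each model is pulled away from the true value by the corrupted points.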
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/linear_model/_theil_sen.py
Signature
class TheilSenRegressor(RegressorMixin, LinearModel):
    def __init__(
        self,
        *,
        fit_intercept=True,
        max_subpopulation=1e4,
        n_subsamples=None,
        max_iter=300,
        tol=1e-3,
        random_state=None,
        n_jobs=None,
        verbose=False,
    ):
Import
from sklearn.linear_model import TheilSenRegressor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fit_intercept | bool | No | Whether to calculate the intercept (default=True) |
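| max_subpopulation | int | No | Maximum number of subsets considered; if 'n choose k' (n = n_samples, k = n_subsamples) exceeds this value, a random subpopulation of this size is used instead (default=1e4) |
| n_subsamples | int | No | Number of samples per subset; between n_features (plus 1 if fit_intercept=True) and n_samples. Lower values give a higher breakdown point but lower efficiency (default=None, which picks the minimum for maximal robustness) |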
| max_iter | int | No | Maximum iterations for spatial median calculation (default=300) |
| tol | float | No | Tolerance for spatial median convergence (default=1e-3) |
| random_state | int, RandomState instance or None | No | Seed or generator controlling the random subset selection, for reproducibility (default=None) |
| n_jobs | int | No | Number of CPUs for parallel computation (default=None) |
| verbose | bool | No | Verbose mode during fitting (default=False) |
Outputs
| Name | Type | Description |
|---|---|---|
| coef_ | ndarray of shape (n_features,) | Estimated coefficients of the regression model |
| intercept_ | float | Estimated intercept of the regression model |
| breakdown_ | float | Approximate breakdown point of the estimator |
| n_iter_ | int | Number of iterations for spatial median computation |
| n_subpopulation_ | int | Number of subsets actually considered out of the 'n choose k' possible combinations |
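The interplay between the n_subsamples and max_subpopulation inputs and the breakdown_ and n_subpopulation_ attributes can be checked directly on a fitted model. The sketch below uses assumed toy data and arbitrary parameter values (n_subsamples=10, max_subpopulation=500) purely for illustration.

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

# Small one-feature dataset (illustrative values).
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 10.0, size=(50, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=50)

# Default n_subsamples=None picks the minimal subset size.
minimal = TheilSenRegressor(random_state=0).fit(X, y)

# Larger subsets, with the number of subsets capped at 500.
larger = TheilSenRegressor(
    n_subsamples=10, max_subpopulation=500, random_state=0
).fit(X, y)

for label, model in [("minimal subsets", minimal), ("n_subsamples=10", larger)]:
    print(
        f"{label}: breakdown_={model.breakdown_:.3f}, "
        f"n_subpopulation_={model.n_subpopulation_}, n_iter_={model.n_iter_}"
    )
```

Larger subsets trade robustness for efficiency, so the second model should report a lower breakdown_ than the first.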
Usage Examples
Basic Usage
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import TheilSenRegressor

# Synthetic regression problem: 100 samples, 5 features
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Replace the first 10 targets (10% of the data) with uniform outliers
y[:10] = np.random.RandomState(42).uniform(-500, 500, size=10)

model = TheilSenRegressor(random_state=42)
model.fit(X, y)

print("Breakdown point:", model.breakdown_)
print("Coefficients:", model.coef_)