Implementation:Cleanlab Cleanlab Regression CleanLearning

Knowledge Sources	Cleanlab
Domains	Machine Learning, Data Quality, Regression, Noise-Robust Learning
Last Updated	2026-02-09 00:00 GMT

Overview

The CleanLearning class wraps any scikit-learn-compatible regression model to enable robust training on datasets with noisy (corrupted) target values by automatically detecting and removing examples with label issues.

Description

CleanLearning in cleanlab.regression.learn implements a complete pipeline for noise-robust regression. It wraps any sklearn-compatible regression estimator and follows a three-phase approach:

Label Issue Detection (via find_label_issues): Uses cross-validation to obtain out-of-fold predictions, computes residuals, and performs a two-stage grid search (coarse then fine) to find the optimal fraction k of data to exclude, maximizing the R-squared score. Uncertainty estimation combines epistemic uncertainty (bootstrapped variance of predictions) and aleatoric uncertainty (variance of residual predictions). Label quality scores are computed as exp(-|adjusted_residual| / median), where adjusted residuals are normalized by total uncertainty.

Robust Training (via fit): Identifies label issues (or accepts pre-computed ones), prunes flagged examples from the training set, and trains the final model on the remaining clean data.

Inference (via predict and score): Delegates directly to the wrapped model for predictions and evaluation.
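
The coarse-then-fine search in the detection phase can be sketched with NumPy alone. This is an illustrative sketch, not cleanlab's code: the least-squares model, the helper names (fit_linear, out_of_fold_predictions, search_k), ranking examples by raw out-of-fold residual, and placing the fine grid between the neighbours of the best coarse value are all assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def out_of_fold_predictions(X, y, n_folds=5, seed=0):
    """Predict every example with a model that did not see it in training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    preds = np.empty(len(X))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        preds[held_out] = predict_linear(fit_linear(X[train], y[train]), X[held_out])
    return preds

def r2(y, preds):
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def search_k(X, y, coarse=(0.01, 0.05, 0.1, 0.15, 0.2), fine_size=3):
    """Hypothetical two-stage search for the fraction k of examples to drop."""
    coarse = sorted(coarse)
    residuals = np.abs(y - out_of_fold_predictions(X, y))
    order = np.argsort(-residuals)  # largest residual first

    def r2_after_dropping(k):
        keep = order[int(round(k * len(y))):]
        return r2(y[keep], out_of_fold_predictions(X[keep], y[keep]))

    # Stage 1: coarse grid.
    best = max(coarse, key=r2_after_dropping)
    # Stage 2: fine grid strictly between the neighbours of the coarse winner.
    i = coarse.index(best)
    lo, hi = coarse[max(i - 1, 0)], coarse[min(i + 1, len(coarse) - 1)]
    fine = np.linspace(lo, hi, fine_size + 2)[1:-1]
    return float(max([best, *fine], key=r2_after_dropping))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
y[:20] += rng.normal(scale=10.0, size=20)  # corrupt 10% of the targets
k = search_k(X, y)
print(f"fraction to exclude: {k:.3f}")
```

Scoring each candidate k by out-of-fold R-squared on the kept examples means the search rewards fractions that leave a more predictable dataset behind, which is the criterion the description names.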

The class also exposes get_epistemic_uncertainty and get_aleatoric_uncertainty as standalone methods, and provides save_space to free memory by deleting stored label issue DataFrames.
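
The two ideas behind the scores, bootstrapped prediction variance and residuals normalized by uncertainty, can be sketched as follows. This is not the code behind get_epistemic_uncertainty: the function names, the least-squares model, the eps guard, and the reading of "median" as the median of the absolute adjusted residuals are assumptions made for illustration.

```python
import numpy as np

def epistemic_variance(X, y, X_eval, n_boot=5, seed=0):
    """Variance of predictions across models fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([X, np.ones(len(X))])
    A_eval = np.column_stack([X_eval, np.ones(len(X_eval))])
    preds = np.empty((n_boot, len(X_eval)))
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        preds[b] = A_eval @ coef
    return preds.var(axis=0)

def label_quality(residual, uncertainty, eps=1e-12):
    """exp(-|adjusted residual| / median), with residuals scaled by uncertainty.

    Assumption: the median is taken over the absolute adjusted residuals.
    """
    adjusted = np.abs(residual) / (np.asarray(uncertainty, dtype=float) + eps)
    return np.exp(-adjusted / (np.median(adjusted) + eps))

# Three examples with equal uncertainty; the third has a much larger residual.
residual = np.array([0.1, -0.2, 5.0])
scores = label_quality(residual, uncertainty=np.ones(3))
print(scores)
```

Under this reading, an example whose adjusted residual equals the median scores about exp(-1), and scores fall toward 0 as the residual grows relative to its uncertainty.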

Usage

Import CleanLearning when you have a regression task with potentially noisy target values and want to: (1) automatically detect which examples have corrupted y-values, (2) train a model that is robust to label noise by excluding detected issues, or (3) estimate per-example label quality scores. The wrapped model must implement fit(), predict(), and optionally score().
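
As a sketch of what "sklearn-compatible" means here, the class below is a hypothetical closed-form ridge regressor (ClosedFormRidge is not part of cleanlab or scikit-learn). Inheriting from BaseEstimator supplies get_params and set_params so the estimator can be cloned, and RegressorMixin supplies an R-squared score(); that such a class is sufficient for CleanLearning is an assumption based on the requirements stated above.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class ClosedFormRidge(RegressorMixin, BaseEstimator):
    """Hypothetical ridge regressor solved via the normal equations."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # stored unmodified so get_params/clone work

    def fit(self, X, y, sample_weight=None):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        w = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        A = np.column_stack([X, np.ones(len(X))])  # last column is the intercept
        penalty = self.alpha * np.eye(A.shape[1])
        penalty[-1, -1] = 0.0  # leave the intercept unpenalized
        Aw = A * w[:, None]
        # Weighted normal equations: (A^T W A + penalty) coef = A^T W y
        self.coef_ = np.linalg.solve(A.T @ Aw + penalty, Aw.T @ y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0  # noise-free linear target
model = ClosedFormRidge(alpha=0.0).fit(X_demo, y_demo)
fresh_copy = clone(model)  # unfitted copy with the same hyperparameters
```

An instance could then be passed as the model argument in place of LinearRegression.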

Code Reference

Source Location

Repository: Cleanlab
File: cleanlab/regression/learn.py
Lines: 1-872

Signature

class CleanLearning(BaseEstimator):
    def __init__(
        self,
        model: Optional[BaseEstimator] = None,
        *,
        cv_n_folds: int = 5,
        n_boot: int = 5,
        include_aleatoric_uncertainty: bool = True,
        verbose: bool = False,
        seed: Optional[bool] = None,
    )

    def fit(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        label_issues: Optional[Union[pd.DataFrame, np.ndarray]] = None,
        sample_weight: Optional[np.ndarray] = None,
        find_label_issues_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
        model_final_kwargs: Optional[dict] = None,
    ) -> BaseEstimator

    def predict(self, X: np.ndarray, *args, **kwargs) -> np.ndarray

    def score(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        sample_weight: Optional[np.ndarray] = None,
    ) -> float

    def find_label_issues(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        uncertainty: Optional[Union[np.ndarray, float]] = None,
        coarse_search_range: list = [0.01, 0.05, 0.1, 0.15, 0.2],
        fine_search_size: int = 3,
        save_space: bool = False,
        model_kwargs: Optional[dict] = None,
    ) -> pd.DataFrame

Import

from cleanlab.regression.learn import CleanLearning

I/O Contract

Constructor Inputs

Name	Type	Required	Description
model	BaseEstimator	No	Any sklearn-compatible regression model with fit() and predict(). Defaults to LinearRegression.
cv_n_folds	int	No	Number of cross-validation folds for out-of-sample predictions (default 5). Must be at least 2.
n_boot	int	No	Number of bootstrap resampling rounds for epistemic uncertainty estimation (default 5). Set to 0 to skip.
include_aleatoric_uncertainty	bool	No	Whether to estimate aleatoric uncertainty during issue detection (default True).
verbose	bool	No	Controls output verbosity (default False).
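seed	int	No	Random seed for reproducibility (default None). The signature above annotates this parameter as Optional[bool]; an integer seed is the intended value.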

fit() Inputs

Name	Type	Required	Description
X	np.ndarray or pd.DataFrame	Yes	Feature matrix of shape (N, ...).
y	LabelLike	Yes	Target values of shape (N,), some of which may be corrupted.
label_issues	pd.DataFrame or np.ndarray	No	Pre-computed label issues. If DataFrame, must contain 'is_label_issue' column. If array, must be boolean mask.
sample_weight	np.ndarray	No	Per-example weights of shape (N,) for the loss function.
find_label_issues_kwargs	dict	No	Extra keyword arguments for find_label_issues.
model_kwargs	dict	No	Keyword arguments passed to model.fit() in all calls.
model_final_kwargs	dict	No	Extra keyword arguments for the final model.fit() on clean data only.

fit() Output

Name	Type	Description
self	CleanLearning	The fitted estimator. Also stores self.label_issues_df (pd.DataFrame) and self.label_issues_mask (np.ndarray).
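
The pruning step that sits between the issue mask and the final fit can be sketched as below. The helper prune_flagged is hypothetical and only illustrates the convention that True marks an example to drop; for a DataFrame input, the mask would come from its 'is_label_issue' column (for example via .to_numpy()).

```python
import numpy as np

def prune_flagged(X, y, issue_mask, sample_weight=None):
    """Drop rows where issue_mask is True and slice sample_weight to match."""
    y = np.asarray(y)
    issue_mask = np.asarray(issue_mask)
    if issue_mask.dtype != bool or issue_mask.shape != (len(y),):
        raise ValueError("issue_mask must be a boolean array of shape (N,)")
    keep = ~issue_mask
    weights = None if sample_weight is None else np.asarray(sample_weight)[keep]
    return np.asarray(X)[keep], y[keep], weights

X_small = np.arange(10, dtype=float).reshape(5, 2)
y_small = np.array([0.0, 1.0, 99.0, 3.0, 4.0])      # third target looks corrupted
mask = np.array([False, False, True, False, False])  # True = label issue
X_clean, y_clean, w_clean = prune_flagged(X_small, y_small, mask, sample_weight=np.ones(5))
```

Slicing the weights with the same mask keeps them aligned with the rows that remain for the final fit.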

find_label_issues() Output

Name	Type	Description
label_issues_df	pd.DataFrame	DataFrame with columns: is_label_issue (bool), label_quality (float, 0-1), given_label (original y), predicted_label (model prediction).

predict() Output

Name	Type	Description
predictions	np.ndarray	Predicted target values from the cleaned model.

score() Output

Name	Type	Description
score	float	Model performance on the given data: the wrapped model's own score method if it defines one, otherwise R-squared.
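
A minimal sketch of that fallback, assuming R-squared means the standard coefficient of determination. The function score_with_fallback, the signature check for sample_weight, and the PredictOnly class are hypothetical and written only for this example.

```python
import inspect
import numpy as np

def r2(y, preds, sample_weight=None):
    """Coefficient of determination, optionally weighted per example."""
    y, preds = np.asarray(y, dtype=float), np.asarray(preds, dtype=float)
    w = np.ones_like(y) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    y_bar = np.average(y, weights=w)
    return 1.0 - np.sum(w * (y - preds) ** 2) / np.sum(w * (y - y_bar) ** 2)

def score_with_fallback(model, X, y, sample_weight=None):
    """Prefer model.score(); fall back to R-squared of model.predict()."""
    if hasattr(model, "score"):
        # Forward sample_weight only if the model's score() accepts it.
        if "sample_weight" in inspect.signature(model.score).parameters:
            return model.score(X, y, sample_weight=sample_weight)
        return model.score(X, y)
    return r2(y, model.predict(X), sample_weight)

class PredictOnly:
    """Toy model with predict() but no score(): always predicts a constant."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

X_toy = np.zeros((3, 1))
y_toy = np.array([1.0, 2.0, 3.0])
baseline = score_with_fallback(PredictOnly(2.0), X_toy, y_toy)  # predicts the mean of y_toy
```

A constant predictor at the target mean is the usual zero point for R-squared, which makes it a convenient check that the fallback path is being taken.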

Usage Examples

Basic Usage

from cleanlab.regression.learn import CleanLearning
from sklearn.linear_model import LinearRegression
import numpy as np

# Create noisy data
np.random.seed(42)
X = np.random.randn(200, 5)
y_true = X @ np.array([1, 2, 0, -1, 0.5])
y_noisy = y_true + np.random.randn(200) * 0.5
# Corrupt 10% of labels
corrupt_idx = np.random.choice(200, 20, replace=False)
y_noisy[corrupt_idx] += np.random.randn(20) * 10

# Train with CleanLearning
cl = CleanLearning(model=LinearRegression())
cl.fit(X, y_noisy)

# Predict as if trained on clean data
predictions = cl.predict(X)

Find Label Issues Only

cl = CleanLearning(model=LinearRegression(), verbose=True)
label_issues_df = cl.find_label_issues(X, y_noisy)

# Inspect flagged issues
flagged = label_issues_df[label_issues_df["is_label_issue"]]
print(f"Found {len(flagged)} label issues")
print(flagged.sort_values("label_quality").head(10))

Custom Model with Uncertainty

from sklearn.ensemble import GradientBoostingRegressor

cl = CleanLearning(
    model=GradientBoostingRegressor(),
    cv_n_folds=10,
    n_boot=10,
    include_aleatoric_uncertainty=True,
)
cl.fit(X, y_noisy)
print("Score on noisy data:", cl.score(X, y_noisy))

Related Pages

Principle:Cleanlab_Cleanlab_Regression_Noise_Robust_Training
