Implementation:Cleanlab Cleanlab Regression CleanLearning

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Data Quality, Regression, Noise-Robust Learning
Last Updated: 2026-02-09 00:00 GMT

Overview

The CleanLearning class wraps any scikit-learn-compatible regression model to enable robust training on datasets with noisy (corrupted) target values by automatically detecting and removing examples with label issues.

Description

CleanLearning in cleanlab.regression.learn implements a complete pipeline for noise-robust regression. It wraps any sklearn-compatible regression estimator and follows a three-phase approach:

  1. Label Issue Detection (via find_label_issues): Uses cross-validation to obtain out-of-fold predictions, computes residuals, and performs a two-stage grid search (coarse then fine) to find the optimal fraction k of data to exclude, maximizing the R-squared score. Uncertainty estimation combines epistemic uncertainty (bootstrapped variance of predictions) and aleatoric uncertainty (variance of residual predictions). Label quality scores are computed as exp(-|adjusted_residual| / median), where adjusted residuals are normalized by total uncertainty.
  2. Robust Training (via fit): Identifies label issues (or accepts pre-computed ones), prunes flagged examples from the training set, and trains the final model on the remaining clean data.
  3. Inference (via predict and score): Delegates directly to the wrapped model for predictions and evaluation.
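
The first two phases can be illustrated with a minimal sketch built only on scikit-learn. It is not cleanlab's implementation: the helper names (out_of_fold_residuals, best_exclusion_fraction, quality_scores) are hypothetical, only the coarse stage of the search is shown, the uncertainty is assumed to be a constant, and the median is assumed to be taken over the absolute adjusted residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict


def out_of_fold_residuals(model, X, y, cv=5):
    """Residuals from predictions on folds the model was not trained on."""
    return y - cross_val_predict(model, X, y, cv=cv)


def best_exclusion_fraction(model, X, y, candidates, cv=5):
    """Coarse search (assumed form): return the candidate k with the best
    out-of-fold R^2 after dropping the k fraction with the largest residuals."""
    order = np.argsort(np.abs(out_of_fold_residuals(model, X, y, cv=cv)))
    best_k, best_r2 = None, -np.inf
    for k in candidates:
        keep = order[: len(y) - int(round(k * len(y)))]
        preds = cross_val_predict(model, X[keep], y[keep], cv=cv)
        score = r2_score(y[keep], preds)
        if score > best_r2:
            best_k, best_r2 = k, score
    return best_k, best_r2


def quality_scores(residuals, uncertainty):
    """exp(-|adjusted residual| / median), residuals scaled by uncertainty."""
    adjusted = np.abs(residuals) / (uncertainty + 1e-8)
    return np.exp(-adjusted / np.median(adjusted))


# Synthetic data: a linear signal with 10% heavily corrupted targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
corrupt_idx = rng.choice(200, size=20, replace=False)
y[corrupt_idx] += rng.normal(scale=10.0, size=20)

model = LinearRegression()
k, r2 = best_exclusion_fraction(model, X, y, [0.01, 0.05, 0.1, 0.15, 0.2])
scores = quality_scores(out_of_fold_residuals(model, X, y), uncertainty=1.0)

# Phase 2: drop the k fraction with the lowest quality scores, then refit.
flagged = np.argsort(scores)[: int(round(k * len(y)))]
final_model = LinearRegression().fit(
    np.delete(X, flagged, axis=0), np.delete(y, flagged)
)
print(f"chosen k={k}, out-of-fold R^2 after pruning={r2:.3f}")
```

A fine stage would repeat the same loop over a narrower grid around the best coarse value.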

The class also exposes get_epistemic_uncertainty and get_aleatoric_uncertainty as standalone methods, and provides save_space to free memory by deleting stored label issue DataFrames.
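
To make the two uncertainty terms concrete, here is a rough sketch of how they could be estimated with plain scikit-learn. The function names and bodies are illustrative assumptions based on the description above (bootstrapped variance of predictions, variance of residual predictions); they are not the signatures or internals of get_epistemic_uncertainty and get_aleatoric_uncertainty.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def epistemic_uncertainty(model, X, y, n_boot=5, seed=0):
    """Per-example variance of predictions across bootstrap refits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        boot_preds[b] = clone(model).fit(X[idx], y[idx]).predict(X)
    return boot_preds.var(axis=0)


def aleatoric_uncertainty(model, X, residuals, cv=5):
    """Assumed form: variance of out-of-fold predictions of |residual|."""
    residual_preds = cross_val_predict(model, X, np.abs(residuals), cv=cv)
    return float(np.var(residual_preds))


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression()
residuals = y - cross_val_predict(model, X, y, cv=5)
eu = epistemic_uncertainty(model, X, y, n_boot=5)
au = aleatoric_uncertainty(model, X, residuals)
print("mean epistemic:", eu.mean(), "aleatoric:", au)
```

With a single bootstrap round the per-example variance is zero, which is one reason a larger n_boot gives a more useful estimate.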

Usage

Import CleanLearning when you have a regression task with potentially noisy target values and want to: (1) automatically detect which examples have corrupted y-values, (2) train a model that is robust to label noise by excluding detected issues, or (3) estimate per-example label quality scores. The wrapped model must implement fit() and predict(); score() is optional.
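
As an illustration of the model interface described here, the sketch below defines a small custom regressor with fit(), predict(), and an inherited score(). ClosedFormRidge is a hypothetical class written for this example; it follows scikit-learn's estimator conventions (constructor arguments stored under the same attribute names) so that sklearn.base.clone can copy it.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class ClosedFormRidge(RegressorMixin, BaseEstimator):
    """Ridge regression solved in closed form, with an unpenalized intercept."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    @staticmethod
    def _with_bias(X):
        # Append a column of ones so the last coefficient acts as the intercept.
        X = np.asarray(X, dtype=float)
        return np.hstack([X, np.ones((X.shape[0], 1))])

    def fit(self, X, y, sample_weight=None):
        Xb = self._with_bias(X)
        y = np.asarray(y, dtype=float)
        if sample_weight is None:
            w = np.ones(len(y))
        else:
            w = np.asarray(sample_weight, dtype=float)
        penalty = self.alpha * np.eye(Xb.shape[1])
        penalty[-1, -1] = 0.0  # do not shrink the intercept
        A = Xb.T @ (w[:, None] * Xb) + penalty
        b = Xb.T @ (w * y)
        self.coef_ = np.linalg.solve(A, b)
        return self

    def predict(self, X):
        return self._with_bias(X) @ self.coef_


# score() comes from RegressorMixin and returns R-squared.
X_demo = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0
est = ClosedFormRidge(alpha=1e-8).fit(X_demo, y_demo)
print("R^2:", est.score(X_demo, y_demo))
```

An unfitted instance could then be passed as CleanLearning(model=ClosedFormRidge()), assuming cleanlab is installed.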

Code Reference

Source Location

  • Repository: Cleanlab
  • File: cleanlab/regression/learn.py
  • Lines: 1-872

Signature

class CleanLearning(BaseEstimator):
    def __init__(
        self,
        model: Optional[BaseEstimator] = None,
        *,
        cv_n_folds: int = 5,
        n_boot: int = 5,
        include_aleatoric_uncertainty: bool = True,
        verbose: bool = False,
        seed: Optional[bool] = None,
    )

    def fit(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        label_issues: Optional[Union[pd.DataFrame, np.ndarray]] = None,
        sample_weight: Optional[np.ndarray] = None,
        find_label_issues_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
        model_final_kwargs: Optional[dict] = None,
    ) -> BaseEstimator

    def predict(self, X: np.ndarray, *args, **kwargs) -> np.ndarray

    def score(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        sample_weight: Optional[np.ndarray] = None,
    ) -> float

    def find_label_issues(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        uncertainty: Optional[Union[np.ndarray, float]] = None,
        coarse_search_range: list = [0.01, 0.05, 0.1, 0.15, 0.2],
        fine_search_size: int = 3,
        save_space: bool = False,
        model_kwargs: Optional[dict] = None,
    ) -> pd.DataFrame

Import

from cleanlab.regression.learn import CleanLearning

I/O Contract

Constructor Inputs

Name | Type | Required | Description
model | BaseEstimator | No | Any sklearn-compatible regression model with fit() and predict(). Defaults to LinearRegression.
cv_n_folds | int | No | Number of cross-validation folds for out-of-sample predictions (default 5). Must be at least 2.
n_boot | int | No | Number of bootstrap resampling rounds for epistemic uncertainty estimation (default 5). Set to 0 to skip.
include_aleatoric_uncertainty | bool | No | Whether to estimate aleatoric uncertainty during issue detection (default True).
verbose | bool | No | Controls output verbosity (default False).
seed | int | No | Random seed for reproducibility.

fit() Inputs

Name | Type | Required | Description
X | np.ndarray or pd.DataFrame | Yes | Feature matrix of shape (N, ...).
y | LabelLike | Yes | Target values of shape (N,), some of which may be corrupted.
label_issues | pd.DataFrame or np.ndarray | No | Pre-computed label issues. If a DataFrame, it must contain an 'is_label_issue' column. If an array, it must be a boolean mask.
sample_weight | np.ndarray | No | Per-example weights of shape (N,) for the loss function.
find_label_issues_kwargs | dict | No | Extra keyword arguments for find_label_issues.
model_kwargs | dict | No | Keyword arguments passed to model.fit() in all calls.
model_final_kwargs | dict | No | Extra keyword arguments for the final model.fit() on clean data only.

fit() Output

Name | Type | Description
self | CleanLearning | The fitted estimator. Also stores self.label_issues_df (pd.DataFrame) and self.label_issues_mask (np.ndarray).

find_label_issues() Output

Name | Type | Description
label_issues_df | pd.DataFrame | DataFrame with columns: is_label_issue (bool), label_quality (float, 0-1), given_label (original y), predicted_label (model prediction).

predict() Output

Name | Type | Description
predictions | np.ndarray | Predicted target values from the cleaned model.

score() Output

Name | Type | Description
score | float | Model performance score on test data (uses the wrapped model's score method if it has one, otherwise R-squared).

Usage Examples

Basic Usage

from cleanlab.regression.learn import CleanLearning
from sklearn.linear_model import LinearRegression
import numpy as np

# Create noisy data
np.random.seed(42)
X = np.random.randn(200, 5)
y_true = X @ np.array([1, 2, 0, -1, 0.5])
y_noisy = y_true + np.random.randn(200) * 0.5
# Corrupt 10% of labels
corrupt_idx = np.random.choice(200, 20, replace=False)
y_noisy[corrupt_idx] += np.random.randn(20) * 10

# Train with CleanLearning
cl = CleanLearning(model=LinearRegression())
cl.fit(X, y_noisy)

# Predict as if trained on clean data
predictions = cl.predict(X)

Find Label Issues Only

cl = CleanLearning(model=LinearRegression(), verbose=True)
label_issues_df = cl.find_label_issues(X, y_noisy)

# Inspect flagged issues
flagged = label_issues_df[label_issues_df["is_label_issue"]]
print(f"Found {len(flagged)} label issues")
print(flagged.sort_values("label_quality").head(10))

Custom Model with Uncertainty

from sklearn.ensemble import GradientBoostingRegressor

cl = CleanLearning(
    model=GradientBoostingRegressor(),
    cv_n_folds=10,
    n_boot=10,
    include_aleatoric_uncertainty=True,
)
cl.fit(X, y_noisy)
print("Score on noisy data:", cl.score(X, y_noisy))
