Overview
The CleanLearning class wraps any scikit-learn-compatible regression model to enable robust training on datasets with noisy (corrupted) target values by automatically detecting and removing examples with label issues.
Description
CleanLearning in cleanlab.regression.learn implements a complete pipeline for noise-robust regression. It wraps any sklearn-compatible regression estimator and follows a three-phase approach:
- Label Issue Detection (via find_label_issues): Uses cross-validation to obtain out-of-fold predictions, computes residuals, and performs a two-stage grid search (coarse then fine) to find the optimal fraction k of data to exclude, maximizing the R-squared score. Uncertainty estimation combines epistemic uncertainty (bootstrapped variance of predictions) and aleatoric uncertainty (variance of residual predictions). Label quality scores are computed as exp(-|adjusted_residual| / median), where adjusted residuals are normalized by total uncertainty.
- Robust Training (via fit): Identifies label issues (or accepts pre-computed ones), prunes flagged examples from the training set, and trains the final model on the remaining clean data.
- Inference (via predict and score): Delegates directly to the wrapped model for predictions and evaluation.
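The label quality formula from the detection phase can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: the helper name label_quality_scores, the eps guard, and the reading of "median" as the median absolute adjusted residual are all assumptions made for the example.

```python
import numpy as np

def label_quality_scores(y, y_pred, uncertainty, eps=1e-8):
    """Return one score per example in [0, 1]; lower suggests a label issue."""
    residual = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    # Normalize absolute residuals by total uncertainty (eps avoids division by zero).
    adjusted = np.abs(residual) / (np.asarray(uncertainty, dtype=float) + eps)
    # Assumption: "median" means the median of the absolute adjusted residuals.
    scale = np.median(adjusted) + eps
    return np.exp(-adjusted / scale)

y = np.array([1.0, 2.0, 3.0, 4.0, 50.0])       # the last target is corrupted
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])   # stand-in for out-of-fold predictions
uncertainty = np.ones_like(y)                  # constant total uncertainty
scores = label_quality_scores(y, y_pred, uncertainty)
print(scores.round(3))
```

The corrupted example receives the lowest score because its residual is far larger than the typical (median) residual.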
The class also exposes get_epistemic_uncertainty and get_aleatoric_uncertainty as standalone methods, and provides save_space to free memory by deleting stored label issue DataFrames.
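To make the idea of epistemic uncertainty as "bootstrapped variance of predictions" concrete, here is a minimal NumPy sketch. It substitutes a plain least-squares fit for the wrapped model and refits on resampled rows; the helper names (fit_linear, predict_linear, bootstrap_epistemic_variance) are hypothetical and the library's own procedure may differ in detail.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a feature matrix."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bootstrap_epistemic_variance(X, y, n_boot=5, seed=0):
    """Per-example variance of predictions across n_boot bootstrap refits."""
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        coef = fit_linear(X[idx], y[idx])
        preds[b] = predict_linear(coef, X)      # predict on the original rows
    return preds.var(axis=0)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)
epistemic = bootstrap_epistemic_variance(X, y, n_boot=5)
```

Examples on which the refitted models disagree get a larger variance, which in turn shrinks their adjusted residual.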
Usage
Use CleanLearning when you have a regression task with potentially noisy target values and want to: (1) automatically detect which examples have corrupted y-values, (2) train a model that is robust to label noise by excluding detected issues, or (3) estimate per-example label quality scores. The wrapped model must implement fit() and predict(); score() is optional.
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/regression/learn.py
- Lines: 1-872
Signature
class CleanLearning(BaseEstimator):
    def __init__(
        self,
        model: Optional[BaseEstimator] = None,
        *,
        cv_n_folds: int = 5,
        n_boot: int = 5,
        include_aleatoric_uncertainty: bool = True,
        verbose: bool = False,
        seed: Optional[bool] = None,
    )

    def fit(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        label_issues: Optional[Union[pd.DataFrame, np.ndarray]] = None,
        sample_weight: Optional[np.ndarray] = None,
        find_label_issues_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
        model_final_kwargs: Optional[dict] = None,
    ) -> BaseEstimator

    def predict(self, X: np.ndarray, *args, **kwargs) -> np.ndarray

    def score(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        sample_weight: Optional[np.ndarray] = None,
    ) -> float

    def find_label_issues(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: LabelLike,
        *,
        uncertainty: Optional[Union[np.ndarray, float]] = None,
        coarse_search_range: list = [0.01, 0.05, 0.1, 0.15, 0.2],
        fine_search_size: int = 3,
        save_space: bool = False,
        model_kwargs: Optional[dict] = None,
    ) -> pd.DataFrame
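The coarse_search_range and fine_search_size parameters suggest a coarse-then-fine search over the fraction k. The sketch below shows one plausible shape of such a search. How the library actually builds its fine grid is not stated here, so the linspace between the neighbours of the best coarse value is an assumption, as are the function name two_stage_search and the toy objective standing in for the R-squared score.

```python
import numpy as np

def two_stage_search(score_fn, coarse_range, fine_search_size=3):
    """Return (best_k, best_score) from a coarse pass followed by a fine pass."""
    coarse = sorted(coarse_range)
    coarse_scores = [score_fn(k) for k in coarse]
    i = int(np.argmax(coarse_scores))
    # Assumption: the fine grid spans the neighbours of the best coarse value.
    lo = coarse[max(i - 1, 0)]
    hi = coarse[min(i + 1, len(coarse) - 1)]
    fine = [float(k) for k in np.linspace(lo, hi, fine_search_size + 2)[1:-1]]
    candidates = coarse + fine
    scores = coarse_scores + [score_fn(k) for k in fine]
    j = int(np.argmax(scores))
    return candidates[j], scores[j]

# Toy objective peaking at k = 0.12; a stand-in for "R-squared after dropping fraction k".
def toy_score(k):
    return -(k - 0.12) ** 2

best_k, best_score = two_stage_search(toy_score, [0.01, 0.05, 0.1, 0.15, 0.2])
print(best_k, best_score)
```

The coarse pass keeps the number of model refits small, and the fine pass spends a few extra evaluations only near the most promising value.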
Import
from cleanlab.regression.learn import CleanLearning
I/O Contract
Constructor Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | BaseEstimator | No | Any sklearn-compatible regression model with fit() and predict(). Defaults to LinearRegression. |
| cv_n_folds | int | No | Number of cross-validation folds for out-of-sample predictions (default 5). Must be at least 2. |
| n_boot | int | No | Number of bootstrap resampling rounds for epistemic uncertainty estimation (default 5). Set to 0 to skip. |
| include_aleatoric_uncertainty | bool | No | Whether to estimate aleatoric uncertainty during issue detection (default True). |
| verbose | bool | No | Controls output verbosity (default False). |
| seed | int | No | Random seed for reproducibility. |
fit() Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | np.ndarray or pd.DataFrame | Yes | Feature matrix of shape (N, ...). |
| y | LabelLike | Yes | Target values of shape (N,), some of which may be corrupted. |
| label_issues | pd.DataFrame or np.ndarray | No | Pre-computed label issues. If a DataFrame, it must contain an 'is_label_issue' column. If an array, it must be a boolean mask. |
| sample_weight | np.ndarray | No | Per-example weights of shape (N,) for the loss function. |
| find_label_issues_kwargs | dict | No | Extra keyword arguments for find_label_issues. |
| model_kwargs | dict | No | Keyword arguments passed to model.fit() in all calls. |
| model_final_kwargs | dict | No | Extra keyword arguments for the final model.fit() on clean data only. |
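The pruning step that a boolean label_issues mask drives can be illustrated with NumPy alone. This is a sketch of the idea rather than the library's code: the mask here is hand-made, and a least-squares fit stands in for the wrapped model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y_noisy = X @ np.array([2.0, -1.0])
y_noisy[:5] += 20.0                       # corrupt the first five targets

# Hypothetical pre-computed mask: True marks an example flagged as a label issue.
label_issues_mask = np.zeros(len(y_noisy), dtype=bool)
label_issues_mask[:5] = True

# Keep only unflagged rows; any per-example weights must be pruned the same way.
X_clean = X[~label_issues_mask]
y_clean = y_noisy[~label_issues_mask]
sample_weight = np.ones(len(y_noisy))
w_clean = sample_weight[~label_issues_mask]

# Final fit on the remaining data (least squares as a stand-in model).
coef, *_ = np.linalg.lstsq(X_clean, y_clean, rcond=None)
print(coef)
```

With the corrupted rows removed, the fit recovers the coefficients used to generate the data.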
fit() Output
| Name | Type | Description |
|---|---|---|
| self | CleanLearning | The fitted estimator. Also stores self.label_issues_df (pd.DataFrame) and self.label_issues_mask (np.ndarray). |
find_label_issues() Output
| Name | Type | Description |
|---|---|---|
| label_issues_df | pd.DataFrame | DataFrame with columns: is_label_issue (bool), label_quality (float, 0-1), given_label (original y), predicted_label (model prediction). |
predict() Output
| Name | Type | Description |
|---|---|---|
| predictions | np.ndarray | Predicted target values from the cleaned model. |
score() Output
| Name | Type | Description |
|---|---|---|
| score | float | Model performance score on test data (uses model's score method or R-squared). |
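For reference, the R-squared score mentioned above is the standard coefficient of determination. The unweighted sketch below shows the formula; the function name r_squared is illustrative, and sample_weight handling is left out for brevity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (unweighted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # exact predictions
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # always predicts the mean
print(perfect, baseline)
```

A score of 1 means exact predictions and 0 means no better than predicting the mean of y; the value is undefined when y is constant.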
Usage Examples
Basic Usage
from cleanlab.regression.learn import CleanLearning
from sklearn.linear_model import LinearRegression
import numpy as np
# Create noisy data
np.random.seed(42)
X = np.random.randn(200, 5)
y_true = X @ np.array([1, 2, 0, -1, 0.5])
y_noisy = y_true + np.random.randn(200) * 0.5
# Corrupt 10% of labels
corrupt_idx = np.random.choice(200, 20, replace=False)
y_noisy[corrupt_idx] += np.random.randn(20) * 10
# Train with CleanLearning
cl = CleanLearning(model=LinearRegression())
cl.fit(X, y_noisy)
# Predict as if trained on clean data
predictions = cl.predict(X)
Find Label Issues Only
cl = CleanLearning(model=LinearRegression(), verbose=True)
label_issues_df = cl.find_label_issues(X, y_noisy)
# Inspect flagged issues
flagged = label_issues_df[label_issues_df["is_label_issue"]]
print(f"Found {len(flagged)} label issues")
print(flagged.sort_values("label_quality").head(10))
Custom Model with Uncertainty
from sklearn.ensemble import GradientBoostingRegressor
cl = CleanLearning(
    model=GradientBoostingRegressor(),
    cv_n_folds=10,
    n_boot=10,
    include_aleatoric_uncertainty=True,
)
cl.fit(X, y_noisy)
print("Score on noisy data:", cl.score(X, y_noisy))