Principle: Cleanlab Noise-Robust Training
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Training methodology that automatically detects and removes mislabeled examples before fitting a classifier, producing a model robust to label noise.
Description
Noise-robust training automates the full pipeline of label cleaning and retraining. It first identifies label issues (via cross-validation and confident learning), then removes or down-weights mislabeled examples, and finally retrains the classifier on the cleaned dataset. This produces a model that performs better than one trained on the original noisy data, as it has been shielded from learning incorrect label patterns.
The training pipeline implemented by `CleanLearning.fit()` proceeds as follows:
- Step 1 -- Label issue detection: If `label_issues` are not provided, the method calls `find_label_issues()` internally. This runs cross-validation, estimates the confident joint, and identifies mislabeled examples.
- Step 2 -- Data pruning: Examples flagged as label issues are removed from the training set. Their indices are stored in `self.label_issues_df` for later inspection. If `sample_weight` is provided and the classifier supports it, mislabeled examples can be assigned zero weight instead of being removed entirely.
- Step 3 -- Final retraining: The wrapped classifier is fit on the cleaned (pruned) dataset. This final fit uses `clf_final_kwargs` if provided, allowing different hyperparameters for the final training than were used during the cross-validation phase.
- Step 4 -- State preservation: The fitted `CleanLearning` instance stores the label issues DataFrame, the estimated noise matrix, the inverse noise matrix, and the confident joint for post-hoc inspection.
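The four steps above can be condensed into a minimal, self-contained sketch. The names `PruneAndRetrain`, `detect_issues`, and `MajorityClassifier` are illustrative, not part of Cleanlab's API, and the detector is deliberately simplified: it flags examples whose out-of-sample predicted class disagrees with the given label, rather than running full confident learning.

```python
# Minimal prune-and-retrain sketch (illustrative names, not Cleanlab's API).
# Out-of-sample predicted probabilities are supplied directly, standing in
# for the cross-validation step.

def detect_issues(pred_probs, labels):
    """Step 1 (simplified): flag examples whose out-of-sample
    predicted class disagrees with the given label."""
    issues = []
    for probs, label in zip(pred_probs, labels):
        predicted = max(range(len(probs)), key=lambda j: probs[j])
        issues.append(predicted != label)
    return issues

class PruneAndRetrain:
    def __init__(self, clf):
        self.clf = clf                # wrapped base classifier
        self.label_issues_ = None     # Step 4: preserved for inspection

    def fit(self, X, labels, pred_probs):
        # Step 1: detect label issues
        self.label_issues_ = detect_issues(pred_probs, labels)
        # Step 2: prune flagged examples
        X_clean = [x for x, bad in zip(X, self.label_issues_) if not bad]
        y_clean = [y for y, bad in zip(labels, self.label_issues_) if not bad]
        # Step 3: final retraining on the cleaned data
        self.clf.fit(X_clean, y_clean)
        return self

class MajorityClassifier:
    """Toy stand-in for any classifier exposing a .fit() method."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

X = [[0.0], [0.1], [0.9], [1.0]]
labels = [0, 0, 1, 0]               # last example is mislabeled
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]

model = PruneAndRetrain(MajorityClassifier()).fit(X, labels, pred_probs)
print(model.label_issues_)          # [False, False, False, True]
```

The wrapper pattern mirrors the source's design choice: the base classifier is untouched, and cleaning happens entirely in the `fit()` orchestration around it.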
The key insight is that removing mislabeled examples before training is often more effective than trying to make the model robust to noise during training, because the model never sees the corrupted labels.
Usage
Use when training a classifier on data that may contain labeling errors and you want the model to be robust to those errors without manually cleaning the data.
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# The model is now trained on cleaned data

# Inspect what was removed:
print(cl.label_issues_df[cl.label_issues_df["is_label_issue"]].shape[0], "issues found")
```
Theoretical Basis
Learning with noisy labels:
- Detect mislabeled examples using confident learning. The confident joint C estimates the joint distribution of noisy (given) labels and true (latent) labels.
- Remove detected label issues from the training set (or assign
sample_weight=0if supported by the classifier). - Retrain the base classifier on the pruned/reweighted dataset.
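The confident joint C mentioned above can be sketched as a count matrix: entry C[i][j] counts examples with given label i whose predicted probability for class j clears class j's threshold (the mean self-confidence among examples labeled j). The function below is a simplification of that estimation step, not Cleanlab's implementation.

```python
# Sketch: estimating the confident joint C from out-of-sample predicted
# probabilities (simplified; not Cleanlab's implementation).
# C[i][j] counts examples given label i that are confidently class j.

def confident_joint(pred_probs, labels, num_classes):
    # Per-class thresholds: mean probability of class j among examples labeled j.
    sums, counts = [0.0] * num_classes, [0] * num_classes
    for probs, label in zip(pred_probs, labels):
        sums[label] += probs[label]
        counts[label] += 1
    t = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    C = [[0] * num_classes for _ in range(num_classes)]
    for probs, i in zip(pred_probs, labels):
        # Assign to the most likely class among those clearing their threshold.
        confident = [j for j in range(num_classes) if probs[j] >= t[j]]
        if confident:
            j = max(confident, key=lambda k: probs[k])
            C[i][j] += 1
    return C

pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
labels = [0, 0, 1, 0]               # last example likely mislabeled
print(confident_joint(pred_probs, labels, 2))  # [[2, 1], [0, 1]]
```

The off-diagonal entry C[0][1] = 1 is exactly the example whose given label (0) conflicts with its confident prediction (1); off-diagonal mass is what gets pruned.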
The noise-free training data leads to improved generalization. Formally, the true risk is

$$R(f) = \mathbb{E}_{(x,\, y^*) \sim \mathcal{D}}\left[\ell(f(x), y^*)\right],$$

where $y^*$ denotes the true (latent) label. Training on noisy labels $y$ instead optimizes a biased objective: the analogous expectation taken over $(x, y)$. By identifying and removing examples where $y$ differs from $y^*$, the training objective more closely approximates the true risk, leading to better generalization performance.
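A small numeric illustration of the bias (toy numbers, not from the source): evaluating a fixed predictor against noisy labels inflates the 0-1 loss, and dropping the corrupted example recovers the clean estimate on the remainder.

```python
# Toy illustration: noisy labels bias the (0-1 loss) training objective.
true_labels  = [0, 0, 1, 1, 1, 0]
noisy_labels = [0, 0, 1, 0, 1, 0]   # index 3 was corrupted (1 -> 0)
predictions  = [0, 0, 1, 1, 1, 0]   # a predictor matching the true labels

def zero_one_loss(preds, labels):
    return sum(p != l for p, l in zip(preds, labels)) / len(labels)

print(zero_one_loss(predictions, true_labels))   # 0.0 (true risk)
print(zero_one_loss(predictions, noisy_labels))  # ~0.167 (biased estimate)

# Removing the mislabeled example restores the clean estimate
# on the remaining data:
kept = [i for i in range(len(noisy_labels)) if i != 3]
print(zero_one_loss([predictions[i] for i in kept],
                    [noisy_labels[i] for i in kept]))  # 0.0
```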
The effectiveness of this approach depends on the accuracy of the label issue detection step. Confident learning achieves high precision by using per-class thresholds calibrated to the model's confidence distribution, which means most removed examples are genuinely mislabeled.