Principle: Cleanlab Noise-Robust Training
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Training methodology that automatically detects and removes mislabeled examples before fitting a classifier, producing a model robust to label noise.
Description
Noise-robust training automates the full pipeline of label cleaning and retraining. It first identifies label issues (via cross-validation and confident learning), then removes or down-weights mislabeled examples, and finally retrains the classifier on the cleaned dataset. This produces a model that performs better than one trained on the original noisy data, as it has been shielded from learning incorrect label patterns.
The training pipeline implemented by `CleanLearning.fit()` proceeds as follows:
- Step 1 -- Label issue detection: If `label_issues` are not provided, the method calls `find_label_issues()` internally. This runs cross-validation, estimates the confident joint, and identifies mislabeled examples.
- Step 2 -- Data pruning: Examples flagged as label issues are removed from the training set. Their indices are stored in `self.label_issues_df` for later inspection. If `sample_weight` is provided and the classifier supports it, mislabeled examples can be assigned zero weight instead of being removed entirely.
- Step 3 -- Final retraining: The wrapped classifier is fit on the cleaned (pruned) dataset. This final fit uses `clf_final_kwargs` if provided, allowing different hyperparameters for the final training than were used during the cross-validation phase.
- Step 4 -- State preservation: The fitted `CleanLearning` instance stores the label issues DataFrame, the estimated noise matrix, the inverse noise matrix, and the confident joint for post-hoc inspection.
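The four steps above can be condensed into a minimal, self-contained sketch. The names `PruneAndRetrain`, `detect_issues`, and `MajorityClassifier` are illustrative, not part of Cleanlab's API, and the detector is deliberately simplified: it flags examples whose out-of-sample predicted class disagrees with the given label, rather than running full confident learning.

```python
# Minimal prune-and-retrain sketch (illustrative names, not Cleanlab's API).
# Out-of-sample predicted probabilities are supplied directly, standing in
# for the cross-validation step.

def detect_issues(pred_probs, labels):
    """Step 1 (simplified): flag examples whose out-of-sample
    predicted class disagrees with the given label."""
    issues = []
    for probs, label in zip(pred_probs, labels):
        predicted = max(range(len(probs)), key=lambda j: probs[j])
        issues.append(predicted != label)
    return issues

class PruneAndRetrain:
    def __init__(self, clf):
        self.clf = clf                # wrapped base classifier
        self.label_issues_ = None     # Step 4: preserved for inspection

    def fit(self, X, labels, pred_probs):
        # Step 1: detect label issues
        self.label_issues_ = detect_issues(pred_probs, labels)
        # Step 2: prune flagged examples
        X_clean = [x for x, bad in zip(X, self.label_issues_) if not bad]
        y_clean = [y for y, bad in zip(labels, self.label_issues_) if not bad]
        # Step 3: final retraining on the cleaned data
        self.clf.fit(X_clean, y_clean)
        return self

class MajorityClassifier:
    """Toy stand-in for any classifier exposing a .fit() method."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

X = [[0.0], [0.1], [0.9], [1.0]]
labels = [0, 0, 1, 0]               # last example is mislabeled
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]

model = PruneAndRetrain(MajorityClassifier()).fit(X, labels, pred_probs)
print(model.label_issues_)          # [False, False, False, True]
```

The wrapper pattern mirrors the source's design choice: the base classifier is untouched, and cleaning happens entirely in the `fit()` orchestration around it.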
The key insight is that removing mislabeled examples before training is often more effective than trying to make the model robust to noise during training, because the model never sees the corrupted labels.
Usage
Use when training a classifier on data that may contain labeling errors and you want the model to be robust to those errors without manually cleaning the data.
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# The model is now trained on cleaned data

# Inspect what was removed:
print(cl.label_issues_df[cl.label_issues_df["is_label_issue"]].shape[0], "issues found")
```
Theoretical Basis
Learning with noisy labels:
- Detect mislabeled examples using confident learning. The confident joint C estimates the joint distribution of noisy (given) labels and true (latent) labels.
- Remove detected label issues from the training set (or assign
sample_weight=0if supported by the classifier). - Retrain the base classifier on the pruned/reweighted dataset.
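The confident joint C mentioned above can be sketched as a count matrix: entry C[i][j] counts examples with given label i whose predicted probability for class j clears class j's threshold (the mean self-confidence among examples labeled j). The function below is a simplification of that estimation step, not Cleanlab's implementation.

```python
# Sketch: estimating the confident joint C from out-of-sample predicted
# probabilities (simplified; not Cleanlab's implementation).
# C[i][j] counts examples given label i that are confidently class j.

def confident_joint(pred_probs, labels, num_classes):
    # Per-class thresholds: mean probability of class j among examples labeled j.
    sums, counts = [0.0] * num_classes, [0] * num_classes
    for probs, label in zip(pred_probs, labels):
        sums[label] += probs[label]
        counts[label] += 1
    t = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    C = [[0] * num_classes for _ in range(num_classes)]
    for probs, i in zip(pred_probs, labels):
        # Assign to the most likely class among those clearing their threshold.
        confident = [j for j in range(num_classes) if probs[j] >= t[j]]
        if confident:
            j = max(confident, key=lambda k: probs[k])
            C[i][j] += 1
    return C

pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
labels = [0, 0, 1, 0]               # last example likely mislabeled
print(confident_joint(pred_probs, labels, 2))  # [[2, 1], [0, 1]]
```

The off-diagonal entry C[0][1] = 1 is exactly the example whose given label (0) conflicts with its confident prediction (1); off-diagonal mass is what gets pruned.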
The noise-free training data leads to improved generalization. Formally, the true risk is

$$R(f) = \mathbb{E}_{(x,\, y^*) \sim \mathcal{D}}\left[\ell(f(x), y^*)\right],$$

where $y^*$ denotes the true (latent) label. Training on noisy labels $y$ instead optimizes a biased objective: the analogous expectation taken over $(x, y)$. By identifying and removing examples where $y$ differs from $y^*$, the training objective more closely approximates the true risk, leading to better generalization performance.
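A small numeric illustration of the bias (toy numbers, not from the source): evaluating a fixed predictor against noisy labels inflates the 0-1 loss, and dropping the corrupted example recovers the clean estimate on the remainder.

```python
# Toy illustration: noisy labels bias the (0-1 loss) training objective.
true_labels  = [0, 0, 1, 1, 1, 0]
noisy_labels = [0, 0, 1, 0, 1, 0]   # index 3 was corrupted (1 -> 0)
predictions  = [0, 0, 1, 1, 1, 0]   # a predictor matching the true labels

def zero_one_loss(preds, labels):
    return sum(p != l for p, l in zip(preds, labels)) / len(labels)

print(zero_one_loss(predictions, true_labels))   # 0.0 (true risk)
print(zero_one_loss(predictions, noisy_labels))  # ~0.167 (biased estimate)

# Removing the mislabeled example restores the clean estimate
# on the remaining data:
kept = [i for i in range(len(noisy_labels)) if i != 3]
print(zero_one_loss([predictions[i] for i in kept],
                    [noisy_labels[i] for i in kept]))  # 0.0
```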
The effectiveness of this approach depends on the accuracy of the label issue detection step. Confident learning achieves high precision by using per-class thresholds calibrated to the model's confidence distribution, which means most removed examples are genuinely mislabeled.