Workflow: Cleanlab CleanLearning Robust Training
| Knowledge Sources | |
|---|---|
| Domains | Data_Centric_AI, Classification, Robust_Training |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for training a robust classifier on noisy labeled data using cleanlab's CleanLearning wrapper.
Description
This workflow uses cleanlab's CleanLearning class to wrap any scikit-learn-compatible classifier and automate the entire label-cleaning pipeline: cross-validation for out-of-sample predictions, label issue detection, removal of mislabeled examples, and retraining on the cleaned dataset. CleanLearning extends sklearn's BaseEstimator interface, so it integrates seamlessly into existing ML pipelines. The result is a model whose performance approaches that of one trained on correctly labeled data, without requiring manual data cleaning.
Usage
Execute this workflow when you have a classification dataset with potentially noisy labels and want to train a model that is robust to label errors with minimal effort. This is appropriate when you want a single high-level API that handles the entire process (detect issues, clean data, retrain) rather than manually orchestrating the low-level count/filter/rank pipeline. Your classifier must follow the scikit-learn estimator API (fit, predict, predict_proba, score). For non-sklearn models, use adapter libraries like skorch (PyTorch) or SciKeras (Keras).
Execution Steps
Step 1: Prepare Classifier and Data
Select an sklearn-compatible classifier and prepare your feature matrix X and noisy label array y. The classifier must implement fit, predict, predict_proba, and score methods. Ensure the classifier is properly clonable via sklearn.base.clone, as CleanLearning creates multiple instances internally during cross-validation.
Key considerations:
- Labels must be integers in 0, 1, ..., K-1 where K is the number of classes
- The classifier should support sample_weight in its fit method for optimal results (optional but recommended)
- For PyTorch models, use skorch to wrap them as sklearn estimators
- For Keras models, use SciKeras to wrap them as sklearn estimators
- Neural network weights should be initialized inside fit(), not __init__()
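The checks above can be exercised with plain scikit-learn before involving cleanlab; the RandomForestClassifier and synthetic data here are arbitrary choices for illustration:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature matrix X and integer labels in 0..K-1
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)
labels = y.astype(int)

clf = RandomForestClassifier(random_state=0)

# CleanLearning clones the estimator internally during cross-validation,
# so the classifier must survive sklearn.base.clone
clf2 = clone(clf)

# The four methods CleanLearning relies on
assert all(hasattr(clf, m) for m in ("fit", "predict", "predict_proba", "score"))
```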
Step 2: Initialize CleanLearning
Create a CleanLearning instance by passing your base classifier. Optionally configure parameters such as the cross-validation strategy (cv_n_folds), the filtering method for label issue detection, and the label quality scoring method.
Key considerations:
- Default cross-validation uses 5-fold stratified splitting
- The seed parameter controls reproducibility of cross-validation splits
- Verbose mode provides progress information during the pipeline
- The find_label_issues_kwargs parameter allows fine-tuning of the issue detection stage
Step 3: Find Label Issues
Call find_label_issues with your data and labels to identify mislabeled examples. Internally, this runs cross-validation to produce out-of-sample predicted probabilities, estimates the confident joint, and applies the configured filtering strategy. The call returns a DataFrame with per-example label quality scores and issue flags.
Key considerations:
- You can optionally pass pre-computed pred_probs to skip cross-validation
- Per-class probability thresholds or a pre-computed noise matrix can also be supplied via keyword arguments
- The returned DataFrame contains columns for predicted labels, label quality scores, and issue indicators
- This step does not modify the model or data; it only identifies issues
Step 4: Fit on Cleaned Data
Call fit with your data and labels. This method internally runs find_label_issues (if not already done), removes the detected mislabeled examples from the training set, and retrains the classifier on the cleaned subset. The resulting model should perform better than one trained on the full noisy dataset.
Key considerations:
- fit accepts pre-computed issues via its label_issues argument; after the flagged examples are pruned, the remaining examples are reweighted per class (via sample_weight, when the classifier supports it), and the boolean mask of removed examples is stored in the label_issues_mask attribute
- The confident joint and noise matrices are stored as attributes after fitting
- The cleaned model is accessible via standard sklearn predict/predict_proba/score methods
- Dataset-level statistics (number of issues, noise rates) are stored for inspection
Step 5: Evaluate the Robust Model
Use the trained CleanLearning model to make predictions on test data. Compare performance against a baseline model trained on the uncleaned data. Inspect stored attributes like the confident joint, noise matrices, and per-example label quality scores to understand the noise structure in your data.
Key considerations:
- Use predict and predict_proba for inference, same as any sklearn estimator
- The confident_joint attribute reveals the estimated noise structure
- The label_issues_df attribute contains detailed per-example diagnostics
- Compare accuracy, F1, and other metrics against a baseline to quantify improvement