Principle: Cleanlab CleanLearning Initialization
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Configuration of a noise-robust learning wrapper that enhances any scikit-learn compatible classifier with automatic label issue detection and data cleaning capabilities.
Description
CleanLearning initialization wraps an sklearn-compatible classifier with cleanlab's confident learning pipeline. It configures cross-validation parameters, label issue detection settings, and quality scoring methods. The wrapper implements sklearn's BaseEstimator interface (fit/predict/predict_proba/score) so it can be used as a drop-in replacement for any sklearn classifier while automatically handling noisy labels.
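The out-of-sample predicted probabilities that drive this pipeline can be sketched with scikit-learn alone (a minimal sketch assuming scikit-learn is available; CleanLearning performs an equivalent step internally):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy 3-class dataset standing in for real tabular data
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)

# Each example's probabilities come from a model that never saw it during
# training, mirroring what CleanLearning does with cv_n_folds=5
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

print(pred_probs.shape)  # (300, 3): one probability vector per example
```

These out-of-sample `pred_probs` are what the confident-learning filters consume; using in-sample probabilities instead would let an overfit model mask its own label errors.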
The initialization step establishes several key configuration parameters:
- Base classifier (`clf`): Any sklearn-compatible estimator. Defaults to `LogisticRegression` if not specified.
- Cross-validation folds (`cv_n_folds`): Number of stratified K-fold splits used to generate out-of-sample predicted probabilities. Defaults to 5.
- Label issue detection kwargs (`find_label_issues_kwargs`): Dictionary of parameters forwarded to `filter.find_label_issues`, controlling the filter strategy (e.g., `prune_by_noise_rate`, `prune_by_class`, `both`, `confident_learning`).
- Label quality scores kwargs (`label_quality_scores_kwargs`): Dictionary of parameters forwarded to `rank.get_label_quality_scores`, controlling how examples are scored for label quality.
- PU learning (`pulearning`): Optional integer specifying the positive class index for positive-unlabeled learning scenarios.
- Memory management (`low_memory`): When enabled, reduces memory usage during cross-validation at the cost of runtime.
Because CleanLearning inherits from BaseEstimator, it participates in sklearn's ecosystem: it can be used inside pipelines, with grid search, and with any tool that expects the sklearn estimator interface.
Usage
Use when you want to train a classifier that is robust to noisy labels without manually running the label cleaning pipeline. CleanLearning is the recommended entry point when you have tabular data, an sklearn classifier, and suspect that some labels may be incorrect.
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

# Wrap any sklearn classifier with noise-robust training
cl = CleanLearning(
    clf=GradientBoostingClassifier(),
    cv_n_folds=5,
    find_label_issues_kwargs={"filter_by": "prune_by_noise_rate"},
    verbose=True,
)
```
Theoretical Basis
Wrapper pattern: Delegate classification to an underlying estimator while intercepting the training pipeline to insert label cleaning steps. Configure the pipeline parameters upfront: number of CV folds for pred_probs estimation, filter strategy for label issue detection, and quality scoring method for prioritization.
The wrapper pattern ensures that the label cleaning logic is transparent to downstream consumers. Any code that accepts an sklearn estimator can accept a CleanLearning instance without modification, because the interface contract is preserved. The cleaning logic is injected solely within the fit() method, while predict() and predict_proba() pass through directly to the wrapped classifier.
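The delegation structure can be sketched in plain Python. All names below are hypothetical stand-ins; the real CleanLearning runs cross-validation and confident learning inside `fit()` rather than taking a simple callback:

```python
class MajorityClassifier:
    """Toy stand-in for any sklearn-style classifier."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)


class CleaningWrapper:
    """Hypothetical wrapper: fit() filters flagged examples before
    delegating; predict() passes straight through to the wrapped clf."""
    def __init__(self, clf, find_issues):
        self.clf = clf                  # wrapped estimator
        self.find_issues = find_issues  # callable returning indices of suspect labels

    def fit(self, X, y):
        bad = set(self.find_issues(X, y))
        X_clean = [x for i, x in enumerate(X) if i not in bad]
        y_clean = [t for i, t in enumerate(y) if i not in bad]
        self.clf.fit(X_clean, y_clean)  # cleaning happens only inside fit()
        return self

    def predict(self, X):
        return self.clf.predict(X)      # pure delegation: interface preserved


# Indices 2 and 3 are flagged as label issues; after dropping them,
# the majority label flips from 1 to 0
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 1, 1, 1]
wrapped = CleaningWrapper(MajorityClassifier(), find_issues=lambda X, y: [2, 3])
print(wrapped.fit(X, y).predict([[9]]))  # [0]
```

Because `CleaningWrapper` exposes the same `fit`/`predict` surface as the estimator it wraps, any caller written against that interface works with either object unchanged.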