Principle: Cleanlab CleanLearning Initialization
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Configuration of a noise-robust learning wrapper that enhances any scikit-learn compatible classifier with automatic label issue detection and data cleaning capabilities.
Description
CleanLearning initialization wraps an sklearn-compatible classifier with cleanlab's confident learning pipeline. It configures cross-validation parameters, label issue detection settings, and quality scoring methods. The wrapper implements sklearn's BaseEstimator interface (fit/predict/predict_proba/score) so it can be used as a drop-in replacement for any sklearn classifier while automatically handling noisy labels.
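The out-of-sample predicted probabilities that drive this pipeline can be sketched with scikit-learn alone (a minimal sketch assuming scikit-learn is available; CleanLearning performs an equivalent step internally):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy 3-class dataset standing in for real tabular data
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)

# Each example's probabilities come from a model that never saw it during
# training, mirroring what CleanLearning does with cv_n_folds=5
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

print(pred_probs.shape)  # (300, 3): one probability vector per example
```

These out-of-sample `pred_probs` are what the confident-learning filters consume; using in-sample probabilities instead would let an overfit model mask its own label errors.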
The initialization step establishes several key configuration parameters:
- Base classifier (`clf`): Any sklearn-compatible estimator. Defaults to `LogisticRegression` if not specified.
- Cross-validation folds (`cv_n_folds`): Number of stratified K-fold splits used to generate out-of-sample predicted probabilities. Defaults to 5.
- Label issue detection kwargs (`find_label_issues_kwargs`): Dictionary of parameters forwarded to `filter.find_label_issues`, controlling the filter strategy (e.g., `prune_by_noise_rate`, `prune_by_class`, `both`, `confident_learning`).
- Label quality scores kwargs (`label_quality_scores_kwargs`): Dictionary of parameters forwarded to `rank.get_label_quality_scores`, controlling how examples are scored for label quality.
- PU learning (`pulearning`): Optional integer specifying the positive class index for positive-unlabeled learning scenarios.
- Memory management (`low_memory`): When enabled, reduces memory usage during cross-validation at the cost of runtime.
Because CleanLearning inherits from BaseEstimator, it participates in sklearn's ecosystem: it can be used inside pipelines, with grid search, and with any tool that expects the sklearn estimator interface.
Usage
Use when you want to train a classifier that is robust to noisy labels without manually running the label cleaning pipeline. CleanLearning is the recommended entry point when you have tabular data, an sklearn classifier, and suspect that some labels may be incorrect.
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

# Wrap any sklearn classifier with noise-robust training
cl = CleanLearning(
    clf=GradientBoostingClassifier(),
    cv_n_folds=5,
    find_label_issues_kwargs={"filter_by": "prune_by_noise_rate"},
    verbose=True,
)
```
Theoretical Basis
Wrapper pattern: Delegate classification to an underlying estimator while intercepting the training pipeline to insert label cleaning steps. Configure the pipeline parameters upfront: number of CV folds for pred_probs estimation, filter strategy for label issue detection, and quality scoring method for prioritization.
The wrapper pattern ensures that the label cleaning logic is transparent to downstream consumers. Any code that accepts an sklearn estimator can accept a CleanLearning instance without modification, because the interface contract is preserved. The cleaning logic is injected solely within the fit() method, while predict() and predict_proba() pass through directly to the wrapped classifier.
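The delegation structure can be sketched in plain Python. All names below are hypothetical stand-ins; the real CleanLearning runs cross-validation and confident learning inside `fit()` rather than taking a simple callback:

```python
class MajorityClassifier:
    """Toy stand-in for any sklearn-style classifier."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)


class CleaningWrapper:
    """Hypothetical wrapper: fit() filters flagged examples before
    delegating; predict() passes straight through to the wrapped clf."""
    def __init__(self, clf, find_issues):
        self.clf = clf                  # wrapped estimator
        self.find_issues = find_issues  # callable returning indices of suspect labels

    def fit(self, X, y):
        bad = set(self.find_issues(X, y))
        X_clean = [x for i, x in enumerate(X) if i not in bad]
        y_clean = [t for i, t in enumerate(y) if i not in bad]
        self.clf.fit(X_clean, y_clean)  # cleaning happens only inside fit()
        return self

    def predict(self, X):
        return self.clf.predict(X)      # pure delegation: interface preserved


# Indices 2 and 3 are flagged as label issues; after dropping them,
# the majority label flips from 1 to 0
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 1, 1, 1]
wrapped = CleaningWrapper(MajorityClassifier(), find_issues=lambda X, y: [2, 3])
print(wrapped.fit(X, y).predict([[9]]))  # [0]
```

Because `CleaningWrapper` exposes the same `fit`/`predict` surface as the estimator it wraps, any caller written against that interface works with either object unchanged.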