Principle: Cleanlab Integrated Label Issue Detection
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end label issue detection that combines cross-validation, confident learning, and quality scoring into a single method call.
Description
Integrated label issue detection is CleanLearning's method for finding mislabeled examples. It internally performs cross-validation to obtain out-of-sample predictions (if not provided), estimates the confident joint, applies a filtering strategy to identify label issues, and computes quality scores for each example. All results are returned as a DataFrame with is_label_issue, label_quality, given_label, and predicted_label columns.
The pipeline proceeds in the following stages:
- Stage 1 -- Out-of-sample predictions: If `pred_probs` are not provided, the method runs stratified K-fold cross-validation using the wrapped classifier. Each example receives a predicted probability vector from a model that was not trained on that example, avoiding the overfitting bias that would occur if the model predicted on its own training data.
- Stage 2 -- Confident joint estimation: Using the out-of-sample `pred_probs` and the given labels, the method estimates the confident joint matrix C, which counts how many examples are confidently classified as having true label i while being given label j. From this, the noise matrices are derived.
- Stage 3 -- Label issue filtering: The configured filter strategy (e.g., `prune_by_noise_rate`, `prune_by_class`, `both`, `confident_learning`) is applied to identify which examples are likely mislabeled.
- Stage 4 -- Quality scoring: Each example is assigned a label quality score between 0 and 1, where lower scores indicate a higher likelihood of being mislabeled. This enables ranking and prioritization for human review.
- Stage 5 -- Result aggregation: All results are combined into a single `pd.DataFrame` with boolean issue flags, quality scores, given labels, and predicted labels.
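The stages above can be sketched in plain NumPy. This is a simplified illustration, not cleanlab's actual implementation: it assumes `pred_probs` were already computed out-of-sample (Stage 1), uses a crude "confident prediction disagrees with given label" filter for Stage 3, and uses self-confidence as the Stage 4 quality score.

```python
import numpy as np

def find_label_issues_sketch(labels, pred_probs):
    """Illustrative confident-learning pipeline (not cleanlab's implementation)."""
    n, num_classes = pred_probs.shape
    # Stage 2a: per-class thresholds -- mean predicted probability of
    # class i among examples whose given label is i.
    thresholds = np.array(
        [pred_probs[labels == i, i].mean() for i in range(num_classes)]
    )
    # Stage 2b: confident joint C[i, j] counts examples given label j
    # that are confidently predicted as class i.
    confident_joint = np.zeros((num_classes, num_classes), dtype=int)
    is_issue = np.zeros(n, dtype=bool)
    for idx in range(n):
        probs, j = pred_probs[idx], labels[idx]
        # Classes for which this example exceeds the class threshold.
        confident = np.where(probs >= thresholds)[0]
        if confident.size:
            i = confident[np.argmax(probs[confident])]
            confident_joint[i, j] += 1
            # Stage 3 (crude filter): confidently predicted class
            # disagrees with the given label.
            is_issue[idx] = bool(i != j)
    # Stage 4: self-confidence score -- probability assigned to the given label.
    quality = pred_probs[np.arange(n), labels]
    return is_issue, quality, confident_joint
```

Stage 5 would then assemble these arrays into a DataFrame alongside the given and predicted labels.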
Usage
Use when you want to find label issues as part of a CleanLearning pipeline, especially before calling fit() to train on cleaned data. This method can also be used standalone for data auditing and label quality assessment.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

cl = CleanLearning(clf=LogisticRegression())
label_issues_df = cl.find_label_issues(X, labels)

# Inspect the most likely mislabeled examples
worst_examples = label_issues_df.sort_values("label_quality").head(20)
```
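A common follow-up is to build a review queue from the returned DataFrame: keep only flagged rows and order them worst-first. A small sketch with hypothetical values in the four columns described above:

```python
import pandas as pd

# Hypothetical results with the columns find_label_issues returns.
label_issues_df = pd.DataFrame({
    "is_label_issue": [False, True, False, True],
    "label_quality": [0.95, 0.12, 0.88, 0.30],
    "given_label": [0, 1, 1, 0],
    "predicted_label": [0, 0, 1, 1],
})

# Flagged examples only, lowest quality (most suspicious) first.
review_queue = (
    label_issues_df[label_issues_df["is_label_issue"]]
    .sort_values("label_quality")
)
```

Ranking by `label_quality` rather than reviewing in dataset order concentrates human effort on the examples most likely to be mislabeled.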
Theoretical Basis
Pipeline composition:
- If `pred_probs` are not provided, run stratified K-fold cross-validation to get out-of-sample predictions.
- Estimate the confident joint Cij and derive the noise matrices.
- Apply `filter.find_label_issues` with the configured strategy to produce boolean `is_label_issue` flags.
- Compute `rank.get_label_quality_scores` for each example.
- Combine the results into a structured DataFrame.
The confident joint C is estimated by thresholding the predicted probabilities per class. For each example x with given label y = j, if the model's predicted probability for class i exceeds a class-specific threshold ti, the example is counted in Cij. The threshold ti is the average predicted probability of class i among examples given label i:

ti = (1 / |X_{y=i}|) * Σ_{x ∈ X_{y=i}} p̂(y = i; x)
This ensures the thresholds adapt to each class's prediction confidence distribution.
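A tiny worked example of the threshold rule, with illustrative probability values:

```python
import numpy as np

# Three examples given label 0, two given label 1 (illustrative values).
labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.8, 0.2],
    [0.4, 0.6],
    [0.2, 0.8],
])

# ti = mean predicted probability of class i among examples given label i.
t0 = pred_probs[labels == 0, 0].mean()  # (0.9 + 0.7 + 0.8) / 3 = 0.8
t1 = pred_probs[labels == 1, 1].mean()  # (0.6 + 0.8) / 2 = 0.7

# A new example given label j = 1 with p(class 0) = 0.9 >= t0 would be
# counted in C[0, 1]: confidently class 0, yet labeled 1.
```

Because a confidently-predicted class 0 has a higher bar (t0 = 0.8) than class 1 (t1 = 0.7), each class's counting rule reflects how confident the model typically is on that class.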