
Workflow:Cleanlab Classification Label Issue Detection

From Leeroopedia


Knowledge Sources
Domains Data_Centric_AI, Classification, Label_Quality
Last Updated 2026-02-09 19:00 GMT

Overview

End-to-end process for detecting mislabeled examples in a multi-class classification dataset using cleanlab's low-level Confident Learning API.

Description

This workflow implements the core Confident Learning pipeline for identifying label errors in classification data. Given a set of noisy labels and model-predicted class probabilities (pred_probs), it estimates the joint distribution of true and given labels, identifies which examples are likely mislabeled, and ranks them by severity. The pipeline proceeds through three stages: statistical estimation of label noise structure (count), identification of label issues (filter), and quality scoring/ranking of individual examples (rank). This approach works with any classifier that can produce predicted probabilities.

Usage

Execute this workflow when you have a classification dataset with potentially noisy labels and a trained model that can produce out-of-sample predicted probabilities for each example. This is the foundational cleanlab workflow appropriate when you want fine-grained control over the label issue detection process, need to inspect intermediate statistics like the confident joint or noise matrices, or are working at scale and need direct access to the low-level API. For a simpler all-in-one experience, consider the Datalab Dataset Audit workflow instead.

Execution Steps

Step 1: Obtain Out-of-Sample Predicted Probabilities

Train your classifier using cross-validation to produce out-of-sample predicted probabilities (pred_probs) for every example in the dataset. Each example's pred_probs should come from a model that was not trained on that example. This is critical for accurate label issue detection because in-sample predictions are overconfident and mask errors.

Key considerations:

  • Use K-fold cross-validation (K=5 or K=10 recommended) so every example gets out-of-sample predictions
  • The pred_probs array should have shape (N, K), where N is the number of examples and K is the number of classes
  • Columns must be ordered to correspond to classes 0, 1, ..., K-1
  • cleanlab's count module includes cross-validation helpers that can compute these out-of-sample pred_probs for you if needed
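A minimal sketch of this step using scikit-learn's cross_val_predict; the synthetic dataset and the LogisticRegression classifier are illustrative stand-ins for your own data and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Illustrative stand-in for your real dataset and classifier.
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# cv=5 gives 5-fold cross-validation: each example's probability row comes
# from a model whose training folds excluded that example.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

print(pred_probs.shape)  # (300, 3): one row per example, one column per class
```

Any classifier exposing predict_proba works here; the resulting pred_probs array feeds every later stage of the pipeline.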

Step 2: Estimate the Confident Joint

Compute the confident joint matrix, which estimates the joint distribution of noisy (given) labels and true labels. This K x K matrix counts how many examples with given label i are estimated to truly belong to class j. The algorithm uses per-class confidence thresholds (the average model self-confidence for examples in each class) to determine which examples are "confidently" assigned to each (given, true) pair.

Key considerations:

  • The confident joint is the statistical backbone of Confident Learning
  • It can be calibrated so rows sum to match the observed label distribution
  • From the confident joint, noise matrices P(given|true) and P(true|given) can be derived
  • The estimated prior p(true label) reveals class balance in the noise-free distribution

Step 3: Filter for Label Issues

Use the confident joint and predicted probabilities to identify which specific examples have label issues. Multiple filtering strategies are available, including pruning by class, pruning by noise rate, and a combined approach. The default "confident learning" method uses the confident joint to determine the number of issues per (given, true) class pair, then selects examples with the lowest label quality within each pair.

Key considerations:

  • Seven filtering strategies are available; the default "confident_learning" method is recommended
  • Multiprocessing support is available for large datasets
  • Returns a boolean mask or list of indices identifying problematic examples
  • The number of detected issues depends on the filtering method and can vary
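A pared-down numpy sketch of the default strategy: cleanlab's filter.find_label_issues implements this for real (plus the alternative filter_by strategies, ranked-index output, and multiprocessing), so treat the function below only as an illustration of the core idea.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Flag examples whose confidently predicted class (per-class
    thresholding, as in the confident joint) disagrees with the given label."""
    n, k = pred_probs.shape
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    issues = np.zeros(n, dtype=bool)
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if len(above):
            issues[i] = above[np.argmax(pred_probs[i, above])] != labels[i]
    return issues  # boolean mask: True marks a likely label issue

labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
)
print(np.flatnonzero(find_label_issues(labels, pred_probs)))  # [2]
```

The mask marks exactly the examples that fall in off-diagonal cells of the confident joint, which is why the number of detected issues tracks the off-diagonal mass of that matrix.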

Step 4: Rank Examples by Label Quality

Compute a continuous label quality score for each example, ranging from 0 (likely mislabeled) to 1 (likely correct). This enables prioritizing which examples to review first. Scoring methods include self-confidence (probability of the given label), normalized margin (gap between top two class probabilities), and confidence-weighted entropy.

Key considerations:

  • Self-confidence is the default and most intuitive scoring method
  • Normalized margin is useful when classes have imbalanced confidence levels
  • Scores can be adjusted using confident thresholds for better calibration
  • Use these scores to create a ranked review queue for human annotators
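The two main scoring methods can be written down directly from pred_probs. This mirrors the ideas behind cleanlab.rank.get_label_quality_scores but is a simplified sketch, not the library's exact implementation.

```python
import numpy as np

def label_quality_scores(labels, pred_probs, method="self_confidence"):
    """Per-example label quality in [0, 1]; higher means more likely correct."""
    idx = np.arange(len(labels))
    self_conf = pred_probs[idx, labels]
    if method == "self_confidence":
        # Probability the model assigns to the given label.
        return self_conf
    if method == "normalized_margin":
        # Gap between the given label's probability and the best other class,
        # shifted from [-1, 1] into [0, 1].
        others = pred_probs.copy()
        others[idx, labels] = -np.inf
        return (self_conf - others.max(axis=1) + 1) / 2
    raise ValueError(f"unknown method: {method}")

labels = np.array([0, 2])
pred_probs = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
scores = label_quality_scores(labels, pred_probs)
margins = label_quality_scores(labels, pred_probs, "normalized_margin")
```

Sorting examples by ascending score yields the review queue mentioned above: the lowest-scoring examples are the most likely label errors.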

Step 5: Analyze Dataset Health

Generate dataset-level diagnostics including overall label accuracy estimates, per-class issue counts, and a health summary. This provides a bird's-eye view of data quality across the entire dataset and highlights which classes are most affected by label noise.

Key considerations:

  • Dataset health summary reports overall estimated label accuracy
  • Per-class analysis reveals which classes are most confused with each other
  • The confident joint heatmap visualizes the noise structure
  • These statistics inform whether the dataset needs targeted re-labeling or wholesale review
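These diagnostics follow directly from the calibrated confident joint. The sketch below illustrates the idea in numpy; cleanlab.dataset.health_summary produces a much richer report, and the dict keys here are invented for the example.

```python
import numpy as np

def health_summary(labels, pred_probs):
    """Dataset-level diagnostics from a simplified, calibrated confident
    joint: diagonal mass estimates overall label accuracy, and off-diagonal
    row mass gives per-class issue counts. (Illustrative sketch only.)"""
    n, k = pred_probs.shape
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    cj = np.zeros((k, k))
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if len(above):
            cj[labels[i], above[np.argmax(pred_probs[i, above])]] += 1
    # Calibrate: rescale each row to sum to the observed count of that label,
    # so rows match the given-label distribution.
    counts = np.bincount(labels, minlength=k).astype(float)
    for i in range(k):
        if cj[i].sum() > 0:
            cj[i] *= counts[i] / cj[i].sum()
    return {
        "estimated_label_accuracy": np.trace(cj) / n,
        "per_class_issue_counts": cj.sum(axis=1) - np.diag(cj),
    }

labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
)
summary = health_summary(labels, pred_probs)
print(summary["estimated_label_accuracy"])  # 0.8
```

In the toy data, one of the five examples is a suspected error, so estimated label accuracy is 0.8 and class 0 accounts for the single issue.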

Execution Diagram

GitHub URL

Workflow Repository