Principle: Cleanlab Integrated Label Issue Detection
| Metadata | |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end label issue detection that combines cross-validation, confident learning, and quality scoring into a single method call.
Description
Integrated label issue detection is CleanLearning's method for finding mislabeled examples. It internally performs cross-validation to obtain out-of-sample predictions (if not provided), estimates the confident joint, applies a filtering strategy to identify label issues, and computes quality scores for each example. All results are returned as a DataFrame with is_label_issue, label_quality, given_label, and predicted_label columns.
The pipeline proceeds in the following stages:
- Stage 1 -- Out-of-sample predictions: If `pred_probs` are not provided, the method runs stratified K-fold cross-validation using the wrapped classifier. Each example receives a predicted probability vector from a model that was not trained on that example, avoiding the overfitting bias that would occur if the model predicted on its own training data.
- Stage 2 -- Confident joint estimation: Using the out-of-sample `pred_probs` and the given labels, the method estimates the confident joint matrix C, which counts how many examples are confidently classified as having true label i while being given label j. From this, the noise matrices are derived.
- Stage 3 -- Label issue filtering: The configured filter strategy (e.g., `prune_by_noise_rate`, `prune_by_class`, `both`, `confident_learning`) is applied to identify which examples are likely mislabeled.
- Stage 4 -- Quality scoring: Each example is assigned a label quality score between 0 and 1, where lower scores indicate a higher likelihood of being mislabeled. This enables ranking and prioritization for human review.
- Stage 5 -- Result aggregation: All results are combined into a single `pd.DataFrame` with boolean issue flags, quality scores, given labels, and predicted labels.
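The stages above can be sketched in plain NumPy. This is a simplified illustration, not cleanlab's actual implementation: it assumes `pred_probs` were already computed out-of-sample (Stage 1), uses a crude "confident prediction disagrees with given label" filter for Stage 3, and uses self-confidence as the Stage 4 quality score.

```python
import numpy as np

def find_label_issues_sketch(labels, pred_probs):
    """Illustrative confident-learning pipeline (not cleanlab's implementation)."""
    n, num_classes = pred_probs.shape
    # Stage 2a: per-class thresholds -- mean predicted probability of
    # class i among examples whose given label is i.
    thresholds = np.array(
        [pred_probs[labels == i, i].mean() for i in range(num_classes)]
    )
    # Stage 2b: confident joint C[i, j] counts examples given label j
    # that are confidently predicted as class i.
    confident_joint = np.zeros((num_classes, num_classes), dtype=int)
    is_issue = np.zeros(n, dtype=bool)
    for idx in range(n):
        probs, j = pred_probs[idx], labels[idx]
        # Classes for which this example exceeds the class threshold.
        confident = np.where(probs >= thresholds)[0]
        if confident.size:
            i = confident[np.argmax(probs[confident])]
            confident_joint[i, j] += 1
            # Stage 3 (crude filter): confidently predicted class
            # disagrees with the given label.
            is_issue[idx] = bool(i != j)
    # Stage 4: self-confidence score -- probability assigned to the given label.
    quality = pred_probs[np.arange(n), labels]
    return is_issue, quality, confident_joint
```

Stage 5 would then assemble these arrays into a DataFrame alongside the given and predicted labels.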
Usage
Use when you want to find label issues as part of a CleanLearning pipeline, especially before calling fit() to train on cleaned data. This method can also be used standalone for data auditing and label quality assessment.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

cl = CleanLearning(clf=LogisticRegression())
label_issues_df = cl.find_label_issues(X, labels)

# Inspect the most likely mislabeled examples
worst_examples = label_issues_df.sort_values("label_quality").head(20)
```
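A common follow-up is to build a review queue from the returned DataFrame: keep only flagged rows and order them worst-first. A small sketch with hypothetical values in the four columns described above:

```python
import pandas as pd

# Hypothetical results with the columns find_label_issues returns.
label_issues_df = pd.DataFrame({
    "is_label_issue": [False, True, False, True],
    "label_quality": [0.95, 0.12, 0.88, 0.30],
    "given_label": [0, 1, 1, 0],
    "predicted_label": [0, 0, 1, 1],
})

# Flagged examples only, lowest quality (most suspicious) first.
review_queue = (
    label_issues_df[label_issues_df["is_label_issue"]]
    .sort_values("label_quality")
)
```

Ranking by `label_quality` rather than reviewing in dataset order concentrates human effort on the examples most likely to be mislabeled.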
Theoretical Basis
Pipeline composition:
- If `pred_probs` are not provided, run stratified K-fold cross-validation to get out-of-sample predictions.
- Estimate the confident joint Cij and derive the noise matrices.
- Apply `filter.find_label_issues` with the configured strategy to produce boolean `is_label_issue` flags.
- Compute `rank.get_label_quality_scores` for each example.
- Combine the results into a structured DataFrame.
The confident joint C is estimated by thresholding the predicted probabilities per class. For each example x with given label y = j, if the model's predicted probability for class i exceeds a class-specific threshold ti, the example is counted in Cij. The threshold ti is the average predicted probability of class i among examples given label i:

ti = (1 / |X_{y=i}|) * Σ_{x ∈ X_{y=i}} p̂(y = i; x)
This ensures the thresholds adapt to each class's prediction confidence distribution.
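A tiny worked example of the threshold rule, with illustrative probability values:

```python
import numpy as np

# Three examples given label 0, two given label 1 (illustrative values).
labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.8, 0.2],
    [0.4, 0.6],
    [0.2, 0.8],
])

# ti = mean predicted probability of class i among examples given label i.
t0 = pred_probs[labels == 0, 0].mean()  # (0.9 + 0.7 + 0.8) / 3 = 0.8
t1 = pred_probs[labels == 1, 1].mean()  # (0.6 + 0.8) / 2 = 0.7

# A new example given label j = 1 with p(class 0) = 0.9 >= t0 would be
# counted in C[0, 1]: confidently class 0, yet labeled 1.
```

Because a confidently-predicted class 0 has a higher bar (t0 = 0.8) than class 1 (t1 = 0.7), each class's counting rule reflects how confident the model typically is on that class.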