Implementation: Cleanlab CleanLearning Find Label Issues
| Field | Value |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
`CleanLearning.find_label_issues` performs end-to-end label issue detection by combining cross-validation, confident learning, and quality scoring into a single method call.
Description
The `find_label_issues` method orchestrates the full label issue detection pipeline within the `CleanLearning` wrapper. It accepts training features and labels, optionally pre-computed predicted probabilities, and returns a structured DataFrame identifying which examples are likely mislabeled.
The method proceeds through the following internal stages:
- Out-of-sample prediction estimation: If `pred_probs` is not provided, the method performs stratified K-fold cross-validation (using `cv_n_folds` from initialization) to compute out-of-sample predicted probabilities. Each fold trains the wrapped classifier on the training portion and predicts on the held-out portion.
- Confident joint estimation: The method estimates the confident joint matrix using `cleanlab.count.compute_confident_joint`, which counts examples that are confidently assigned to each (given_label, true_label) pair based on per-class thresholds.
- Noise matrix computation: From the confident joint, the noise matrix (probability of given label given true label) and inverse noise matrix (probability of true label given observed label) are derived.
- Label issue identification: The method calls `cleanlab.filter.find_label_issues` with the configured filter strategy to produce a boolean mask of detected label issues.
- Quality score computation: Each example receives a label quality score via `cleanlab.rank.get_label_quality_scores`, providing a continuous measure of label trustworthiness.
- Result assembly: All outputs are combined into a single `pd.DataFrame` indexed to match the input data.
The method also stores intermediate results on the instance: `self.confident_joint`, `self.noise_matrix`, `self.inverse_noise_matrix`, and `self.pred_probs`.
Usage
Call `find_label_issues` on a `CleanLearning` instance to detect mislabeled examples. This can be used standalone for data auditing or as a precursor to `fit()`.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

cl = CleanLearning(clf=LogisticRegression())
label_issues_df = cl.find_label_issues(X, labels)

# Filter to only mislabeled examples, sorted by quality (lowest first)
mislabeled = label_issues_df[label_issues_df["is_label_issue"]].sort_values("label_quality")
print(f"Found {len(mislabeled)} label issues")
```
Code Reference
Source Location
- Repository: `cleanlab/cleanlab`
- File: `cleanlab/classification.py`
- Lines: 675–947
Signature
```python
def find_label_issues(
    self,
    X=None,
    labels=None,
    *,
    pred_probs=None,
    thresholds=None,
    noise_matrix=None,
    inverse_noise_matrix=None,
    save_space=False,
    clf_kwargs={},
    validation_func=None,
) -> pd.DataFrame
```
Import
```python
from cleanlab.classification import CleanLearning
# find_label_issues is a method of a CleanLearning instance
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `X` | array-like (N, M) | Conditional | Feature matrix. Required if `pred_probs` is not provided. |
| `labels` | np.ndarray (N,) | Yes | Array of given (potentially noisy) integer class labels. |
| `pred_probs` | Optional[np.ndarray] (N, K) | No | Pre-computed out-of-sample predicted probabilities. If provided, cross-validation is skipped. |
| `thresholds` | Optional[np.ndarray] (K,) | No | Per-class thresholds for confident learning. Auto-computed if not provided. |
| `noise_matrix` | Optional[np.ndarray] (K, K) | No | Pre-computed noise matrix. Estimated from data if not provided. |
| `inverse_noise_matrix` | Optional[np.ndarray] (K, K) | No | Pre-computed inverse noise matrix. Estimated from data if not provided. |
| `save_space` | bool | No | If True, deletes intermediate data to reduce memory usage. |
| `clf_kwargs` | dict | No | Additional keyword arguments passed to the classifier's `fit()` during cross-validation. |
| `validation_func` | Optional[callable] | No | Optional validation function called after cross-validation to verify model quality. |
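When `thresholds` is left as `None`, confident learning derives each class's threshold as the average predicted probability of that class over the examples given that class label. A minimal NumPy illustration of that rule (toy data, not cleanlab code):

```python
import numpy as np

# Toy data: 5 examples, 2 classes.
labels = np.array([0, 0, 1, 1, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.6, 0.4],
                       [0.3, 0.7],
                       [0.2, 0.8],
                       [0.4, 0.6]])

# t_j = mean predicted probability of class j over examples labeled j.
thresholds = np.array([pred_probs[labels == j, j].mean()
                       for j in range(pred_probs.shape[1])])
print(thresholds)  # t_0 = mean(0.9, 0.6) = 0.75, t_1 = mean(0.7, 0.8, 0.6) = 0.7
```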
Outputs
| Column | Type | Description |
|---|---|---|
| `is_label_issue` | bool | Whether the example is identified as having a label issue. |
| `label_quality` | float | Quality score between 0 and 1. Lower values indicate more likely label issues. |
| `given_label` | int | The original (potentially noisy) label provided in the input. |
| `predicted_label` | int | The label predicted by the model based on the features. |

The returned `pd.DataFrame` has one row per input example, indexed to match the input data.
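A simplified sketch of how such a DataFrame could be assembled from `labels` and `pred_probs`. The issue rule here (argmax disagreement plus a self-confidence cutoff of 0.5) is a hypothetical stand-in for cleanlab's actual filter logic:

```python
import numpy as np
import pandas as pd

# Toy inputs: 4 examples, 3 classes.
labels = np.array([0, 2, 1, 2])
pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],   # given label 2, but model favors class 1
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7]])

predicted_label = pred_probs.argmax(axis=1)
label_quality = pred_probs[np.arange(len(labels)), labels]  # self-confidence
# Hypothetical issue rule: model disagrees AND the given label looks implausible.
is_label_issue = (predicted_label != labels) & (label_quality < 0.5)

df = pd.DataFrame({
    "is_label_issue": is_label_issue,
    "label_quality": label_quality,
    "given_label": labels,
    "predicted_label": predicted_label,
})
print(df)  # only row 1 is flagged as a label issue
```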
Usage Examples
Basic Label Issue Detection
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import RandomForestClassifier
import numpy as np

cl = CleanLearning(clf=RandomForestClassifier(n_estimators=100), seed=42)
issues_df = cl.find_label_issues(X_train, labels=y_train)

# Count label issues
n_issues = issues_df["is_label_issue"].sum()
print(f"Detected {n_issues} label issues out of {len(y_train)} examples")
```
With Pre-computed Predicted Probabilities
```python
from cleanlab.classification import CleanLearning
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Compute pred_probs separately
clf = LogisticRegression()
pred_probs = cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")

# Use pre-computed pred_probs (skips internal cross-validation)
cl = CleanLearning()
issues_df = cl.find_label_issues(X_train, labels=y_train, pred_probs=pred_probs)
```
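To see the pre-computed `pred_probs` workflow end to end, the following self-contained sketch injects known label flips into synthetic data and flags them with a simple stand-in rule (argmax disagreement plus an arbitrary self-confidence cutoff of 0.3) instead of cleanlab's confident-learning filter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, n_informative=8,
                           n_classes=2, random_state=0)

# Inject label noise: flip 20 known labels.
flipped = rng.choice(len(y), size=20, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-sample probabilities, as CleanLearning would compute internally.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                               cv=5, method="predict_proba")

# Hypothetical stand-in for the confident-learning filter.
self_conf = pred_probs[np.arange(len(y_noisy)), y_noisy]
suspect = (pred_probs.argmax(axis=1) != y_noisy) & (self_conf < 0.3)

recovered = np.intersect1d(np.flatnonzero(suspect), flipped)
print(f"{len(recovered)} of {len(flipped)} injected flips flagged")
```

Passing these `pred_probs` to `cl.find_label_issues(labels=y_noisy, pred_probs=pred_probs)` would apply cleanlab's actual filter to the same probabilities.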