Implementation: Cleanlab CleanLearning Find Label Issues
| Field | Value |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
`CleanLearning.find_label_issues` performs end-to-end label issue detection by combining cross-validation, confident learning, and quality scoring into a single method call.
Description
The `find_label_issues` method orchestrates the full label issue detection pipeline within the `CleanLearning` wrapper. It accepts training features and labels, optionally pre-computed predicted probabilities, and returns a structured DataFrame identifying which examples are likely mislabeled.
The method proceeds through the following internal stages:
- Out-of-sample prediction estimation: If `pred_probs` is not provided, the method performs stratified K-fold cross-validation (using `cv_n_folds` from initialization) to compute out-of-sample predicted probabilities. Each fold trains the wrapped classifier on the training portion and predicts on the held-out portion.
- Confident joint estimation: The method estimates the confident joint matrix using `cleanlab.count.compute_confident_joint`, which counts examples that are confidently assigned to each (given_label, true_label) pair based on per-class thresholds.
- Noise matrix computation: From the confident joint, the noise matrix (probability of given label given true label) and inverse noise matrix (probability of true label given observed label) are derived.
- Label issue identification: The method calls `cleanlab.filter.find_label_issues` with the configured filter strategy to produce a boolean mask of detected label issues.
- Quality score computation: Each example receives a label quality score via `cleanlab.rank.get_label_quality_scores`, providing a continuous measure of label trustworthiness.
- Result assembly: All outputs are combined into a single `pd.DataFrame` indexed to match the input data.
The method also stores intermediate results on the instance: `self.confident_joint`, `self.noise_matrix`, `self.inverse_noise_matrix`, and `self.pred_probs`.
Usage
Call `find_label_issues` on a `CleanLearning` instance to detect mislabeled examples. This can be used standalone for data auditing or as a precursor to `fit()`.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

cl = CleanLearning(clf=LogisticRegression())
label_issues_df = cl.find_label_issues(X, labels)

# Filter to only mislabeled examples, sorted by quality (lowest first)
mislabeled = label_issues_df[label_issues_df["is_label_issue"]].sort_values("label_quality")
print(f"Found {len(mislabeled)} label issues")
```
Code Reference
Source Location
- Repository: `cleanlab/cleanlab`
- File: `cleanlab/classification.py`
- Lines: 675–947
Signature
```python
def find_label_issues(
    self,
    X=None,
    labels=None,
    *,
    pred_probs=None,
    thresholds=None,
    noise_matrix=None,
    inverse_noise_matrix=None,
    save_space=False,
    clf_kwargs={},
    validation_func=None,
) -> pd.DataFrame
```
Import
```python
from cleanlab.classification import CleanLearning
# find_label_issues is a method of a CleanLearning instance
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `X` | array-like (N, M) | Conditional | Feature matrix. Required if `pred_probs` is not provided. |
| `labels` | np.ndarray (N,) | Yes | Array of given (potentially noisy) integer class labels. |
| `pred_probs` | Optional[np.ndarray] (N, K) | No | Pre-computed out-of-sample predicted probabilities. If provided, cross-validation is skipped. |
| `thresholds` | Optional[np.ndarray] (K,) | No | Per-class thresholds for confident learning. Auto-computed if not provided. |
| `noise_matrix` | Optional[np.ndarray] (K, K) | No | Pre-computed noise matrix. Estimated from data if not provided. |
| `inverse_noise_matrix` | Optional[np.ndarray] (K, K) | No | Pre-computed inverse noise matrix. Estimated from data if not provided. |
| `save_space` | bool | No | If True, deletes intermediate data to reduce memory usage. |
| `clf_kwargs` | dict | No | Additional keyword arguments passed to the classifier's `fit()` during cross-validation. |
| `validation_func` | Optional[callable] | No | Optional validation function called after cross-validation to verify model quality. |
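When `thresholds` is left as `None`, confident learning derives each class's threshold as the average predicted probability of that class over the examples given that class label. A minimal NumPy illustration of that rule (toy data, not cleanlab code):

```python
import numpy as np

# Toy data: 5 examples, 2 classes.
labels = np.array([0, 0, 1, 1, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.6, 0.4],
                       [0.3, 0.7],
                       [0.2, 0.8],
                       [0.4, 0.6]])

# t_j = mean predicted probability of class j over examples labeled j.
thresholds = np.array([pred_probs[labels == j, j].mean()
                       for j in range(pred_probs.shape[1])])
print(thresholds)  # t_0 = mean(0.9, 0.6) = 0.75, t_1 = mean(0.7, 0.8, 0.6) = 0.7
```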
Outputs
| Column | Type | Description |
|---|---|---|
| `is_label_issue` | bool | Whether the example is identified as having a label issue. |
| `label_quality` | float | Quality score between 0 and 1. Lower values indicate more likely label issues. |
| `given_label` | int | The original (potentially noisy) label provided in the input. |
| `predicted_label` | int | The label predicted by the model based on the features. |

The returned `pd.DataFrame` has one row per input example, indexed to match the input data.
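A simplified sketch of how such a DataFrame could be assembled from `labels` and `pred_probs`. The issue rule here (argmax disagreement plus a self-confidence cutoff of 0.5) is a hypothetical stand-in for cleanlab's actual filter logic:

```python
import numpy as np
import pandas as pd

# Toy inputs: 4 examples, 3 classes.
labels = np.array([0, 2, 1, 2])
pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],   # given label 2, but model favors class 1
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7]])

predicted_label = pred_probs.argmax(axis=1)
label_quality = pred_probs[np.arange(len(labels)), labels]  # self-confidence
# Hypothetical issue rule: model disagrees AND the given label looks implausible.
is_label_issue = (predicted_label != labels) & (label_quality < 0.5)

df = pd.DataFrame({
    "is_label_issue": is_label_issue,
    "label_quality": label_quality,
    "given_label": labels,
    "predicted_label": predicted_label,
})
print(df)  # only row 1 is flagged as a label issue
```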
Usage Examples
Basic Label Issue Detection
```python
from cleanlab.classification import CleanLearning
from sklearn.ensemble import RandomForestClassifier
import numpy as np

cl = CleanLearning(clf=RandomForestClassifier(n_estimators=100), seed=42)
issues_df = cl.find_label_issues(X_train, labels=y_train)

# Count label issues
n_issues = issues_df["is_label_issue"].sum()
print(f"Detected {n_issues} label issues out of {len(y_train)} examples")
```
With Pre-computed Predicted Probabilities
```python
from cleanlab.classification import CleanLearning
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Compute pred_probs separately
clf = LogisticRegression()
pred_probs = cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")

# Use pre-computed pred_probs (skips internal cross-validation)
cl = CleanLearning()
issues_df = cl.find_label_issues(X_train, labels=y_train, pred_probs=pred_probs)
```
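To see the pre-computed `pred_probs` workflow end to end, the following self-contained sketch injects known label flips into synthetic data and flags them with a simple stand-in rule (argmax disagreement plus an arbitrary self-confidence cutoff of 0.3) instead of cleanlab's confident-learning filter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, n_informative=8,
                           n_classes=2, random_state=0)

# Inject label noise: flip 20 known labels.
flipped = rng.choice(len(y), size=20, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-sample probabilities, as CleanLearning would compute internally.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                               cv=5, method="predict_proba")

# Hypothetical stand-in for the confident-learning filter.
self_conf = pred_probs[np.arange(len(y_noisy)), y_noisy]
suspect = (pred_probs.argmax(axis=1) != y_noisy) & (self_conf < 0.3)

recovered = np.intersect1d(np.flatnonzero(suspect), flipped)
print(f"{len(recovered)} of {len(flipped)} injected flips flagged")
```

Passing these `pred_probs` to `cl.find_label_issues(labels=y_noisy, pred_probs=pred_probs)` would apply cleanlab's actual filter to the same probabilities.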