Implementation:Cleanlab CleanLearning Find Label Issues

From Leeroopedia


Field Value
Sources Confident Learning, Cleanlab
Domains Machine_Learning, Data_Quality
Last Updated 2026-02-09 12:00 GMT

Overview

CleanLearning.find_label_issues performs end-to-end label issue detection by combining cross-validation, confident learning, and quality scoring into a single method call.

Description

The find_label_issues method orchestrates the full label issue detection pipeline within the CleanLearning wrapper. It accepts training features and labels (and, optionally, pre-computed predicted probabilities) and returns a structured DataFrame identifying which examples are likely mislabeled.

The method proceeds through the following internal stages:

  1. Out-of-sample prediction estimation: If pred_probs is not provided, the method performs stratified K-fold cross-validation (using cv_n_folds from initialization) to compute out-of-sample predicted probabilities. Each fold trains the wrapped classifier on the training portion and predicts on the held-out portion.
  2. Confident joint estimation: The method estimates the confident joint matrix using cleanlab.count.compute_confident_joint, which counts examples that are confidently assigned to each (given_label, true_label) pair based on per-class thresholds.
  3. Noise matrix computation: From the confident joint, the noise matrix (probability of given label given true label) and inverse noise matrix (probability of true label given observed label) are derived.
  4. Label issue identification: The method calls cleanlab.filter.find_label_issues with the configured filter strategy to produce a boolean mask of detected label issues.
  5. Quality score computation: Each example receives a label quality score via cleanlab.rank.get_label_quality_scores, providing a continuous measure of label trustworthiness.
  6. Result assembly: All outputs are combined into a single pd.DataFrame indexed to match the input data.
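Stages 1, 2, and 4 above can be illustrated with a toy numpy sketch of the confident-joint idea: compute per-class confidence thresholds, count each example toward the (given_label, estimated_true_label) cell it confidently belongs to, and treat confidently off-diagonal examples as candidate label issues. This is a deliberately simplified illustration, not cleanlab's exact implementation (which handles calibration, ties, and multi-label cases).

```python
import numpy as np

def toy_confident_joint(labels, pred_probs):
    """Simplified confident joint: rows = given label, cols = estimated true label."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average self-confidence among examples given that class
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    cj = np.zeros((n_classes, n_classes), dtype=int)
    for given, probs in zip(labels, pred_probs):
        qualifying = probs >= thresholds
        if qualifying.any():
            # Most likely class among those clearing their threshold
            true = int(np.argmax(np.where(qualifying, probs, -1.0)))
            cj[given, true] += 1
    return cj
```

Examples counted into off-diagonal cells (given label differs from the confidently estimated true label) are the ones the later filtering stage flags as likely mislabeled.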

The method also stores intermediate results on the instance: self.confident_joint, self.noise_matrix, self.inverse_noise_matrix, and self.pred_probs.
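Ignoring cleanlab's calibration steps, the two matrices stored on the instance can be derived from the confident joint by simple normalization, as in this toy numpy sketch:

```python
import numpy as np

# Toy confident joint: rows = given label, columns = estimated true label
cj = np.array([[80.0, 5.0],
               [10.0, 60.0]])

# Noise matrix P(given_label | true_label): normalize each column
noise_matrix = cj / cj.sum(axis=0, keepdims=True)

# Inverse noise matrix P(true_label | given_label): normalize each row, then transpose
inverse_noise_matrix = (cj / cj.sum(axis=1, keepdims=True)).T
```

Each column of both matrices sums to 1, since each is a conditional probability distribution over labels.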

Usage

Call find_label_issues on a CleanLearning instance to detect mislabeled examples. This can be used standalone for data auditing or as a precursor to fit().

from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

cl = CleanLearning(clf=LogisticRegression())
label_issues_df = cl.find_label_issues(X, labels)

# Filter to only mislabeled examples, sorted by quality (lowest quality first)
mislabeled = label_issues_df[label_issues_df["is_label_issue"]].sort_values("label_quality")
print(f"Found {len(mislabeled)} label issues")

Code Reference

Source Location

Repository
cleanlab/cleanlab
File
cleanlab/classification.py
Lines
675--947

Signature

def find_label_issues(
    self,
    X=None,
    labels=None,
    *,
    pred_probs=None,
    thresholds=None,
    noise_matrix=None,
    inverse_noise_matrix=None,
    save_space=False,
    clf_kwargs={},
    validation_func=None,
) -> pd.DataFrame

Import

from cleanlab.classification import CleanLearning
# find_label_issues is a method of a CleanLearning instance

I/O Contract

Inputs

Name Type Required Description
X array-like (N, M) Conditional Feature matrix. Required if pred_probs is not provided.
labels np.ndarray (N,) Yes Array of given (potentially noisy) integer class labels.
pred_probs Optional[np.ndarray] (N, K) No Pre-computed out-of-sample predicted probabilities. If provided, cross-validation is skipped.
thresholds Optional[np.ndarray] (K,) No Per-class thresholds for confident learning. Auto-computed if not provided.
noise_matrix Optional[np.ndarray] (K, K) No Pre-computed noise matrix. Estimated from data if not provided.
inverse_noise_matrix Optional[np.ndarray] (K, K) No Pre-computed inverse noise matrix. Estimated from data if not provided.
save_space bool No If True, deletes intermediate data to reduce memory usage.
clf_kwargs dict No Additional keyword arguments passed to the classifier's fit() during cross-validation.
validation_func Optional[callable] No Optional validation function called after cross-validation to verify model quality.

Outputs

Column Type Description
is_label_issue bool Whether the example is identified as having a label issue.
label_quality float Quality score between 0 and 1. Lower values indicate more likely label issues.
given_label int The original (potentially noisy) label provided in the input.
predicted_label int The label predicted by the model based on the features.

The returned pd.DataFrame has one row per input example, indexed to match the input data.
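A small illustration of working with these columns, using a hand-built DataFrame shaped like the table above (the values are fabricated for the example): filter to flagged rows, then cross-tabulate given versus predicted labels to see which class confusions dominate.

```python
import pandas as pd

# Hypothetical output with the four documented columns
issues_df = pd.DataFrame({
    "is_label_issue":  [False, True, False, True],
    "label_quality":   [0.92, 0.08, 0.85, 0.15],
    "given_label":     [0, 1, 1, 0],
    "predicted_label": [0, 0, 1, 1],
})

# Rows flagged as likely mislabeled
flagged = issues_df[issues_df["is_label_issue"]]

# Which (given, predicted) label pairs account for the flagged examples
confusion = pd.crosstab(flagged["given_label"], flagged["predicted_label"])
```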

Usage Examples

Basic Label Issue Detection

from cleanlab.classification import CleanLearning
from sklearn.ensemble import RandomForestClassifier
import numpy as np

cl = CleanLearning(clf=RandomForestClassifier(n_estimators=100), seed=42)
issues_df = cl.find_label_issues(X_train, labels=y_train)

# Count label issues
n_issues = issues_df["is_label_issue"].sum()
print(f"Detected {n_issues} label issues out of {len(y_train)} examples")

With Pre-computed Predicted Probabilities

from cleanlab.classification import CleanLearning
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Compute pred_probs separately
clf = LogisticRegression()
pred_probs = cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")

# Use pre-computed pred_probs (skips internal cross-validation)
cl = CleanLearning()
issues_df = cl.find_label_issues(X_train, labels=y_train, pred_probs=pred_probs)
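Because passing pred_probs skips the internal cross-validation, the array must already be out-of-sample and well-formed. A quick sanity check (the helper name is ours, not part of cleanlab) might look like:

```python
import numpy as np

def check_pred_probs(pred_probs, labels):
    """Basic validity checks before passing pred_probs to find_label_issues."""
    pred_probs = np.asarray(pred_probs)
    labels = np.asarray(labels)
    assert pred_probs.ndim == 2, "expected shape (N, K)"
    assert pred_probs.shape[0] == labels.shape[0], "one row per labeled example"
    assert labels.max() < pred_probs.shape[1], "labels must index columns of pred_probs"
    assert np.allclose(pred_probs.sum(axis=1), 1.0), "rows must be probability distributions"
```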
