
Implementation:Cleanlab Find Label Issues

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Quality
Last Updated 2026-02-09 19:00 GMT

Overview

A concrete tool for identifying mislabeled examples in a dataset using the confident learning filtering strategies provided by the Cleanlab library.

Description

This function takes noisy labels and out-of-sample predicted probabilities and returns either a boolean mask or sorted indices indicating which examples are estimated to have label issues. It supports 7 different filtering strategies via the filter_by parameter. Internally, it computes the confident joint (if not provided), estimates per-class noise rates, computes label quality scores, and applies the selected filtering strategy to identify the estimated number of label errors. The function also supports multi-label classification, parallel execution via n_jobs, and fine-grained control over the number of issues to flag per class.

Usage

Import and use this function as the primary entry point for detecting label issues in your dataset. You need out-of-sample predicted probabilities (obtained via cross-validation or a held-out model) and the noisy labels. This is the most commonly used function in cleanlab for identifying specific mislabeled examples.

Code Reference

Source Location

  • Repository: cleanlab
  • File: cleanlab/filter.py
  • Lines: 57-451

Signature

def find_label_issues(
    labels,
    pred_probs,
    *,
    return_indices_ranked_by=None,
    rank_by_kwargs=None,
    filter_by="prune_by_noise_rate",
    frac_noise=1.0,
    num_to_remove_per_class=None,
    min_examples_per_class=1,
    confident_joint=None,
    n_jobs=None,
    verbose=False,
    multi_label=False,
) -> np.ndarray

Import

from cleanlab.filter import find_label_issues

I/O Contract

Inputs

Name Type Required Description
labels LabelLike Yes Array of noisy class labels of shape (N,) with integer values 0..K-1.
pred_probs np.ndarray Yes Out-of-sample predicted probability matrix of shape (N, K).
return_indices_ranked_by Optional[str] No If set (e.g., "self_confidence", "normalized_margin", "confidence_weighted_entropy"), returns sorted indices instead of a boolean mask. The indices are sorted by the chosen quality score in ascending order (most likely mislabeled first).
rank_by_kwargs Optional[dict] No Additional keyword arguments passed to the ranking/scoring method.
filter_by str No Filtering strategy. One of "prune_by_noise_rate" (default), "prune_by_class", "both", "confident_learning", "predicted_neq_given", "low_normalized_margin", "low_self_confidence".
frac_noise float No Fraction of the estimated noise to flag, in (0, 1]. 1.0 (default) flags the full estimated number of issues; values below 1.0 flag only the top fraction. Applies to the "prune_by_noise_rate", "prune_by_class", and "both" strategies.
num_to_remove_per_class Optional[list] No Explicit per-class counts of issues to flag, overriding the estimated counts.
min_examples_per_class int No Minimum number of examples to retain per class after removing issues. Defaults to 1.
confident_joint Optional[np.ndarray] No Pre-computed confident joint of shape (K, K). If None, computed internally.
n_jobs Optional[int] No Number of parallel processes used for the pruning computation. Defaults to None, which uses all available CPU cores; set to 1 to disable parallel processing.
verbose bool No If True, print progress information. Defaults to False.
multi_label bool No If True, handle multi-label classification. Defaults to False.

Outputs

Name Type Description
label_issues np.ndarray If return_indices_ranked_by is None: boolean mask of shape (N,) where True indicates a detected label issue. If return_indices_ranked_by is set: array of integer indices sorted by the chosen quality score ascending (most likely mislabeled first).

Usage Examples

Basic Usage (Boolean Mask)

import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 1, 1, 2, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],   # labeled 0 but model predicts 1
    [0.1, 0.8, 0.1],
    [0.05, 0.1, 0.85],  # labeled 1 but model predicts 2
    [0.1, 0.1, 0.8],
    [0.05, 0.05, 0.9],
])

issue_mask = find_label_issues(labels, pred_probs)
print("Label issues detected at indices:", np.where(issue_mask)[0])

Ranked Indices

from cleanlab.filter import find_label_issues

# labels and pred_probs as defined in the Basic Usage example.
# Get indices sorted by self_confidence (most likely mislabeled first)
ranked_issues = find_label_issues(
    labels, pred_probs,
    return_indices_ranked_by="self_confidence",
)
print("Issues ranked by severity:", ranked_issues)

Using Different Filtering Strategies

from cleanlab.filter import find_label_issues

# labels and pred_probs as defined in the Basic Usage example.
# Conservative approach: "both" flags only the intersection of the
# "prune_by_noise_rate" and "prune_by_class" strategies
issue_mask = find_label_issues(
    labels, pred_probs,
    filter_by="both",
)

# Flag fewer issues by scaling down the noise estimate
issue_mask = find_label_issues(
    labels, pred_probs,
    filter_by="prune_by_noise_rate",
    frac_noise=0.5,  # keep only the top half of estimated issues
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
