Implementation:Cleanlab Cleanlab TC Find Label Issues



API token_classification.filter.find_label_issues
Source cleanlab/token_classification/filter.py:L15-101
Domains Machine_Learning, Data_Quality, NLP
Last Updated 2026-02-09

Overview

Implementation of token-level label issue detection for sequence labeling tasks. Identifies specific (sentence_index, token_index) pairs where the token label is likely incorrect, ordered by likelihood of being mislabeled.

Description

This function identifies mislabeled tokens in a token classification dataset by applying cleanlab's label issue detection to the flattened token-level data. It:

  1. Flattens all token labels and predicted probabilities across sentences into unified arrays.
  2. Applies standard label issue detection (confident learning) on the flattened data.
  3. Maps flagged indices back to (sentence_index, token_index) tuples.
  4. Returns results ordered by likelihood of mislabeling (most likely errors first).
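The flatten and map-back steps above can be sketched in plain Python. This is a minimal illustration rather than cleanlab's actual code: the helper names `flatten_tokens` and `flat_to_sentence_token` are hypothetical, the flagged flat indices are made up to stand in for the detection step, and nested lists stand in for the NumPy arrays the real function expects.

```python
from typing import List, Sequence, Tuple


def flatten_tokens(
    labels: Sequence[Sequence[int]],
    pred_probs: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], List[List[float]]]:
    """Concatenate per-sentence labels and probability rows into flat lists."""
    flat_labels = [lab for sentence in labels for lab in sentence]
    flat_probs = [list(row) for sentence in pred_probs for row in sentence]
    return flat_labels, flat_probs


def flat_to_sentence_token(
    labels: Sequence[Sequence[int]],
    flat_indices: Sequence[int],
) -> List[Tuple[int, int]]:
    """Map flat token positions back to (sentence_index, token_index) pairs."""
    mapping = [
        (i, j) for i, sentence in enumerate(labels) for j in range(len(sentence))
    ]
    return [mapping[k] for k in flat_indices]


labels = [[0, 1, 2, 0], [0, 0, 1, 0, 0], [1, 2, 0]]
# Placeholder probabilities: one uniform row of K=3 classes per token.
pred_probs = [[[1 / 3] * 3 for _ in sentence] for sentence in labels]

flat_labels, flat_probs = flatten_tokens(labels, pred_probs)

# Hypothetical output of the detection step: flat indices, worst first.
flagged_flat = [6, 9]
issues = flat_to_sentence_token(labels, flagged_flat)
print(issues)  # [(1, 2), (2, 0)]
```

Because the mapping is built from sentence lengths alone, the order of the flagged flat indices carries over unchanged to the returned tuples.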

The function supports a low_memory mode for large datasets, which trades some speed for a smaller memory footprint. Additional keyword arguments are passed through to the underlying cleanlab.filter.find_label_issues function.

Usage

This function is the primary entry point for detecting token-level label issues in sequence labeling tasks. It is typically used after training a token classifier and is commonly paired with the display function for human review of detected issues.

Code Reference

Source Location

cleanlab/token_classification/filter.py, lines 15-101.

Signature

def find_label_issues(
    labels: list,
    pred_probs: list,
    *,
    return_indices_ranked_by: str = "self_confidence",
    low_memory: bool = False,
    **kwargs,
) -> List[Tuple[int, int]]

Import

from cleanlab.token_classification.filter import find_label_issues

I/O Contract

Inputs

| Parameter | Type | Description |
|---|---|---|
| labels | list | List of N lists, where each inner list contains integer class labels for each token in the corresponding sentence. |
| pred_probs | list | List of N numpy arrays, each of shape (T_i, K) where T_i is the number of tokens in sentence i and K is the number of classes. |
| return_indices_ranked_by | str | Method used to rank the returned token indices by likelihood of being mislabeled. Options include "self_confidence" and "normalized_margin". Defaults to "self_confidence". |
| low_memory | bool | If True, uses a more memory-efficient processing strategy for large datasets. Defaults to False. |
| **kwargs | | Additional keyword arguments passed to the underlying filter.find_label_issues function. |
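To give a feel for how the two ranking options can order the same tokens differently, here is a small sketch under stated assumptions. Self-confidence is taken as the predicted probability of the given label, and the margin as that probability minus the largest probability among the other classes. cleanlab's own implementation may rescale the margin, which would not change the ordering. The function names and the data are hypothetical, and the real function ranks only the tokens it has already flagged.

```python
from typing import Callable, List, Sequence


def self_confidence(label: int, probs: Sequence[float]) -> float:
    """Predicted probability of the given label."""
    return probs[label]


def margin(label: int, probs: Sequence[float]) -> float:
    """Probability of the given label minus the best competing class."""
    best_other = max(p for k, p in enumerate(probs) if k != label)
    return probs[label] - best_other


def rank_tokens(
    labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    score: Callable[[int, Sequence[float]], float],
) -> List[int]:
    """Return token positions sorted from lowest score (most suspect) up."""
    return sorted(range(len(labels)), key=lambda i: score(labels[i], probs[i]))


# Two hypothetical flattened tokens with K=3 classes.
flat_labels = [1, 0]
flat_probs = [
    [0.30, 0.40, 0.30],  # given label is the top class, but only barely
    [0.45, 0.50, 0.05],  # given label is narrowly beaten by class 1
]

by_confidence = rank_tokens(flat_labels, flat_probs, self_confidence)
by_margin = rank_tokens(flat_labels, flat_probs, margin)
print(by_confidence)  # [0, 1]
print(by_margin)      # [1, 0]
```

The first token has the lower self-confidence, while the second has the lower margin because a competing class outscores its given label, so the two methods disagree on which token is more suspect.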

Outputs

Type Description
List[Tuple[int, int]] List of (sentence_index, token_index) tuples identifying tokens with likely label issues. Ordered by likelihood of being mislabeled (most likely errors first).
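For manual review it is often convenient to group the returned tuples by sentence and look up the flagged words. The sketch below assumes a hypothetical `tokens` list holding the words of each sentence, plus a made-up `issues` list. `group_issues_by_sentence` is an illustrative helper, not part of cleanlab.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_issues_by_sentence(
    issues: Sequence[Tuple[int, int]],
) -> Dict[int, List[int]]:
    """Collect flagged token indices per sentence, preserving ranked order."""
    grouped: Dict[int, List[int]] = defaultdict(list)
    for sentence_idx, token_idx in issues:
        grouped[sentence_idx].append(token_idx)
    return dict(grouped)


# Hypothetical data: words per sentence and a made-up result list.
tokens = [
    ["Hello", "Ann", "Lee", "."],
    ["We", "met", "Bob", "in", "town"],
    ["Carol", "Jones", "left"],
]
issues = [(1, 2), (2, 0), (1, 0)]

grouped = group_issues_by_sentence(issues)
for sentence_idx, token_indices in grouped.items():
    flagged_words = [tokens[sentence_idx][j] for j in token_indices]
    print(sentence_idx, flagged_words)
```

Sentences appear in the order of their most suspect token, so a reviewer working through `grouped` from the top still sees the worst cases first.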

Usage Examples

import numpy as np
from cleanlab.token_classification.filter import find_label_issues

# Labels for 3 sentences (0=O, 1=B-PER, 2=I-PER)
labels = [
    [0, 1, 2, 0],
    [0, 0, 1, 0, 0],
    [1, 2, 0],
]

# Predicted probabilities (K=3 classes)
pred_probs = [
    np.array([
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.85, 0.1, 0.05],
    ]),
    np.array([
        [0.95, 0.03, 0.02],
        [0.88, 0.07, 0.05],
        [0.3, 0.4, 0.3],   # low confidence in the given label (class 1)
        [0.9, 0.05, 0.05],
        [0.92, 0.04, 0.04],
    ]),
    np.array([
        [0.15, 0.75, 0.1],
        [0.1, 0.2, 0.7],
        [0.8, 0.1, 0.1],
    ]),
]

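# Find token-level label issues
issues = find_label_issues(labels, pred_probs)
# issues is a list of (sentence_index, token_index) tuples, most likely
# mislabeled tokens first, in the form [(1, 2), (2, 0), ...].
# It can be empty when every given label agrees with the model's
# top prediction, as in this toy data.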

# With normalized_margin ranking
issues_margin = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="normalized_margin",
)

# Low memory mode for large datasets
issues_low_mem = find_label_issues(
    labels,
    pred_probs,
    low_memory=True,
)
