Implementation: Cleanlab TC Find Label Issues

| | |
|---|---|
| API | token_classification.filter.find_label_issues |
| Source | cleanlab/token_classification/filter.py:L15-101 |
| Domains | Machine_Learning, Data_Quality, NLP |
| Last Updated | 2026-02-09 |
Overview
Implementation of token-level label issue detection for sequence labeling tasks. Identifies specific (sentence_index, token_index) pairs where the token label is likely incorrect, ordered by likelihood of being mislabeled.
Description
This function identifies mislabeled tokens in a token classification dataset by applying cleanlab's label issue detection to the flattened token-level data. It:
- Flattens all token labels and predicted probabilities across sentences into unified arrays.
- Applies standard label issue detection (confident learning) on the flattened data.
- Maps flagged indices back to (sentence_index, token_index) tuples.
- Returns results ordered by likelihood of mislabeling (most likely errors first).
The function supports a low_memory mode that reduces memory usage on large datasets at the cost of some speed. Additional keyword arguments are passed through to the underlying filter.find_label_issues function.
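The flatten → detect → map-back flow described above can be sketched in plain numpy. The thresholding step below is only a stand-in for cleanlab's confident-learning detection, used so the example stays self-contained:

```python
import numpy as np

# Toy per-sentence labels and predicted probabilities (2 classes).
labels = [[0, 1], [1, 0, 0]]
pred_probs = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),
    np.array([[0.8, 0.2], [0.7, 0.3], [0.95, 0.05]]),  # token (1, 0) looks mislabeled
]

# Step 1: flatten token labels and probabilities across sentences.
flat_labels = np.array([l for sent in labels for l in sent])
flat_probs = np.vstack(pred_probs)

# Step 2: stand-in for confident-learning detection: flag tokens whose
# self-confidence p(given label) falls below 0.5 (illustration only).
self_conf = flat_probs[np.arange(len(flat_labels)), flat_labels]
flagged = np.where(self_conf < 0.5)[0]

# Step 3: map flat indices back to (sentence_index, token_index) pairs.
lengths = [len(sent) for sent in labels]
offsets = np.cumsum([0] + lengths)

def to_pair(i):
    s = np.searchsorted(offsets, i, side="right") - 1
    return (int(s), int(i - offsets[s]))

# Step 4: order flagged tokens by self-confidence, lowest (most suspect) first.
issues = [to_pair(i) for i in flagged[np.argsort(self_conf[flagged])]]
print(issues)  # [(1, 0)]
```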
Usage
This function is the primary entry point for detecting token-level label issues in sequence labeling tasks. It is typically used after training a token classifier and is commonly paired with the display function for human review of detected issues.
Code Reference
Source Location
cleanlab/token_classification/filter.py, lines 15-101.
Signature
```python
def find_label_issues(
    labels: list,
    pred_probs: list,
    *,
    return_indices_ranked_by: str = "self_confidence",
    low_memory: bool = False,
    **kwargs,
) -> List[Tuple[int, int]]
```
Import
```python
from cleanlab.token_classification.filter import find_label_issues
```
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| `labels` | `list` | List of N lists, where each inner list contains integer class labels for each token in the corresponding sentence. |
| `pred_probs` | `list` | List of N numpy arrays, each of shape `(T_i, K)` where `T_i` is the number of tokens in sentence i and K is the number of classes. |
| `return_indices_ranked_by` | `str` | Method used to rank the returned token indices by likelihood of being mislabeled. Options include `"self_confidence"` and `"normalized_margin"`. Defaults to `"self_confidence"`. |
| `low_memory` | `bool` | If True, uses a more memory-efficient processing strategy for large datasets. Defaults to False. |
| `**kwargs` | | Additional keyword arguments passed to the underlying `filter.find_label_issues` function. |
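Given this contract, a caller can sanity-check inputs before invoking the function. The `check_token_inputs` helper below is hypothetical (not part of cleanlab's API), shown only to make the shape requirements concrete:

```python
import numpy as np

def check_token_inputs(labels, pred_probs):
    """Hypothetical helper: verify labels/pred_probs satisfy the contract above."""
    assert len(labels) == len(pred_probs), "need one pred_probs array per sentence"
    K = pred_probs[0].shape[1]
    for i, (sent, probs) in enumerate(zip(labels, pred_probs)):
        # Each array must have one row per token and K columns.
        assert probs.shape == (len(sent), K), f"sentence {i}: shape mismatch"
        # Labels must be valid class indices.
        assert all(0 <= l < K for l in sent), f"sentence {i}: label out of range"
        # Each row should be a probability distribution over the K classes.
        assert np.allclose(probs.sum(axis=1), 1.0), f"sentence {i}: rows must sum to 1"

labels = [[0, 1], [1, 0, 0]]
pred_probs = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),
    np.array([[0.3, 0.7], [0.6, 0.4], [0.8, 0.2]]),
]
check_token_inputs(labels, pred_probs)  # passes silently
```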
Outputs
| Type | Description |
|---|---|
| `List[Tuple[int, int]]` | List of (sentence_index, token_index) tuples identifying tokens with likely label issues, ordered by likelihood of being mislabeled (most likely errors first). |
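The returned pairs index directly back into the per-sentence inputs. A small sketch of consuming them (the `issues` list here is hand-made for illustration, not real output):

```python
# Per-sentence labels and tokens (same layout as the function's inputs).
labels = [[0, 1, 2, 0], [0, 0, 1, 0, 0]]
tokens = [["I", "met", "John", "today"], ["We", "saw", "Paris", "last", "week"]]

# Hypothetical output of find_label_issues, most suspect token first.
issues = [(1, 2), (0, 3)]

# Resolve each (sentence_index, token_index) pair to its token and given label.
report = [
    f"sentence {s}, token {t}: {tokens[s][t]!r} labeled {labels[s][t]}"
    for s, t in issues
]
for line in report:
    print(line)
```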
Usage Examples
```python
import numpy as np
from cleanlab.token_classification.filter import find_label_issues

# Labels for 3 sentences (0=O, 1=B-PER, 2=I-PER)
labels = [
    [0, 1, 2, 0],
    [0, 0, 1, 0, 0],
    [1, 2, 0],
]

# Predicted probabilities (K=3 classes)
pred_probs = [
    np.array([
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.85, 0.1, 0.05],
    ]),
    np.array([
        [0.95, 0.03, 0.02],
        [0.88, 0.07, 0.05],
        [0.3, 0.4, 0.3],  # low confidence - possible issue
        [0.9, 0.05, 0.05],
        [0.92, 0.04, 0.04],
    ]),
    np.array([
        [0.15, 0.75, 0.1],
        [0.1, 0.2, 0.7],
        [0.8, 0.1, 0.1],
    ]),
]

# Find token-level label issues
issues = find_label_issues(labels, pred_probs)
# issues is a list of (sentence_index, token_index) tuples,
# e.g., [(1, 2), ...] - most likely mislabeled tokens first

# With normalized_margin ranking
issues_margin = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="normalized_margin",
)

# Low memory mode for large datasets
issues_low_mem = find_label_issues(
    labels,
    pred_probs,
    low_memory=True,
)
```
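The two ranking options correspond to standard per-token quality scores. The sketch below computes both by hand for two tokens, using the usual definitions (self-confidence is the probability of the given label; normalized margin is that probability minus the largest competing class probability) — an assumption about the underlying scoring, not a guarantee of cleanlab's exact formula:

```python
import numpy as np

# Two tokens, both annotated with class 1; row 0 is ambiguous, row 1 confident.
probs = np.array([[0.3, 0.4, 0.3],
                  [0.1, 0.8, 0.1]])
given = np.array([1, 1])

# self_confidence: model's probability for the annotated label.
self_confidence = probs[np.arange(len(given)), given]

# normalized_margin: annotated-label probability minus the strongest competitor.
masked = probs.copy()
masked[np.arange(len(given)), given] = -np.inf
normalized_margin = self_confidence - masked.max(axis=1)

print(self_confidence)    # [0.4 0.8]
print(normalized_margin)  # ~[0.1 0.7]
```

Tokens with low scores rank first; note the margin separates the two tokens more sharply than self-confidence alone.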