
Principle:Cleanlab Token Issue Display

From Leeroopedia


Knowledge Sources: Cleanlab
Domains: Machine_Learning, Data_Quality, NLP
Last Updated: 2026-02-09

Overview

A visualization method that highlights problematic tokens within their original sentence context to facilitate human review of token-level label issues.

Description

Token issue display provides two complementary visualization capabilities for reviewing token classification label issues:

  • Issue display: Renders sentences with problematic tokens highlighted using color coding. It optionally shows the given label versus the predicted label for each flagged token, helping annotators understand what the model thinks the correct label should be.
  • Common issue summary: Aggregates error patterns across the dataset to identify systematic labeling mistakes, such as "B-PER frequently mislabeled as O" appearing N times.

Together, these capabilities support both individual sentence review and dataset-wide pattern analysis. The display function provides detailed per-sentence context, while the common issues function reveals systemic annotation problems that may warrant bulk corrections or annotator retraining.

Usage

Token issue display is the final step in a token classification quality audit workflow. After detecting token-level label issues, reviewers use these functions to:

  • Inspect individual issues: View sentences with highlighted problematic tokens and their given versus predicted labels.
  • Identify patterns: Discover recurring label errors that point to systematic annotation problems.
  • Guide corrections: Determine what the correct label should be based on the model's predictions and sentence context.
  • Prioritize effort: Focus on the most common error patterns that would have the largest impact on dataset quality.

Theoretical Basis

The display approach is based on two complementary techniques:

Token-Level Highlighting. Given the set of flagged (sentence_index, token_index) pairs:

For each sentence i containing flagged tokens:
    For each token j in sentence i:
        If (i, j) is flagged:
            highlight(token, color=error_color)
            annotate(token, given_label=labels[i][j], predicted_label=argmax(pred_probs[i][j]))
        Else:
            render(token, color=default_color)
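The highlighting pass can be sketched in plain Python. The renderer below uses bracket markers instead of colors, and the function name, class names, and toy data are illustrative assumptions, not cleanlab's actual API:

```python
import numpy as np

# Hypothetical label set for the sketch.
CLASS_NAMES = ["O", "B-PER", "I-PER"]

def render_sentence(tokens, flagged, labels, pred_probs):
    """Render one sentence, marking flagged tokens with their
    given vs. model-predicted label."""
    parts = []
    for j, token in enumerate(tokens):
        if j in flagged:
            given = CLASS_NAMES[labels[j]]
            predicted = CLASS_NAMES[int(np.argmax(pred_probs[j]))]
            parts.append(f"[{token}|given={given}|pred={predicted}]")
        else:
            parts.append(token)
    return " ".join(parts)

# Toy sentence: "Paris" was labeled O, but the model predicts B-PER.
tokens = ["Alice", "visited", "Paris"]
labels = [1, 0, 0]
pred_probs = np.array([
    [0.05, 0.90, 0.05],
    [0.95, 0.03, 0.02],
    [0.20, 0.75, 0.05],
])
flagged = {2}
print(render_sentence(tokens, flagged, labels, pred_probs))
# -> Alice visited [Paris|given=O|pred=B-PER]
```

An annotator scanning this output sees the disagreement inline, in the same position where the token appears in the sentence.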

The highlighting makes it immediately obvious which tokens are problematic within the natural sentence context, and the label annotation shows what the model believes the correct label should be.

Error Pattern Aggregation. The common_label_issues function aggregates individual token errors into a frequency table of label transitions:

For each flagged (sentence_index, token_index):
    given_label = labels[sentence_index][token_index]
    predicted_label = argmax(pred_probs[sentence_index][token_index])
    error_pattern = (given_label -> predicted_label)
    increment count for error_pattern
Sort patterns by frequency (descending)
Return top-k patterns as a DataFrame
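A minimal self-contained version of this aggregation can be written with collections.Counter in place of the DataFrame output. The helper name and toy data below are assumptions for illustration, not cleanlab's common_label_issues itself:

```python
from collections import Counter
import numpy as np

# Hypothetical label set for the sketch.
CLASS_NAMES = ["O", "B-PER", "I-PER"]

def common_error_patterns(issues, labels, pred_probs, top_k=3):
    """Count (given -> predicted) label transitions over all flagged
    (sentence_index, token_index) pairs and return the top_k patterns.

    `issues` is a list of flagged (i, j) pairs; `labels[i][j]` is the
    given label id and `pred_probs[i][j]` the per-class probabilities.
    """
    counts = Counter()
    for i, j in issues:
        given = CLASS_NAMES[labels[i][j]]
        predicted = CLASS_NAMES[int(np.argmax(pred_probs[i][j]))]
        counts[(given, predicted)] += 1
    return counts.most_common(top_k)

# Toy data: three flagged tokens, two sharing the same error pattern.
labels = [[1, 0], [0, 0]]
pred_probs = [
    [np.array([0.9, 0.05, 0.05]), np.array([0.1, 0.8, 0.1])],
    [np.array([0.2, 0.7, 0.1]), np.array([0.95, 0.03, 0.02])],
]
issues = [(0, 0), (0, 1), (1, 0)]
print(common_error_patterns(issues, labels, pred_probs))
# -> [(('O', 'B-PER'), 2), (('B-PER', 'O'), 1)]
```

Sorting by count surfaces the dominant transitions first, which is what makes the summary useful for triage.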

This reveals systematic errors such as "B-PER -> O" appearing 47 times, which may indicate a consistent boundary detection problem in the annotations.
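Once a dominant pattern like "B-PER -> O" is identified, one way to act on it is to pull out exactly the flagged tokens that exhibit it for a bulk correction pass. A sketch, with a hypothetical helper name and toy data:

```python
import numpy as np

# Hypothetical label set for the sketch.
CLASS_NAMES = ["O", "B-PER", "I-PER"]

def issues_matching_pattern(issues, labels, pred_probs, given, predicted):
    """Return the flagged (i, j) pairs whose given -> predicted labels
    match one specific error pattern, e.g. ("B-PER", "O")."""
    matched = []
    for i, j in issues:
        g = CLASS_NAMES[labels[i][j]]
        p = CLASS_NAMES[int(np.argmax(pred_probs[i][j]))]
        if (g, p) == (given, predicted):
            matched.append((i, j))
    return matched

# Toy data: two tokens given B-PER but predicted O, one the reverse.
labels = [[1], [1], [0]]
pred_probs = [
    [np.array([0.85, 0.10, 0.05])],
    [np.array([0.70, 0.20, 0.10])],
    [np.array([0.10, 0.85, 0.05])],
]
issues = [(0, 0), (1, 0), (2, 0)]
print(issues_matching_pattern(issues, labels, pred_probs, "B-PER", "O"))
# -> [(0, 0), (1, 0)]
```

Reviewing the matched subset together lets an annotator confirm or reject the whole pattern in one pass instead of revisiting issues one by one.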
