Workflow:Cleanlab Token Classification Label Quality

From Leeroopedia


Knowledge Sources
Domains: Data_Centric_AI, NLP, Token_Classification, Label_Quality
Last Updated: 2026-02-09 19:00 GMT

Overview

End-to-end process for detecting mislabeled tokens in Named Entity Recognition (NER) and other token classification datasets using cleanlab.

Description

This workflow detects label errors at the individual token level in token classification datasets. It handles the variable-length nature of text sequences by using a flatten/unflatten pattern: token-level labels and predicted probabilities from all sentences are processed to produce per-token quality scores, which are then aggregated into per-sentence scores. Sentences with the lowest scores are most likely to contain mislabeled tokens. The aggregation uses either a min or softmin method to combine individual token scores into a single sentence-level score.
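The flatten/unflatten pattern described above can be sketched in plain NumPy on toy data. The per-token score shown is the simple self-confidence score (the predicted probability of the given label); cleanlab's internals may differ in detail, but the shape handling is the same idea:

```python
import numpy as np

# Toy data: two sentences of different lengths, K = 3 classes.
labels = [[0, 2, 1], [1, 0]]
pred_probs = [
    np.array([[0.9, 0.05, 0.05], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]),
    np.array([[0.3, 0.6, 0.1], [0.8, 0.1, 0.1]]),
]

# Flatten: concatenate all sentences into one (N_tokens,) label vector
# and one (N_tokens, K) probability matrix.
flat_labels = np.concatenate(labels)
flat_probs = np.vstack(pred_probs)

# Per-token quality: self-confidence = predicted probability of the given label.
flat_scores = flat_probs[np.arange(len(flat_labels)), flat_labels]

# Unflatten: split token scores back per sentence by the original lengths,
# then aggregate each sentence's token scores (here with min).
lengths = [len(sent) for sent in labels]
per_sentence = np.split(flat_scores, np.cumsum(lengths)[:-1])
sentence_scores = np.array([s.min() for s in per_sentence])
```

Lower sentence scores indicate sentences more likely to contain at least one mislabeled token.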

Usage

Execute this workflow when you have a token classification dataset (such as NER, part-of-speech tagging, or chunking) where each token in each sentence has a class label, and you have trained a model that produces per-token predicted probabilities. This is appropriate for detecting annotation errors in datasets labeled with BIO/BIOES tagging schemes or similar token-level annotation formats, where individual tokens may have been assigned incorrect entity types.

Execution Steps

Step 1: Prepare Token Labels and Predictions

Format your token-level labels and model predictions into nested lists. Labels should be a nested list where each element contains the integer class labels for all tokens in one sentence. Predicted probabilities should be a list of numpy arrays, one per sentence, with shape (num_tokens, num_classes). Optionally prepare a nested list of token strings for annotated output.

Key considerations:

  • Labels format: nested list, labels[i] is a list of integer labels for sentence i
  • Pred_probs format: list of numpy arrays, pred_probs[i] has shape (T_i, K) for sentence i with T_i tokens
  • Class indices must be in 0, 1, ..., K-1 consistently across all sentences
  • Predictions should be out-of-sample, e.g. obtained via cross-validation, so the model scoring a sentence was never trained on that sentence
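A minimal sanity check of this format can be run before scoring. The data below is hypothetical; the assertions mirror the format requirements listed above:

```python
import numpy as np

K = 3  # number of classes

# Hypothetical nested labels, token strings, and per-sentence probabilities.
labels = [[0, 1, 1, 2], [2, 0]]
tokens = [["EU", "rejects", "German", "call"], ["Peter", "Blackburn"]]
pred_probs = [np.random.dirichlet(np.ones(K), size=len(sent)) for sent in labels]

# One entry per sentence in every structure.
assert len(labels) == len(pred_probs) == len(tokens)
for lab, probs in zip(labels, pred_probs):
    assert probs.shape == (len(lab), K)          # one row per token
    assert all(0 <= c < K for c in lab)          # class indices in 0..K-1
    assert np.allclose(probs.sum(axis=1), 1.0)   # each row is a distribution
```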

Step 2: Compute Label Quality Scores

Call get_label_quality_scores from cleanlab.token_classification.rank with the prepared labels and predicted probabilities. This computes a quality score for each individual token and aggregates them into sentence-level scores. The token-level scoring uses the same methods as standard classification (self-confidence, normalized margin, or confidence-weighted entropy).

Key considerations:

  • Token scoring method (default: self_confidence) determines how individual tokens are scored
  • Sentence scoring method (default: min) determines how token scores are aggregated per sentence
  • Softmin aggregation is sensitive to all tokens in the sentence, not just the worst one
  • Returns both sentence-level scores and per-token score details
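The difference between min and softmin aggregation can be illustrated in plain NumPy. The softmin weighting below is one common definition (exponential weights on the lowest scores) and is a sketch, not cleanlab's exact implementation:

```python
import numpy as np

# With cleanlab installed, the library call would be roughly:
#   from cleanlab.token_classification.rank import get_label_quality_scores
#   sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)

def softmin(scores, temperature=0.05):
    # Weighted average concentrating weight on the lowest scores.
    # Sketch of softmin aggregation; cleanlab's exact weighting may differ.
    w = np.exp(-scores / temperature)
    return float(np.dot(w / w.sum(), scores))

# Hypothetical per-token quality scores for one sentence.
token_scores = np.array([0.95, 0.90, 0.15, 0.88])

hard_min = float(token_scores.min())  # driven only by the single worst token
soft = softmin(token_scores)          # also nudged by the remaining tokens
```

With the small default temperature, softmin stays close to the hard minimum but remains sensitive to every token; raising the temperature moves it toward the plain mean.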

Step 3: Identify Sentences with Label Issues

Use the filter module to identify which sentences contain mislabeled tokens. This applies filtering based on the token-level quality scores to flag sentences that most likely contain at least one incorrectly labeled token.

Key considerations:

  • Sentences are ranked by their aggregate score; lower scores indicate more likely errors
  • The returned indices identify sentences for human review
  • Within flagged sentences, individual token scores pinpoint which specific tokens are suspect
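In cleanlab, the filter module returns likely error positions directly; the ranking logic it relies on can be sketched with NumPy on hypothetical sentence scores (the 40% review budget below is an arbitrary illustrative choice):

```python
import numpy as np

# Hypothetical sentence-level scores from Step 2 (lower = more suspect).
sentence_scores = np.array([0.91, 0.12, 0.78, 0.05, 0.66])

# Rank sentences from most to least likely to contain a label error.
ranked = np.argsort(sentence_scores)

# Flag the lowest-scoring 40% for human review (budget is a free choice).
n_flag = int(np.ceil(0.4 * len(sentence_scores)))
flagged = ranked[:n_flag]
```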

Step 4: Display and Summarize Issues

Use the summary module to generate human-readable displays of detected issues. This includes color-coded token displays that highlight potentially mislabeled tokens within their sentence context, and aggregate statistics showing which entity classes are most affected by annotation errors.

Key considerations:

  • Color-coded output shows which tokens in a sentence are most suspect
  • Per-class statistics reveal systematic annotation patterns
  • Token-level scores within flagged sentences help annotators focus their review
  • Export identified issues for annotation correction workflows
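The per-class statistics can be sketched with a simple tally. The flagged positions, labels, and class names below are hypothetical; cleanlab's summary module produces richer, color-coded output of the same kind:

```python
from collections import Counter

# Hypothetical flagged (sentence_index, token_index) pairs from Step 3.
issues = [(0, 2), (3, 0), (3, 4), (7, 1)]

# Hypothetical given labels for the flagged sentences, and class names.
labels = {0: [0, 0, 3, 0, 1], 3: [1, 0, 0, 0, 3], 7: [0, 4, 0]}
class_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

# Tally which annotated classes the suspect tokens carry, revealing
# which entity classes are most affected by annotation errors.
per_class = Counter(class_names[labels[i][j]] for i, j in issues)
```

A tally dominated by one class (here B-ORG) suggests a systematic annotation problem with that entity type rather than random noise.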

Execution Diagram

GitHub URL

Workflow Repository