Principle: Cleanlab Token Label Quality Scoring
| Knowledge Sources | Cleanlab |
|---|---|
| Domains | Machine_Learning, Data_Quality, NLP |
| Last Updated | 2026-02-09 |
Overview
Method for scoring label quality at both the token level and sentence level in sequence labeling tasks like named entity recognition.
Description
Token label quality scoring adapts cleanlab's label quality scoring to variable-length sequences. It operates at two granularity levels:
- Token-level scoring: Each individual token in a sentence receives a quality score reflecting how likely its label is correct, based on the model's predicted class probabilities for that token.
- Sentence-level scoring: Token scores within each sentence are aggregated to produce a single per-sentence score, enabling ranking of entire sentences by their overall label quality.
The two-level design enables identifying both the most problematic sentences in a dataset and the specific tokens within them that are likely mislabeled. This is particularly useful for sequence labeling tasks such as named entity recognition (NER), part-of-speech tagging, and chunking, where labels are assigned to individual tokens within variable-length sentences.
Token-level scores are computed using standard label quality methods (such as self_confidence or normalized_margin) applied to the flattened set of all tokens across all sentences. Sentence-level aggregation then uses methods like min or softmin to summarize the token scores within each sentence.
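The two-level pipeline can be sketched with plain NumPy (a minimal illustration, not cleanlab's actual implementation): score every token by self_confidence over the flattened probabilities, then aggregate per sentence with min. The helper names and the toy probabilities are hypothetical.

```python
import numpy as np

def token_self_confidence(pred_probs, given_labels):
    """Per-token self_confidence: model's probability for the annotated class."""
    return pred_probs[np.arange(len(given_labels)), given_labels]

def sentence_scores_min(token_scores, sentence_lengths):
    """Aggregate flattened token scores into one min-score per sentence."""
    scores, start = [], 0
    for length in sentence_lengths:
        scores.append(token_scores[start:start + length].min())
        start += length
    return np.array(scores)

# Two sentences (3 tokens and 2 tokens) over 3 classes, flattened into one array.
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # sentence 1, token 1
    [0.2, 0.7, 0.1],    # sentence 1, token 2
    [0.1, 0.1, 0.8],    # sentence 1, token 3: model disagrees with label below
    [0.6, 0.3, 0.1],    # sentence 2, token 1
    [0.5, 0.4, 0.1],    # sentence 2, token 2
])
given_labels = np.array([0, 1, 0, 0, 0])  # token 3's label looks suspect

token_scores = token_self_confidence(pred_probs, given_labels)
sentence_scores = sentence_scores_min(token_scores, sentence_lengths=[3, 2])
# Sentence 1 is scored by its worst token (0.1); sentence 2 by 0.5.
```

Ranking by `sentence_scores` surfaces sentence 1 first, and the low per-token score pinpoints its third token as the likely annotation error.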
Usage
Token label quality scoring is applied after training a token classification model (e.g., a NER model) and obtaining per-token predicted probabilities. The resulting scores support:
- Sentence ranking: Sorting sentences by quality score to prioritize human review on the most problematic examples.
- Token-level diagnostics: Identifying exactly which tokens within a sentence are likely mislabeled.
- Dataset auditing: Systematically evaluating the quality of sequence labeling annotations across a corpus.
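For the sentence-ranking use case above, sorting by ascending score puts the most suspect sentences first for human review. A small sketch with made-up sentences and scores:

```python
import numpy as np

# Hypothetical per-sentence quality scores (lower = more likely mislabeled).
sentences = ["Alice met Bob .", "Paris is nice .", "IBM hired Carol ."]
sentence_scores = np.array([0.82, 0.15, 0.47])

# Rank sentences worst-first so annotators review likely errors first.
ranked = np.argsort(sentence_scores)
for idx in ranked:
    print(f"{sentence_scores[idx]:.2f}  {sentences[idx]}")
```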
Theoretical Basis
The scoring procedure operates as follows:
Step 1: Token-Level Scoring. For each sentence i with T_i tokens, compute per-token quality scores. The tokens and their predicted probabilities are logically flattened across all sentences, and standard label quality scoring is applied:
- self_confidence: The model's predicted probability for the given label class. Higher values indicate the model agrees with the annotation.
- normalized_margin: The difference between the predicted probability of the given label and the highest predicted probability among the other classes, linearly rescaled from [-1, 1] to [0, 1].
self_confidence(token) = pred_probs[token][given_label]
normalized_margin(token) = (pred_probs[token][given_label] - max(pred_probs[token][other_classes]) + 1) / 2
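Both token-score formulas are straightforward to vectorize. The sketch below is an illustrative NumPy implementation, not cleanlab's code; it follows the common convention of rescaling the raw margin from [-1, 1] into [0, 1] so both methods share the same range.

```python
import numpy as np

def self_confidence(pred_probs, given_labels):
    # Probability the model assigns to each token's annotated class.
    return pred_probs[np.arange(len(given_labels)), given_labels]

def normalized_margin(pred_probs, given_labels):
    # Margin between the given label and the best competing class,
    # rescaled from [-1, 1] to [0, 1].
    n = len(given_labels)
    self_conf = pred_probs[np.arange(n), given_labels]
    masked = pred_probs.copy()
    masked[np.arange(n), given_labels] = -np.inf  # exclude the given class
    max_other = masked.max(axis=1)
    return (self_conf - max_other + 1.0) / 2.0

pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
given_labels = np.array([0, 1])
# self_confidence  -> [0.7, 0.3]
# normalized_margin -> [(0.7 - 0.2 + 1)/2, (0.3 - 0.6 + 1)/2] = [0.75, 0.35]
```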
Step 2: Sentence-Level Aggregation. Aggregate token-level scores within each sentence to produce a single sentence score:
- min method: Takes the minimum token score in the sentence. Simple but sensitive to individual outlier tokens.
- softmin method: Computes a smooth approximation of the minimum that is more robust to outlier tokens:
softmin(scores) = sum(scores * exp(-scores / temperature)) / sum(exp(-scores / temperature))
Relative to the hard minimum, softmin gives the single worst token slightly less than full weight, blending it with the other low-scoring tokens. This yields a more stable estimate of sentence quality while still emphasizing the worst tokens.
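The softmin formula above is a weighted average whose weights grow exponentially as scores shrink; as the temperature approaches zero it converges to the hard minimum. A minimal sketch (the temperature default here is an illustrative choice, not a library default):

```python
import numpy as np

def softmin(scores, temperature=0.05):
    # Weighted average where lower scores receive exponentially larger
    # weights; as temperature -> 0 this approaches the hard minimum.
    scores = np.asarray(scores, dtype=float)
    # Shift by the minimum before exponentiating for numerical stability;
    # the constant factor cancels in the ratio.
    weights = np.exp(-(scores - scores.min()) / temperature)
    return float(np.sum(scores * weights) / np.sum(weights))

token_scores = np.array([0.9, 0.8, 0.1])
# Hard min is 0.1; softmin lands just above it, smoothed by the other tokens.
```

With a higher temperature the estimate drifts further from the minimum toward the mean, trading sensitivity to the single worst token for robustness.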