Implementation: Cleanlab TC Get Label Quality Scores

| | |
|---|---|
| API | token_classification.rank.get_label_quality_scores |
| Source | cleanlab/token_classification/rank.py:L15-23 |
| Domains | Machine_Learning, Data_Quality, NLP |
| Last Updated | 2026-02-09 |
Overview
Implementation of two-level label quality scoring for token classification tasks. Computes per-token quality scores and aggregates them to sentence-level scores for ranking and review.
Description
This function takes token-level labels and predicted probabilities for a corpus of sentences and returns both sentence-level and token-level quality scores. It internally:
- Flattens all token labels and predicted probabilities across sentences into single arrays.
- Computes per-token quality scores using the specified scoring method (e.g., `self_confidence` or `normalized_margin`).
- Unflattens the token scores back into per-sentence groups.
- Aggregates token scores within each sentence using the specified aggregation method (e.g., `min` or `softmin`).
- Optionally creates DataFrames with token text when the `tokens` parameter is provided.
The function returns a tuple of sentence-level scores and token-level scores, enabling both coarse-grained (sentence ranking) and fine-grained (token identification) analysis.
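The flatten → score → unflatten → aggregate pipeline described above can be sketched in plain numpy. This is a conceptual illustration, not cleanlab's implementation: it covers only the `self_confidence` token method and `min` aggregation.

```python
import numpy as np

def self_confidence_scores(labels, probs):
    # Per-token score = model's predicted probability of the given label.
    return probs[np.arange(len(labels)), labels]

# Toy corpus: 2 sentences, K=2 classes.
labels = [[0, 1], [1, 0, 0]]
pred_probs = [
    np.array([[0.9, 0.1], [0.3, 0.7]]),
    np.array([[0.2, 0.8], [0.6, 0.4], [0.95, 0.05]]),
]

# 1. Flatten labels and probabilities across all sentences.
flat_labels = np.concatenate([np.asarray(l) for l in labels])
flat_probs = np.vstack(pred_probs)

# 2. Score each token.
flat_scores = self_confidence_scores(flat_labels, flat_probs)

# 3. Unflatten back into per-sentence groups.
lengths = [len(l) for l in labels]
token_scores = np.split(flat_scores, np.cumsum(lengths)[:-1])

# 4. Aggregate with "min": a sentence is only as clean as its worst token.
sentence_scores = np.array([s.min() for s in token_scores])
# token_scores    -> [array([0.9, 0.7]), array([0.8, 0.6, 0.95])]
# sentence_scores -> array([0.7, 0.6])
```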
Usage
This function is the primary entry point for scoring token classification label quality. It is used after training a token classification model and obtaining per-token predicted class probabilities on the training set. Results feed into downstream filtering and visualization functions.
Code Reference
Source Location
cleanlab/token_classification/rank.py, lines 15-23.
Signature
```python
def get_label_quality_scores(
    labels: list,
    pred_probs: list,
    *,
    tokens: Optional[list] = None,
    token_score_method: str = "self_confidence",
    sentence_score_method: str = "min",
    sentence_score_kwargs: dict = {},
) -> Tuple[np.ndarray, list]
```
Import
```python
from cleanlab.token_classification.rank import get_label_quality_scores
```
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| `labels` | `list` | List of N lists, where each inner list contains integer class labels for each token in the corresponding sentence. |
| `pred_probs` | `list` | List of N numpy arrays, each of shape `(T_i, K)`, where `T_i` is the number of tokens in sentence i and K is the number of classes. Each row contains the model's predicted class probabilities for a token. |
| `tokens` | `Optional[list]` | List of N lists, where each inner list contains the string tokens for the corresponding sentence. When provided, token-level scores are returned as DataFrames with token text. |
| `token_score_method` | `str` | Method for computing per-token scores: `"self_confidence"` (probability of the given label) or `"normalized_margin"` (margin between the given label and the next-best label). Defaults to `"self_confidence"`. |
| `sentence_score_method` | `str` | Method for aggregating token scores to the sentence level: `"min"` (minimum token score) or `"softmin"` (smooth minimum). Defaults to `"min"`. |
| `sentence_score_kwargs` | `dict` | Additional keyword arguments passed to the sentence-level aggregation method. Defaults to an empty dict. |
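To make the `softmin` option concrete: it is a smooth minimum of the token scores. The exact formulation and keyword names cleanlab accepts via `sentence_score_kwargs` are not spelled out here, so the `temperature` parameter below is an assumption for illustration; one common definition (lower temperature → closer to a hard minimum) is:

```python
import numpy as np

def softmin(scores, temperature=0.05):
    # Softmax-weighted average that concentrates weight on the lowest
    # token scores; as temperature -> 0 this approaches min(scores).
    # NOTE: illustrative formulation; `temperature` is a hypothetical kwarg.
    weights = np.exp(-scores / temperature)
    weights /= weights.sum()
    return float(np.dot(scores, weights))

scores = np.array([0.9, 0.7, 0.2])
print(softmin(scores))  # close to 0.2, the worst token score
```

Unlike a plain mean, this keeps the sentence score sensitive to a single badly scored token while still being differentiable in all token scores.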
Outputs
| Type | Description |
|---|---|
| `Tuple[np.ndarray, list]` | A tuple `(sentence_scores, token_scores)`. `sentence_scores` is an `np.ndarray` of shape `(N,)` with per-sentence quality scores between 0 and 1. `token_scores` is a list of N arrays (or DataFrames if `tokens` is provided), each containing per-token quality scores for the corresponding sentence. |
Usage Examples
```python
import numpy as np
from cleanlab.token_classification.rank import get_label_quality_scores

# Labels for 3 sentences (integer class IDs, e.g., 0=O, 1=B-PER, 2=I-PER)
labels = [
    [0, 1, 2, 0],     # 4 tokens
    [0, 0, 1, 0, 0],  # 5 tokens
    [1, 2, 0],        # 3 tokens
]

# Predicted probabilities (K=3 classes)
pred_probs = [
    np.array([
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.85, 0.1, 0.05],
    ]),
    np.array([
        [0.95, 0.03, 0.02],
        [0.88, 0.07, 0.05],
        [0.2, 0.6, 0.2],
        [0.9, 0.05, 0.05],
        [0.92, 0.04, 0.04],
    ]),
    np.array([
        [0.15, 0.75, 0.1],
        [0.1, 0.2, 0.7],
        [0.8, 0.1, 0.1],
    ]),
]

# Compute quality scores
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
# sentence_scores: np.ndarray of shape (3,)
# token_scores: list of 3 arrays

# With token text and softmin aggregation
tokens = [
    ["John", "lives", "in", "Paris"],
    ["The", "weather", "is", "nice", "today"],
    ["Alice", "Smith", "left"],
]
sentence_scores, token_scores = get_label_quality_scores(
    labels,
    pred_probs,
    tokens=tokens,
    token_score_method="self_confidence",
    sentence_score_method="softmin",
)
# token_scores is now a list of DataFrames with token text and scores
```
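The sentence-level scores can then drive the ranking-and-review workflow mentioned under Usage. A small illustration with made-up scores (lower = more likely to contain a label error), assuming the example sentences above:

```python
import numpy as np

# Hypothetical sentence-level scores, as might be returned above.
sentence_scores = np.array([0.71, 0.95, 0.33])
sentences = [
    "John lives in Paris",
    "The weather is nice today",
    "Alice Smith left",
]

# Order sentences from most to least suspect for manual review.
ranking = np.argsort(sentence_scores)
for i in ranking:
    print(f"{sentence_scores[i]:.2f}  {sentences[i]}")
# Prints "Alice Smith left" first, since it has the lowest score (0.33).
```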