Implementation: Cleanlab TC Get Label Quality Scores

| | |
|---|---|
| API | token_classification.rank.get_label_quality_scores |
| Source | cleanlab/token_classification/rank.py:L15-23 |
| Domains | Machine_Learning, Data_Quality, NLP |
| Last Updated | 2026-02-09 |
Overview
Implementation of two-level label quality scoring for token classification tasks. Computes per-token quality scores and aggregates them to sentence-level scores for ranking and review.
Description
This function takes token-level labels and predicted probabilities for a corpus of sentences and returns both sentence-level and token-level quality scores. It internally:
- Flattens all token labels and predicted probabilities across sentences into single arrays.
- Computes per-token quality scores using the specified scoring method (e.g., `self_confidence` or `normalized_margin`).
- Unflattens the token scores back into per-sentence groups.
- Aggregates token scores within each sentence using the specified aggregation method (e.g., `min` or `softmin`).
- Optionally creates DataFrames with token text when the `tokens` parameter is provided.
The function returns a tuple of sentence-level scores and token-level scores, enabling both coarse-grained (sentence ranking) and fine-grained (token identification) analysis.
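The flatten → score → unflatten → aggregate pipeline described above can be sketched in plain numpy. This is a conceptual illustration, not cleanlab's implementation: it covers only the `self_confidence` token method and `min` aggregation.

```python
import numpy as np

def self_confidence_scores(labels, probs):
    # Per-token score = model's predicted probability of the given label.
    return probs[np.arange(len(labels)), labels]

# Toy corpus: 2 sentences, K=2 classes.
labels = [[0, 1], [1, 0, 0]]
pred_probs = [
    np.array([[0.9, 0.1], [0.3, 0.7]]),
    np.array([[0.2, 0.8], [0.6, 0.4], [0.95, 0.05]]),
]

# 1. Flatten labels and probabilities across all sentences.
flat_labels = np.concatenate([np.asarray(l) for l in labels])
flat_probs = np.vstack(pred_probs)

# 2. Score each token.
flat_scores = self_confidence_scores(flat_labels, flat_probs)

# 3. Unflatten back into per-sentence groups.
lengths = [len(l) for l in labels]
token_scores = np.split(flat_scores, np.cumsum(lengths)[:-1])

# 4. Aggregate with "min": a sentence is only as clean as its worst token.
sentence_scores = np.array([s.min() for s in token_scores])
# token_scores    -> [array([0.9, 0.7]), array([0.8, 0.6, 0.95])]
# sentence_scores -> array([0.7, 0.6])
```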
Usage
This function is the primary entry point for scoring token classification label quality. It is used after training a token classification model and obtaining per-token predicted class probabilities on the training set. Results feed into downstream filtering and visualization functions.
Code Reference
Source Location
cleanlab/token_classification/rank.py, lines 15-23.
Signature
```python
def get_label_quality_scores(
    labels: list,
    pred_probs: list,
    *,
    tokens: Optional[list] = None,
    token_score_method: str = "self_confidence",
    sentence_score_method: str = "min",
    sentence_score_kwargs: dict = {},
) -> Tuple[np.ndarray, list]
```
Import
```python
from cleanlab.token_classification.rank import get_label_quality_scores
```
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| `labels` | `list` | List of N lists, where each inner list contains integer class labels for each token in the corresponding sentence. |
| `pred_probs` | `list` | List of N numpy arrays, each of shape `(T_i, K)`, where `T_i` is the number of tokens in sentence i and K is the number of classes. Each row contains the model's predicted class probabilities for a token. |
| `tokens` | `Optional[list]` | List of N lists, where each inner list contains the string tokens for the corresponding sentence. When provided, token-level scores are returned as DataFrames with token text. |
| `token_score_method` | `str` | Method for computing per-token scores: `"self_confidence"` (probability of the given label) or `"normalized_margin"` (margin between the given label and the next-best label). Defaults to `"self_confidence"`. |
| `sentence_score_method` | `str` | Method for aggregating token scores to the sentence level: `"min"` (minimum token score) or `"softmin"` (smooth minimum). Defaults to `"min"`. |
| `sentence_score_kwargs` | `dict` | Additional keyword arguments passed to the sentence-level aggregation method. Defaults to an empty dict. |
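To make the `softmin` option concrete: it is a smooth minimum of the token scores. The exact formulation and keyword names cleanlab accepts via `sentence_score_kwargs` are not spelled out here, so the `temperature` parameter below is an assumption for illustration; one common definition (lower temperature → closer to a hard minimum) is:

```python
import numpy as np

def softmin(scores, temperature=0.05):
    # Softmax-weighted average that concentrates weight on the lowest
    # token scores; as temperature -> 0 this approaches min(scores).
    # NOTE: illustrative formulation; `temperature` is a hypothetical kwarg.
    weights = np.exp(-scores / temperature)
    weights /= weights.sum()
    return float(np.dot(scores, weights))

scores = np.array([0.9, 0.7, 0.2])
print(softmin(scores))  # close to 0.2, the worst token score
```

Unlike a plain mean, this keeps the sentence score sensitive to a single badly scored token while still being differentiable in all token scores.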
Outputs
| Type | Description |
|---|---|
| `Tuple[np.ndarray, list]` | A tuple `(sentence_scores, token_scores)`. `sentence_scores` is an `np.ndarray` of shape `(N,)` with per-sentence quality scores between 0 and 1. `token_scores` is a list of N arrays (or DataFrames if `tokens` is provided), each containing per-token quality scores for the corresponding sentence. |
Usage Examples
```python
import numpy as np
from cleanlab.token_classification.rank import get_label_quality_scores

# Labels for 3 sentences (integer class IDs, e.g., 0=O, 1=B-PER, 2=I-PER)
labels = [
    [0, 1, 2, 0],     # 4 tokens
    [0, 0, 1, 0, 0],  # 5 tokens
    [1, 2, 0],        # 3 tokens
]

# Predicted probabilities (K=3 classes)
pred_probs = [
    np.array([
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.85, 0.1, 0.05],
    ]),
    np.array([
        [0.95, 0.03, 0.02],
        [0.88, 0.07, 0.05],
        [0.2, 0.6, 0.2],
        [0.9, 0.05, 0.05],
        [0.92, 0.04, 0.04],
    ]),
    np.array([
        [0.15, 0.75, 0.1],
        [0.1, 0.2, 0.7],
        [0.8, 0.1, 0.1],
    ]),
]

# Compute quality scores
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
# sentence_scores: np.ndarray of shape (3,)
# token_scores: list of 3 arrays

# With token text and softmin aggregation
tokens = [
    ["John", "lives", "in", "Paris"],
    ["The", "weather", "is", "nice", "today"],
    ["Alice", "Smith", "left"],
]
sentence_scores, token_scores = get_label_quality_scores(
    labels,
    pred_probs,
    tokens=tokens,
    token_score_method="self_confidence",
    sentence_score_method="softmin",
)
# token_scores is now a list of DataFrames with token text and scores
```
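The sentence-level scores can then drive the ranking-and-review workflow mentioned under Usage. A small illustration with made-up scores (lower = more likely to contain a label error), assuming the example sentences above:

```python
import numpy as np

# Hypothetical sentence-level scores, as might be returned above.
sentence_scores = np.array([0.71, 0.95, 0.33])
sentences = [
    "John lives in Paris",
    "The weather is nice today",
    "Alice Smith left",
]

# Order sentences from most to least suspect for manual review.
ranking = np.argsort(sentence_scores)
for i in ranking:
    print(f"{sentence_scores[i]:.2f}  {sentences[i]}")
# Prints "Alice Smith left" first, since it has the lowest score (0.33).
```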