Implementation:Cleanlab Cleanlab Multilabel Get Label Quality Scores
| Knowledge Sources | |
|---|---|
| Domains | Multi-Label Classification, Data Quality, Label Scoring |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Computes a label quality score for each example in a multi-label classification dataset, quantifying how likely each example's set of class annotations is correct.
Description
The get_label_quality_scores function in the multilabel classification rank module computes per-example quality scores between 0 and 1, where lower scores indicate examples whose labels more likely contain annotation errors. It handles the multi-label setting where each example can belong to zero, one, or multiple classes simultaneously and model-predicted probabilities need not sum to 1 across classes.
The function works by first converting multi-label lists to binary one-hot representations, then computing separate quality scores for each class using a configurable scoring method (e.g., self_confidence, normalized_margin, or confidence_weighted_entropy) via a one-vs-rest approach. These per-class scores are then aggregated into a single example-level score using an Aggregator (default: exponential moving average with alpha=0.8). A companion function, get_label_quality_scores_per_class, returns the unaggregated per-class scores.
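The two stages described above can be illustrated with a small dependency-free sketch. It is not the library's implementation: the helper names `self_confidence_per_class` and `ema_aggregate` are hypothetical, and folding the scores from highest to lowest (so that the lowest score receives weight alpha) is an assumption of this sketch.

```python
def self_confidence_per_class(labels, pred_probs):
    """One-vs-rest self-confidence: p if the class is annotated, else 1 - p."""
    scores = []
    for given, probs in zip(labels, pred_probs):
        annotated = set(given)
        scores.append(
            [p if k in annotated else 1.0 - p for k, p in enumerate(probs)]
        )
    return scores


def ema_aggregate(class_scores, alpha=0.8):
    """Fold scores from highest to lowest (assumed order), weighting each new score by alpha."""
    ordered = sorted(class_scores, reverse=True)
    running = ordered[0]
    for score in ordered[1:]:
        running = alpha * score + (1.0 - alpha) * running
    return running


labels = [[1], [0, 2]]
pred_probs = [[0.1, 0.9, 0.1], [0.4, 0.1, 0.9]]

per_class = self_confidence_per_class(labels, pred_probs)
example_scores = [ema_aggregate(row) for row in per_class]
```

Under these assumptions, the second example's per-class scores are roughly 0.4, 0.9, and 0.9, and the low score for class 0 pulls the aggregate down to about 0.5, while the first example stays near 0.9.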
Usage
Import this function when you have a multi-label classification dataset and want to identify which examples are most likely to have incorrect label annotations. Use it to rank examples by annotation quality for data cleaning or review prioritization. This is the core ranking function exported at the package level via cleanlab.multilabel_classification.
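A minimal sketch of review prioritization, assuming the scores have already been computed. The score values and the review budget of two examples are made up for illustration.

```python
# Hypothetical scores standing in for the output of get_label_quality_scores.
scores = [0.91, 0.35, 0.78, 0.12]

# Indices ordered from lowest score (most suspicious) to highest.
ranked = sorted(range(len(scores)), key=lambda i: scores[i])

# Arbitrary review budget: inspect the two lowest-scoring examples first.
to_review = ranked[:2]
```

When the scores are held in a NumPy array, np.argsort(scores) gives the same ascending ordering.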
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/multilabel_classification/rank.py
- Lines: 53-121 (get_label_quality_scores), 124-179 (get_label_quality_scores_per_class)
Signature
def get_label_quality_scores(
    labels: List[List[int]],
    pred_probs: npt.NDArray["np.floating[T]"],
    *,
    method: str = "self_confidence",
    adjust_pred_probs: bool = False,
    aggregator_kwargs: Dict[str, Any] = {"method": "exponential_moving_average", "alpha": 0.8},
) -> npt.NDArray["np.floating[T]"]:
def get_label_quality_scores_per_class(
    labels: List[List[int]],
    pred_probs: npt.NDArray["np.floating[T]"],
    *,
    method: str = "self_confidence",
    adjust_pred_probs: bool = False,
) -> np.ndarray:
Import
from cleanlab.multilabel_classification.rank import get_label_quality_scores
from cleanlab.multilabel_classification.rank import get_label_quality_scores_per_class
I/O Contract
Inputs (get_label_quality_scores)
| Name | Type | Required | Description |
|---|---|---|---|
| labels | List[List[int]] | Yes | List of noisy multi-label annotations. Each inner list contains the class indices that apply to that example (e.g., [[1], [0, 2]] means example 0 has class 1 and example 1 has classes 0 and 2). |
| pred_probs | np.ndarray | Yes | Array of shape (N, K) with model-predicted class probabilities, where N is the number of examples and K is the number of classes. Probabilities need not sum to 1 per row. |
| method | str | No (default: "self_confidence") | Scoring method for per-class annotation scores. Options: "self_confidence", "normalized_margin", "confidence_weighted_entropy". |
| adjust_pred_probs | bool | No (default: False) | Whether to adjust predicted probabilities to account for class imbalance. |
| aggregator_kwargs | Dict[str, Any] | No (default: {"method": "exponential_moving_average", "alpha": 0.8}) | Hyperparameters for aggregating per-class scores. Options for "method": "exponential_moving_average", "softmin", or a custom callable. |
Outputs (get_label_quality_scores)
| Name | Type | Description |
|---|---|---|
| label_quality_scores | np.ndarray | 1D array of shape (N,) with quality scores between 0 and 1. Lower scores indicate examples more likely to contain annotation errors. |
Inputs (get_label_quality_scores_per_class)
| Name | Type | Required | Description |
|---|---|---|---|
| labels | List[List[int]] | Yes | Multi-label annotations (same format as above) |
| pred_probs | np.ndarray | Yes | Model predictions of shape (N, K) (same format as above) |
| method | str | No (default: "self_confidence") | Scoring method for per-class annotation scores |
| adjust_pred_probs | bool | No (default: False) | Whether to adjust for class imbalance |
Outputs (get_label_quality_scores_per_class)
| Name | Type | Description |
|---|---|---|
| label_quality_scores | list(np.ndarray) | K arrays, each of shape (N,). The return value is documented as a list, although the signature annotates it as np.ndarray. label_quality_scores[k][i] is the quality score for class k's annotation on example i. |
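As a sketch of how the [k][i] indexing can be used, the snippet below finds the lowest-scoring class for each example. The score values are hypothetical and the nested list only stands in for the function's return value.

```python
# Hypothetical per-class scores indexed as [k][i]: K = 3 classes, N = 2 examples.
per_class_scores = [
    [0.90, 0.40],  # class 0
    [0.70, 0.90],  # class 1
    [0.95, 0.90],  # class 2
]
num_classes = len(per_class_scores)
num_examples = len(per_class_scores[0])

# For each example i, pick the class k whose annotation has the lowest score.
weakest_class = [
    min(range(num_classes), key=lambda k: per_class_scores[k][i])
    for i in range(num_examples)
]
```

Here example 0 is weakest on class 1 and example 1 on class 0, which points a reviewer at the specific annotation to check.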
Internal Pipeline
The scoring pipeline consists of three stages:
- Validation and Conversion: Input labels are validated via assert_valid_inputs and converted from list-of-lists format to binary one-hot representation using int2onehot.
- Per-Class Scoring: A MultilabelScorer is created via the factory function _create_multilabel_scorer, which wraps a ClassLabelScorer (specifying the scoring method) and an Aggregator (specifying how to combine per-class scores).
- Aggregation: Per-class binary classification quality scores are combined into a single example-level score using the configured aggregation method.
Usage Examples
Basic Usage
from cleanlab.multilabel_classification.rank import get_label_quality_scores
import numpy as np
# Example: 2 examples, 3 classes
labels = [[1], [0, 2]]
pred_probs = np.array([[0.1, 0.9, 0.1], [0.4, 0.1, 0.9]])
scores = get_label_quality_scores(labels, pred_probs)
print(scores)  # prints approximately [0.9 0.5]
Per-Class Scores
from cleanlab.multilabel_classification.rank import get_label_quality_scores_per_class
import numpy as np
labels = [[1], [0, 2]]
pred_probs = np.array([[0.1, 0.9, 0.1], [0.4, 0.1, 0.9]])
per_class_scores = get_label_quality_scores_per_class(labels, pred_probs)
# Returns list of 3 arrays (one per class), each of length 2 (one per example)
Custom Aggregation
from cleanlab.multilabel_classification.rank import get_label_quality_scores
import numpy as np
labels = [[1], [0, 2]]
pred_probs = np.array([[0.1, 0.9, 0.1], [0.4, 0.1, 0.9]])
# Use softmin aggregation instead of default EMA
scores = get_label_quality_scores(
    labels,
    pred_probs,
    method="normalized_margin",
    aggregator_kwargs={"method": "softmin"},
)