Principle:Cleanlab Cleanlab Multilabel Label Quality Scoring
| Knowledge Sources | |
|---|---|
| Domains | Multi-Label Classification, Data Quality, Confident Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A scoring framework that quantifies the quality of multi-label annotations by decomposing the problem into independent binary classification quality assessments per class and then aggregating these per-class scores into an overall example-level quality measure.
Description
In multi-label classification, each example can belong to zero, one, or many classes simultaneously, making label quality assessment more complex than in standard multi-class settings. A label error can take multiple forms: a class that should be present may be missing, a class that should be absent may be incorrectly included, or both types of errors may co-occur for different classes within the same example.
Multilabel label quality scoring addresses this by treating the multi-label problem as K independent binary classification problems (one per class) using a one-vs-rest decomposition. For each class, the model's predicted probability and the given binary annotation (present or absent) are used to compute a quality score using established binary classification scoring methods. These K per-class scores are then combined into a single overall score using an aggregation function that emphasizes the worst individual scores, since a single incorrect class annotation renders the entire multi-label annotation problematic.
Usage
This scoring principle is the right choice when:
- Each example in your dataset can have multiple class labels simultaneously.
- Model-predicted probabilities do not necessarily sum to 1 across classes.
- You need a single numeric score per example to rank and prioritize data for review.
- You want flexibility in choosing both the per-class scoring method and the aggregation strategy.
Theoretical Basis
One-vs-Rest Decomposition
The multi-label scoring problem for K classes is decomposed into K independent binary classification problems. For class k and example i:
- The binary label is y_ik = 1 if class k is in the label set for example i, else y_ik = 0.
- The predicted probability is p_ik, the model's predicted probability that class k applies to example i.
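The decomposition can be sketched in a few lines. This is a minimal illustration, assuming labels arrive as a list of class indices per example; the helper name `to_binary_labels` is hypothetical and not part of any library.

```python
def to_binary_labels(label_sets, num_classes):
    """Turn per-example label sets into a K-column 0/1 matrix of y_ik values."""
    binary = []
    for labels in label_sets:
        present = set(labels)
        binary.append([1 if k in present else 0 for k in range(num_classes)])
    return binary


# Three examples over K = 4 classes; the second example belongs to no class.
label_sets = [[0, 2], [], [1, 2, 3]]
y = to_binary_labels(label_sets, num_classes=4)
# Column k of y is the binary label vector for the k-th one-vs-rest problem.
```

Each column of `y`, paired with the matching column of predicted probabilities, forms one of the K binary problems.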
Per-Class Scoring Methods
Three scoring methods are available for computing the quality score s_ik for class k on example i:
- Self-Confidence: The model's predicted probability of the given label. For a binary setting: s_ik = p_ik if y_ik = 1, or s_ik = 1 - p_ik if y_ik = 0. This directly measures how confident the model is that the given annotation is correct.
- Normalized Margin: The difference between the model's predicted probability of the given class and the maximum probability of any other class, normalized to [0, 1]. In each one-vs-rest binary problem, the only alternative is the complementary outcome (class k absent when it is annotated as present, and vice versa). This captures not just confidence in the given label but also how much more confident the model is in the given label than in the alternatives.
- Confidence-Weighted Entropy: Combines the self-confidence score with the entropy of the predicted probability distribution, penalizing examples where the model is highly uncertain even if the top prediction agrees with the given label.
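As a rough sketch of the first two methods in the binary setting, the functions below compute s_ik from p_ik and y_ik. The function names are illustrative, and the sketch assumes the only alternative to the given outcome is its complement with probability 1 - p. Confidence-weighted entropy is left out because its exact combination formula is not spelled out here.

```python
def self_confidence(p, y):
    """Probability the model assigns to the given binary annotation."""
    return p if y == 1 else 1.0 - p


def normalized_margin(p, y):
    """Margin between the given outcome and the other outcome, mapped to [0, 1]."""
    given = self_confidence(p, y)
    other = 1.0 - given  # assumption: the complement is the only alternative
    return (given - other + 1.0) / 2.0


def per_class_scores(pred_probs, binary_labels, scorer=self_confidence):
    """Apply a binary scorer to every (example, class) pair to obtain s_ik."""
    return [
        [scorer(p, y) for p, y in zip(prob_row, label_row)]
        for prob_row, label_row in zip(pred_probs, binary_labels)
    ]


pred_probs = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
binary_labels = [[1, 0, 1], [1, 1, 0]]
scores = per_class_scores(pred_probs, binary_labels)
# The low score at scores[1][0] flags class 0 on the second example:
# it is annotated as present, yet the model gives it probability 0.1.
```

Under the two-outcome assumption, the normalized margin reduces algebraically to the self-confidence value, so the two scorers only diverge when the underlying distribution has more than two outcomes.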
Aggregation Strategies
After computing K per-class scores for each example, these must be aggregated into a single example-level score. The aggregation must be sensitive to poor individual class scores because even one incorrect class annotation constitutes a label issue:
- Exponential Moving Average (default, alpha=0.8): Sorts the per-class scores and computes a weighted average whose weights grow exponentially toward the lowest scores. With alpha=0.8, the worst class annotations dominate the result while the remaining scores still contribute.
- Softmin: A differentiable approximation of the minimum function that computes a weighted average where weights are determined by a softmax over negative scores. This produces scores close to the minimum per-class score while being smoother than the hard minimum and less sensitive to a single outlier.
- Custom Callable: Users can provide their own aggregation function for domain-specific needs.
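The three strategies can be sketched as follows. This is one plausible realization, not a definitive implementation: the recurrence used for the moving average, the `temperature` parameter of the softmin, and the `aggregate` wrapper are all assumptions made for illustration.

```python
import math


def exponential_moving_average(scores, alpha=0.8):
    """Fold scores from highest to lowest so the lowest ones weigh the most."""
    ordered = sorted(scores, reverse=True)
    ema = ordered[0]
    for s in ordered[1:]:
        # Assumed recurrence: the newest (lower) score gets weight alpha.
        ema = alpha * s + (1.0 - alpha) * ema
    return ema


def softmin(scores, temperature=0.1):
    """Weighted average with softmax weights over negative scores."""
    shift = min(scores)  # subtracting the minimum keeps exp() well behaved
    weights = [math.exp(-(s - shift) / temperature) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def aggregate(scores, aggregator=exponential_moving_average):
    """Reduce K per-class scores to a single example-level score."""
    return aggregator(scores)


class_scores = [0.9, 0.1, 0.8]
ema_score = aggregate(class_scores)
softmin_score = aggregate(class_scores, softmin)
strict_score = aggregate(class_scores, min)  # custom callable: hard minimum
```

Any function that maps a sequence of per-class scores to one number can serve as the custom callable; the built-in `min` is the strictest choice.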
Why Aggregation Matters
Simple aggregation methods like taking the mean would dilute the signal from a single mislabeled class among many correctly labeled classes. For example, if an example has 10 classes and 9 are correctly annotated (score near 1.0) but one is incorrect (score near 0.0), the mean score would be approximately 0.9, failing to flag this example as problematic. The exponential moving average and softmin methods ensure that the overall score is pulled down significantly by even a single low per-class score.
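The ten-class scenario can be checked numerically. The sketch below compares the mean with an exponential moving average over scores sorted from highest to lowest; the recurrence is an assumed form of that aggregator, using alpha=0.8 as in the default above.

```python
from functools import reduce

# Nine correctly annotated classes and one incorrect class.
scores = [1.0] * 9 + [0.0]
alpha = 0.8

mean_score = sum(scores) / len(scores)

# Assumed recurrence: fold from the highest score down to the lowest,
# giving each newly visited (lower) score weight alpha.
ordered = sorted(scores, reverse=True)
ema_score = reduce(lambda ema, s: alpha * s + (1.0 - alpha) * ema,
                   ordered[1:], ordered[0])

print(f"mean={mean_score:.2f}  ema={ema_score:.2f}  min={min(scores):.2f}")
# prints: mean=0.90  ema=0.20  min=0.00
```

The mean leaves the example looking clean at 0.90, while the moving average drops to 0.20, low enough for the example to surface near the top of a review queue.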