Principle:Cleanlab Cleanlab Multilabel Quality Scoring
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Multi-label Classification, Machine Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Multilabel quality scoring is the process of assessing the correctness of each example's labels in a multi-label classification setting by decomposing the problem into independent binary tasks and then aggregating per-class quality scores into a single overall score.
Description
In multi-label classification, each example can belong to zero or more classes simultaneously. Detecting label errors in this setting is more complex than in standard multi-class classification because errors can occur independently in any subset of the K class labels for a given example. Rather than developing entirely new scoring methods, multilabel quality scoring leverages a decompose-then-aggregate strategy:
Decomposition: The K-class multi-label problem is treated as K independent binary classification problems. For each class k, the binary label (0 or 1) and the predicted probability for that class are extracted. The predicted probability is converted to a two-column format [1 - p, p] to create a standard binary classification probability matrix. An existing binary label quality scoring method (e.g., self-confidence, normalized margin, or confidence-weighted entropy) is then applied to each binary problem independently.
Aggregation: The K per-class quality scores for each example are combined into a single overall label quality score using an aggregation function. The aggregation must be sensitive to the worst-scoring class (since a single mislabeled class constitutes a label error) while remaining robust to the distribution of scores across classes.
The result is a single score per example in [0, 1], where lower scores indicate examples more likely to have at least one incorrect label.
Usage
Multilabel quality scoring is appropriate whenever you have a multi-label classification dataset (where labels are represented as binary indicator matrices) and want to identify examples that may have one or more incorrect class assignments. It requires out-of-sample predicted probabilities from a trained multi-label classifier.
Theoretical Basis
Binary Decomposition
Given multi-label labels Y of shape (N, K) and predicted probabilities P of shape (N, K), for each class k:
- Binary labels: y_k = Y[:, k] (values in {0, 1})
- Binary pred probs: p_k = [1 - P[:, k], P[:, k]], stacked as two columns (shape N x 2)
- Per-class score: s_k = scorer(y_k, p_k) (shape N)
This yields a score matrix S of shape (N, K) where S[i][k] is the quality score for example i's label on class k.
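The decomposition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the function names `self_confidence` and `per_class_scores` are hypothetical, nested lists stand in for arrays, and self-confidence is used as the binary scorer.

```python
def self_confidence(binary_labels, binary_pred_probs):
    """Probability the model assigns to each example's given binary label."""
    return [probs[label] for label, probs in zip(binary_labels, binary_pred_probs)]


def per_class_scores(Y, P, scorer=self_confidence):
    """Build the N x K score matrix S by scoring each class as a binary task.

    Y and P are N x K nested lists (hypothetical stand-ins for arrays).
    """
    n_examples, n_classes = len(Y), len(Y[0])
    S = [[0.0] * n_classes for _ in range(n_examples)]
    for k in range(n_classes):
        y_k = [row[k] for row in Y]                  # binary labels for class k
        p_k = [[1.0 - row[k], row[k]] for row in P]  # N x 2 binary pred probs
        s_k = scorer(y_k, p_k)                       # N per-class scores
        for i, score in enumerate(s_k):
            S[i][k] = score
    return S


# Toy data: 2 examples, 3 classes.
Y = [[1, 0, 1], [0, 1, 0]]
P = [[0.9, 0.2, 0.3], [0.1, 0.8, 0.6]]
S = per_class_scores(Y, P)
```

In this toy data, the first example's third class is labeled 1 but predicted at only 0.3, so `S[0][2]` is the lowest entry in its row and is the one an aggregation step should be sensitive to.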
Scoring Methods
Three binary scoring methods are available:
Self-Confidence: The predicted probability assigned to the given label. For a binary problem with label y and predicted probability p, self_confidence = p if y = 1, or (1 - p) if y = 0.
Normalized Margin: The difference between the probability assigned to the given label and the maximum probability assigned to any other class, normalized to [0, 1].
Confidence-Weighted Entropy: An entropy-based measure that weights the predictive uncertainty by the model's confidence in the given label.
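To make the first two definitions concrete for a single binary problem, here is a small sketch. It assumes the margin, which lies in [-1, 1], is mapped to [0, 1] as (margin + 1) / 2; the text only says the margin is "normalized to [0, 1]", so that mapping and both function names are assumptions made for illustration.

```python
def self_confidence(label, p):
    """Predicted probability of the given binary label (p is P(class = 1))."""
    return p if label == 1 else 1.0 - p


def normalized_margin(label, p):
    """Margin between the given label and the other label, mapped to [0, 1].

    Assumption: normalization is (margin + 1) / 2.
    """
    probs = [1.0 - p, p]
    margin = probs[label] - probs[1 - label]  # in a binary task there is only one other class
    return (margin + 1.0) / 2.0


cases = [(1, 0.7), (0, 0.7), (1, 0.1), (0, 0.25)]
pairs = [(self_confidence(y, p), normalized_margin(y, p)) for y, p in cases]
```

Under that normalization the two scores coincide for binary problems, since the margin is p_y - (1 - p_y) = 2 * p_y - 1. The choice between them therefore matters less after decomposition than it does in the multi-class setting.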
Aggregation Methods
Exponential Moving Average (EMA): The per-class scores are sorted in descending order, and the EMA is computed recursively:
EMA_1 = s_1 (largest score)
EMA_t = alpha * s_t + (1 - alpha) * EMA_{t-1}, for t = 2, ..., K
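where alpha is a forgetting factor in [0, 1]. Because the scores are processed from largest to smallest, a higher alpha places more weight on the lower-scoring classes, making the aggregate more sensitive to individual class errors. Multilabel scoring uses alpha = 0.8 by default; if the EMA is computed without an explicit alpha, it falls back to 2/(K+1).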
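The recursion can be written out directly for one example's K per-class scores. This is a sketch under the definitions above; the function name `ema_aggregate` is hypothetical, and passing `alpha=None` is used here to trigger the 2/(K+1) fallback.

```python
def ema_aggregate(scores, alpha=None):
    """Aggregate one example's per-class scores with an exponential moving average."""
    K = len(scores)
    if alpha is None:
        alpha = 2.0 / (K + 1)               # fallback forgetting factor
    ordered = sorted(scores, reverse=True)  # s_1 is the largest score
    ema = ordered[0]                        # EMA_1 = s_1
    for s_t in ordered[1:]:                 # t = 2, ..., K
        ema = alpha * s_t + (1.0 - alpha) * ema
    return ema


example_scores = [0.9, 0.8, 0.3]
ema_high_alpha = ema_aggregate(example_scores, alpha=0.8)  # approximately 0.404
ema_fallback = ema_aggregate(example_scores)               # alpha = 0.5, approximately 0.575
```

With alpha = 0.8 the result sits close to the lowest score (0.3), while the smaller fallback alpha lets the higher scores pull the aggregate up.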
Softmin: A differentiable approximation to the minimum function, computed as:
softmin(s) = sum_k(s_k * softmax((1 - s) / temperature)_k)
A lower temperature makes this closer to the hard minimum, while a higher temperature makes it closer to the mean. The default temperature is 0.1.
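A minimal sketch of the softmin formula for one example's scores follows. The function name `softmin_aggregate` is hypothetical, and subtracting the maximum logit before exponentiating is a standard numerical-stability step that does not change the softmax weights.

```python
import math


def softmin_aggregate(scores, temperature=0.1):
    """Weighted average of scores, with softmax weights that favor low scores."""
    logits = [(1.0 - s) / temperature for s in scores]
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]  # stable softmax numerators
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(s * w for s, w in zip(scores, weights))


example_scores = [0.9, 0.8, 0.3]
near_min = softmin_aggregate(example_scores, temperature=0.01)     # close to min(scores)
near_mean = softmin_aggregate(example_scores, temperature=1000.0)  # close to the mean
default_t = softmin_aggregate(example_scores)                      # temperature = 0.1
```

Because lower scores always receive larger weights, the result lies between the minimum and the mean of the scores for any positive temperature.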
Cross-Validation for Predicted Probabilities
To obtain the out-of-sample predicted probabilities required for scoring, the module supports cross-validated prediction. Multi-label datasets are stratified by treating each unique combination of binary labels as a distinct "class" for stratification purposes, so that each fold's mix of label combinations resembles that of the full dataset as closely as possible.
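The stratification idea can be illustrated by mapping each row of the binary label matrix to an integer "combination id" and then spreading each id across folds. Both helpers below are hypothetical, and the round-robin assignment is a deliberately simple stand-in for a real stratified splitter (for example, the ids could be passed as the target to scikit-learn's `StratifiedKFold`).

```python
def label_combination_ids(Y):
    """Map each unique row of binary labels to an integer id (first-seen order)."""
    ids = {}
    return [ids.setdefault(tuple(row), len(ids)) for row in Y]


def round_robin_folds(strata, n_splits):
    """Assign fold indices so each stratum is spread across folds in turn."""
    seen = {}
    folds = []
    for stratum in strata:
        count = seen.get(stratum, 0)
        folds.append(count % n_splits)  # cycle through folds within each stratum
        seen[stratum] = count + 1
    return folds


Y = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 0]]
strata = label_combination_ids(Y)
folds = round_robin_folds(strata, n_splits=2)
```

Note that a label combination with fewer members than folds, such as `[1, 1]` here, cannot appear in every fold, which is a practical limit of any stratification scheme on rare combinations.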