Principle:Cleanlab Cleanlab Multilabel Label Quality Scoring

Knowledge Sources	Cleanlab
Domains	Multi-Label Classification, Data Quality, Confident Learning
Last Updated	2026-02-09 00:00 GMT

Overview

A scoring framework that quantifies the quality of multi-label annotations in two steps: it decomposes the problem into independent binary quality assessments, one per class, and then aggregates the per-class scores into a single example-level quality measure.

Description

In multi-label classification, each example can belong to zero, one, or many classes simultaneously, making label quality assessment more complex than in standard multi-class settings. A label error can take multiple forms: a class that should be present may be missing, a class that should be absent may be incorrectly included, or both types of errors may co-occur for different classes within the same example.

Multilabel label quality scoring addresses this by treating the multi-label problem as K independent binary classification problems (one per class) using a one-vs-rest decomposition. For each class, the model's predicted probability and the given binary annotation (present or absent) are used to compute a quality score using established binary classification scoring methods. These K per-class scores are then combined into a single overall score using an aggregation function that emphasizes the worst individual scores, since a single incorrect class annotation renders the entire multi-label annotation problematic.

Usage

This scoring principle is the right choice when:

Each example in your dataset can have multiple class labels simultaneously.
Model-predicted probabilities do not necessarily sum to 1 across classes.
You need a single numeric score per example to rank and prioritize data for review.
You want flexibility in choosing both the per-class scoring method and the aggregation strategy.

Theoretical Basis

One-vs-Rest Decomposition

The multi-label scoring problem for K classes is decomposed into K independent binary classification problems. For class k and example i:

The binary label is y_ik = 1 if class k is in the label set for example i, else y_ik = 0.
The predicted probability is p_ik, the model's predicted probability that class k applies to example i.
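
The decomposition can be sketched in a few lines of plain Python. The helper name `binarize_labels` and the list-of-lists label format are assumptions made for illustration, not names taken from the Cleanlab source.

```python
from typing import List


def binarize_labels(labels: List[List[int]], num_classes: int) -> List[List[int]]:
    """Convert per-example label sets into an N x K matrix of 0/1 indicators."""
    matrix = []
    for label_set in labels:
        present = set(label_set)
        # y_ik = 1 if class k is in the label set of example i, else 0.
        matrix.append([1 if k in present else 0 for k in range(num_classes)])
    return matrix


# Three examples over K = 3 classes; the second example has no labels at all.
labels = [[0, 2], [], [1]]
y = binarize_labels(labels, num_classes=3)
print(y)  # [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
```

Each column of the resulting matrix is the label vector of one binary problem, and an example with an empty label set simply becomes a row of zeros.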

Per-Class Scoring Methods

Three scoring methods are available for computing the quality score s_ik for class k on example i:

Self-Confidence: The model's predicted probability of the given label. For a binary setting: s_ik = p_ik if y_ik = 1, or s_ik = 1 - p_ik if y_ik = 0. This directly measures how confident the model is that the given annotation is correct.

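Normalized Margin: The difference between the model's predicted probability of the given label and the maximum probability of any other label, normalized to [0, 1]. This captures not just confidence in the given label but also how much more confident the model is in it than in the alternatives. In a binary one-vs-rest problem the only alternative is the complementary label, so if the margin is rescaled linearly from [-1, 1] to [0, 1], the normalized margin reduces to the self-confidence score: ((p - (1 - p)) + 1) / 2 = p.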

Confidence-Weighted Entropy: Combines the self-confidence score with the entropy of the predicted probability distribution, penalizing examples where the model is highly uncertain even if the top prediction agrees with the given label.
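
As a minimal sketch of per-class scoring, the self-confidence rule stated above can be written directly from its definition. The function names `self_confidence` and `per_class_scores` are hypothetical, and the sample probabilities are made up for illustration.

```python
from typing import List


def self_confidence(y_ik: int, p_ik: float) -> float:
    """Probability the model assigns to the given binary annotation."""
    # s_ik = p_ik if the class is annotated as present, else 1 - p_ik.
    return p_ik if y_ik == 1 else 1.0 - p_ik


def per_class_scores(y_row: List[int], p_row: List[float]) -> List[float]:
    """Self-confidence score s_ik for every class k of one example."""
    return [self_confidence(y, p) for y, p in zip(y_row, p_row)]


y_row = [1, 0, 1]        # given annotation: classes 0 and 2 present, class 1 absent
p_row = [0.9, 0.2, 0.1]  # model's predicted probability that each class applies
scores = per_class_scores(y_row, p_row)

# Class 2 is annotated as present but the model assigns it only 0.1,
# so it receives the lowest score and is the most suspicious annotation.
worst_class = min(range(len(scores)), key=lambda k: scores[k])
```

Here the scores come out at roughly 0.9, 0.8, and 0.1, so `worst_class` is 2.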

Aggregation Strategies

Once the K per-class scores have been computed for an example, they must be aggregated into a single example-level score. The aggregation must be sensitive to poor individual class scores, because even one incorrect class annotation constitutes a label issue:

Exponential Moving Average (default, alpha=0.8): Sorts the per-class scores and computes a weighted average whose weights grow exponentially toward the lowest scores. With alpha=0.8, the worst class annotation dominates the result, while the remaining scores still contribute a small share.

Softmin: A differentiable approximation of the minimum function that computes a weighted average where weights are determined by a softmax over negative scores. This produces scores close to the minimum per-class score while being smoother, and less sensitive to a single outlier, than the hard minimum.

Custom Callable: Users can provide their own aggregation function for domain-specific needs.
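
The two built-in strategies can be sketched as follows. This is an illustrative implementation under stated assumptions, not the library's code: the recursive EMA form, the descending sort order, and the softmin `temperature` parameter with its value of 0.1 are all assumptions chosen to match the descriptions above.

```python
import math
from typing import List


def exponential_moving_average(scores: List[float], alpha: float = 0.8) -> float:
    """EMA over scores sorted from best to worst, so the worst score weighs most."""
    ordered = sorted(scores, reverse=True)
    ema = ordered[0]
    for s in ordered[1:]:
        # Each step keeps (1 - alpha) of the running value and adds alpha of the
        # next, lower score; the last (lowest) score therefore gets weight alpha.
        ema = alpha * s + (1.0 - alpha) * ema
    return ema


def softmin(scores: List[float], temperature: float = 0.1) -> float:
    """Weighted average of the scores with softmax weights over the negated scores."""
    logits = [-s / temperature for s in scores]
    shift = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


scores = [0.9, 0.8, 0.1]  # per-class scores of one example (made-up values)
mean_score = sum(scores) / len(scores)
ema_score = exponential_moving_average(scores)
softmin_score = softmin(scores)
print(f"mean={mean_score:.3f} ema={ema_score:.3f} softmin={softmin_score:.3f}")
```

Both aggregates land well below the plain mean and close to the lowest per-class score. In this sketch a custom aggregator is just any function with the same shape, for example the built-in `min`.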

Why Aggregation Matters

Simple aggregation methods like taking the mean would dilute the signal from a single mislabeled class among many correctly labeled classes. For example, if an example has 10 classes and 9 are correctly annotated (score near 1.0) but one is incorrect (score near 0.0), the mean score would be approximately 0.9, failing to flag this example as problematic. The exponential moving average and softmin methods ensure that the overall score is pulled down significantly by even a single low per-class score.
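
The 10-class example can be worked through explicitly, taking the nine good scores as exactly 1.0 and the bad one as exactly 0.0. The recursion below is one common form of an exponential moving average, applied to scores sorted from highest to lowest; it is assumed here for illustration, consistent with the description above, rather than quoted from the Cleanlab source.

```latex
\begin{align*}
\text{mean} &= \tfrac{1}{10}\,(9 \cdot 1.0 + 1 \cdot 0.0) = 0.9 \\[4pt]
e_1 &= s_{(1)}, \qquad
e_t = \alpha\, s_{(t)} + (1 - \alpha)\, e_{t-1}, \qquad
s_{(1)} \ge s_{(2)} \ge \dots \ge s_{(10)} \\[4pt]
e_1 &= e_2 = \dots = e_9 = 1.0 \\
e_{10} &= 0.8 \cdot 0.0 + 0.2 \cdot 1.0 = 0.2
\end{align*}
```

Under this form, with alpha = 0.8, the single bad class pulls the example-level score from 0.9 down to 0.2, which is low enough for the example to rank near the top of a review queue.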

Related Pages

Implementation:Cleanlab_Cleanlab_Multilabel_Get_Label_Quality_Scores
