Principle:Cleanlab Cleanlab Multilabel Quality Scoring

From Leeroopedia


Knowledge Sources
Domains: Data Quality, Multi-label Classification, Machine Learning
Last Updated: 2026-02-09 00:00 GMT

Overview

Multilabel quality scoring is the process of assessing the correctness of each example's labels in a multi-label classification setting by decomposing the problem into independent binary tasks and then aggregating per-class quality scores into a single overall score.

Description

In multi-label classification, each example can belong to zero or more classes simultaneously. Detecting label errors in this setting is more complex than in standard multi-class classification because errors can occur independently in any subset of the K class labels for a given example. Rather than developing entirely new scoring methods, multilabel quality scoring leverages a decompose-then-aggregate strategy:

Decomposition: The K-class multi-label problem is treated as K independent binary classification problems. For each class k, the binary label (0 or 1) and the predicted probability for that class are extracted. The predicted probability is converted to a two-column format [1 - p, p] to create a standard binary classification probability matrix. An existing binary label quality scoring method (e.g., self-confidence, normalized margin, or confidence-weighted entropy) is then applied to each binary problem independently.

Aggregation: The K per-class quality scores for each example are combined into a single overall label quality score using an aggregation function. The aggregation must be sensitive to the worst-scoring class (since a single mislabeled class constitutes a label error) while remaining robust to the distribution of scores across classes.

The result is a single score per example in [0, 1], where lower scores indicate examples more likely to have at least one incorrect label.

Usage

Multilabel quality scoring is appropriate whenever you have a multi-label classification dataset (where labels are represented as binary indicator matrices) and want to identify examples that may have one or more incorrect class assignments. It requires out-of-sample predicted probabilities from a trained multi-label classifier.

Theoretical Basis

Binary Decomposition

Given multi-label labels Y of shape (N, K) and predicted probabilities P of shape (N, K), for each class k:

  • Binary labels: y_k = Y[:, k] (values in {0, 1})
  • Binary pred probs: p_k = column_stack([1 - P[:, k], P[:, k]]) (shape N x 2)
  • Per-class score: s_k = scorer(y_k, p_k) (shape N)

This yields a score matrix S of shape (N, K) where S[i][k] is the quality score for example i's label on class k.
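The decomposition can be sketched in a few lines. The following is a minimal illustration in plain Python rather than the library's implementation; the names `decompose_and_score` and `self_confidence` are hypothetical, and self-confidence is used as the binary scorer only because it is the simplest of the available methods.

```python
def self_confidence(binary_labels, binary_pred_probs):
    """Probability the model assigns to each example's given binary label."""
    return [probs[label] for label, probs in zip(binary_labels, binary_pred_probs)]


def decompose_and_score(Y, P, scorer=self_confidence):
    """Build the N x K score matrix S from N x K labels Y and probabilities P."""
    n_examples, n_classes = len(Y), len(Y[0])
    S = [[0.0] * n_classes for _ in range(n_examples)]
    for k in range(n_classes):
        y_k = [row[k] for row in Y]                  # binary labels for class k
        p_k = [[1.0 - row[k], row[k]] for row in P]  # N x 2 matrix with rows [1 - p, p]
        s_k = scorer(y_k, p_k)                       # N per-class scores
        for i, score in enumerate(s_k):
            S[i][k] = score
    return S


# Two examples, three classes (illustrative values only).
Y = [[1, 0, 1], [0, 1, 0]]
P = [[0.9, 0.2, 0.3], [0.1, 0.8, 0.6]]
S = decompose_and_score(Y, P)
```

In this toy input, the first example is labeled positive for the third class although the model assigns it only probability 0.3, so that entry of S is the lowest in its row.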

Scoring Methods

Three binary scoring methods are available:

Self-Confidence: The predicted probability assigned to the given label. For a binary problem with label y and predicted probability p, self_confidence = p if y = 1, or (1 - p) if y = 0.

Normalized Margin: The difference between the probability assigned to the given label and the maximum probability assigned to any other class, normalized to [0, 1].

Confidence-Weighted Entropy: An entropy-based measure that weights the predictive uncertainty by the model's confidence in the given label.
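A short sketch of the first two methods for a single binary problem follows. The function names are illustrative, and the margin is rescaled with (margin + 1) / 2, which is an assumption about how the normalization to [0, 1] is done. Under that assumption the two scores coincide in the binary case, because the only other class has probability 1 minus that of the given label.

```python
def self_confidence(label, pred_probs):
    """Probability assigned to the given label."""
    return pred_probs[label]


def normalized_margin(label, pred_probs):
    """Margin between the given label and the best other class, rescaled to [0, 1].

    Assumes the rescaling (margin + 1) / 2.
    """
    given = pred_probs[label]
    best_other = max(p for j, p in enumerate(pred_probs) if j != label)
    return (given - best_other + 1.0) / 2.0


p = 0.3                    # predicted probability of the positive class
pred_probs = [1.0 - p, p]  # two-column format [1 - p, p]

for label in (0, 1):
    sc = self_confidence(label, pred_probs)
    nm = normalized_margin(label, pred_probs)
    print(f"label={label}: self_confidence={sc:.3f}, normalized_margin={nm:.3f}")
```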

Aggregation Methods

Exponential Moving Average (EMA): The per-class scores are sorted in descending order, and the EMA is computed recursively:

EMA_1 = s_1 (largest score)

EMA_t = alpha * s_t + (1 - alpha) * EMA_{t-1}, for t = 2, ..., K

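where alpha is a forgetting factor in [0, 1]. A higher alpha places more weight on the lower-scoring classes, which enter the recursion last, making the aggregate more sensitive to individual class errors: alpha = 1 returns the minimum score and alpha = 0 returns the maximum. If alpha is not specified, the EMA function falls back to 2/(K+1); the default aggregator sets alpha = 0.8. Either value can be overridden.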
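The recursion can be written directly for one example's vector of per-class scores. This is a minimal sketch, not the library code; the function name and the `alpha=None` convention for the 2/(K+1) fallback are chosen here for illustration.

```python
def exponential_moving_average(scores, alpha=None):
    """Aggregate K per-class scores with an EMA over scores sorted in descending order."""
    k = len(scores)
    if alpha is None:
        alpha = 2.0 / (k + 1)               # fallback when no alpha is supplied
    ordered = sorted(scores, reverse=True)  # s_1 is the largest score
    ema = ordered[0]                        # EMA_1 = s_1
    for s_t in ordered[1:]:                 # t = 2, ..., K
        ema = alpha * s_t + (1.0 - alpha) * ema
    return ema


scores = [0.9, 0.8, 0.3]
print(exponential_moving_average(scores, alpha=0.8))  # close to the minimum, 0.3
print(exponential_moving_average(scores))             # alpha = 2 / (3 + 1) = 0.5
```

With alpha = 0.8 the steps are EMA_2 = 0.8 * 0.8 + 0.2 * 0.9 = 0.82 and EMA_3 = 0.8 * 0.3 + 0.2 * 0.82 = 0.404, so the single low score dominates the aggregate.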

Softmin: A differentiable approximation to the minimum function, computed as:

softmin(s) = sum_k(s_k * softmax((1 - s) / temperature)_k)

A lower temperature makes this closer to the hard minimum, while a higher temperature makes it closer to the mean. The default temperature is 0.1.
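The formula translates into a short function. The sketch below operates on one example's scores and uses only the standard library; subtracting the largest logit before exponentiating is a common numerical-stability step added here, and it does not change the resulting weights.

```python
import math


def softmin(scores, temperature=0.1):
    """Weighted average of scores, with softmax weights that favor low scores."""
    logits = [(1.0 - s) / temperature for s in scores]
    shift = max(logits)                          # stabilizes exp for small temperatures
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax((1 - s) / temperature)
    return sum(s * w for s, w in zip(scores, weights))


scores = [0.9, 0.8, 0.3]
print(softmin(scores))                      # default temperature, near the minimum
print(softmin(scores, temperature=0.001))   # approaches min(scores)
print(softmin(scores, temperature=1000.0))  # approaches the mean of scores
```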

Cross-Validation for Predicted Probabilities

To obtain the out-of-sample predicted probabilities that scoring requires, the module supports cross-validated prediction. Multi-label datasets are stratified by treating each unique combination of binary labels as a distinct "class" for stratification purposes, so that label combinations, including rare ones, are spread as evenly as possible across folds.
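The label-combination trick can be illustrated by mapping each row of the binary label matrix to an integer identifier. This is a sketch with a hypothetical helper name, not the module's own code; the resulting identifiers could then be passed as the stratification target to a stratified splitter such as scikit-learn's StratifiedKFold.

```python
from collections import Counter


def label_combination_ids(Y):
    """Map each row of a binary label matrix to an id shared by identical rows.

    Ids are assigned in order of first appearance.
    """
    ids = {}
    return [ids.setdefault(tuple(row), len(ids)) for row in Y]


Y = [
    [1, 0],
    [0, 1],
    [1, 0],
    [1, 1],
    [0, 1],
]
strata = label_combination_ids(Y)
print(strata)           # one id per example
print(Counter(strata))  # how many examples share each label combination
```

Combinations that occur fewer times than the number of folds cannot appear in every fold, so they may need special handling before splitting.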
