Principle:Cleanlab Cleanlab Multilabel Dataset Health Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Quality, Multi-Label Classification |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Multilabel dataset health analysis assesses the overall quality of annotations in a multi-label classification dataset by decomposing the problem into independent per-class binary assessments and aggregating the results into interpretable quality metrics.
Description
In multi-label classification, each example can belong to zero or more classes simultaneously. This makes label error detection fundamentally different from multi-class classification, where each example belongs to exactly one class. The key insight behind multilabel dataset health analysis is the one-vs-rest decomposition: by treating each of the K classes as an independent binary classification problem ("belongs to class k" vs. "does not belong to class k"), existing confident learning techniques for binary classification can be applied to each class independently.
This decomposition enables several levels of analysis:
- Per-class issue counting: For each class, count how many examples are labeled as belonging to the class but likely should not be (false positives), and how many are missing the class label but likely should have it (false negatives). This yields a directional view of the most common mislabeling patterns.
- Per-class quality scoring: For each class, compute the label noise rate (the number of examples given the class label that are flagged as likely not belonging to it, as a fraction of all examples) and the inverse label noise rate (the number of examples missing the class label that are flagged as likely needing it, as a fraction of all examples). A label quality score of 1 minus the label noise rate summarizes how reliable the given annotations for each class are.
- Overall health scoring: Aggregate across all examples by computing the fraction of examples that have no label issues detected for any class, yielding a single score between 0 and 1.
This hierarchical approach allows practitioners to quickly identify whether a dataset has systemic annotation problems, pinpoint which classes are most affected, and understand the directionality of errors (over-annotation vs. under-annotation).
Usage
Use multilabel dataset health analysis when you have a multi-label classification dataset and want to:
- Assess whether the dataset has sufficient label quality for model training.
- Identify which classes have the most annotation errors and need re-annotation.
- Understand whether specific classes are being systematically over-annotated or under-annotated.
- Generate a concise health report to share with data annotation teams.
This analysis requires model-predicted class probabilities (preferably obtained via cross-validation for out-of-sample estimates) along with the given labels.
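As a rough illustration of these inputs, the sketch below validates a small toy pair. It assumes the labels are a list of lists of class indices (the format suggested by the `labels[i]` notation used in the decomposition formula on this page) and that pred_probs is an N x K table of per-class probabilities. The helper name `check_inputs` is hypothetical and not part of any library.

```python
def check_inputs(labels, pred_probs):
    """Validate toy multi-label inputs and return (N, K).

    Hypothetical helper for illustration only.
    """
    n = len(pred_probs)
    if len(labels) != n:
        raise ValueError("labels and pred_probs must have the same number of examples")
    k = len(pred_probs[0]) if n else 0
    for given, row in zip(labels, pred_probs):
        if len(row) != k:
            raise ValueError("every pred_probs row needs one probability per class")
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError("probabilities must lie in [0, 1]")
        if any(c < 0 or c >= k for c in given):
            raise ValueError("label index out of range")
    return n, k


# Four examples, three classes. An example may have zero, one, or several labels,
# and a row of pred_probs need not sum to 1 because classes are not exclusive.
labels = [[0], [0, 2], [], [1, 2]]
pred_probs = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.7],
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.6],
]
shape = check_inputs(labels, pred_probs)
print(shape)  # (4, 3)
```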
Theoretical Basis
The analysis rests on three core principles:
1. One-vs-Rest Decomposition:
Given K classes, the multi-label problem is decomposed into K independent binary classification problems. For each class k, a binary label vector is constructed:
y_k[i] = 1 if class k is in labels[i], else 0
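The corresponding binary predicted probabilities are taken from the k-th column of pred_probs: entry i is the model's probability that example i belongs to class k, and its complement is the probability that it does not.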
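The decomposition for a single class can be sketched in plain Python as follows. The function name `one_vs_rest` and the two-column layout `[1 - p, p]` are illustrative assumptions, and the toy data is made up.

```python
def one_vs_rest(labels, pred_probs, k):
    """Build the binary subproblem for class k.

    Returns (y_k, probs_k) where y_k[i] is 1 if k is in labels[i], else 0,
    and probs_k[i] is [P(not class k), P(class k)] for example i.
    """
    y_k = [1 if k in given else 0 for given in labels]
    probs_k = [[1.0 - row[k], row[k]] for row in pred_probs]
    return y_k, probs_k


# Toy data: four examples, three classes (assumed list-of-lists label format).
labels = [[0], [0, 2], [], [1, 2]]
pred_probs = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.7],
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.6],
]

y_2, probs_2 = one_vs_rest(labels, pred_probs, k=2)
print(y_2)  # [0, 1, 0, 1]
```

Repeating this for k = 0, ..., K - 1 yields the K binary subproblems that the remaining steps operate on.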
2. Confident Learning for Issue Detection:
For each binary subproblem, confident learning methods (such as prune-by-noise-rate) identify examples where the given binary label disagrees with the model's confident prediction. The confident joint, a (2 x 2) matrix for each class, captures the estimated counts of (given label, true label) pairs.
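To make the idea concrete, here is a deliberately simplified sketch for one binary subproblem. It is not prune-by-noise-rate: it uses per-class thresholds (the mean predicted probability of each class among examples given that label), counts each example into a 2 x 2 table indexed by (given label, confidently predicted label), and flags the off-diagonal entries. The function name `flag_binary_issues` and the toy data are assumptions for illustration.

```python
def flag_binary_issues(y, p):
    """Simplified confident-learning sketch for one binary subproblem.

    y: given binary labels (0 or 1); p: predicted probability of class 1.
    Returns (joint, flagged), where joint[given][predicted] holds counts and
    flagged lists the indices whose confident prediction disagrees with y.
    """
    pos = [p_i for y_i, p_i in zip(y, p) if y_i == 1]
    neg = [1.0 - p_i for y_i, p_i in zip(y, p) if y_i == 0]
    joint = [[0, 0], [0, 0]]
    if not pos or not neg:
        return joint, []  # thresholds cannot be estimated for an empty class
    t1 = sum(pos) / len(pos)  # threshold for confidently predicting class 1
    t0 = sum(neg) / len(neg)  # threshold for confidently predicting class 0

    flagged = []
    for i, (y_i, p_i) in enumerate(zip(y, p)):
        confident = []
        if p_i >= t1:
            confident.append((p_i, 1))
        if 1.0 - p_i >= t0:
            confident.append((1.0 - p_i, 0))
        if not confident:
            continue  # the model is not confident either way; skip the example
        _, predicted = max(confident)
        joint[y_i][predicted] += 1
        if predicted != y_i:
            flagged.append(i)
    return joint, flagged


# Toy binary subproblem: examples 3 and 7 look mislabeled.
y = [1, 1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.8, 0.85, 0.1, 0.1, 0.2, 0.15, 0.95]
joint, flagged = flag_binary_issues(y, p)
print(joint)    # [[3, 1], [1, 3]]
print(flagged)  # [3, 7]
```

Here example 3 is a likely false positive (given the label, confidently predicted not to have it) and example 7 a likely false negative.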
3. Aggregation into Quality Metrics:
Per-class metrics are computed as follows:
Label Noise(k) = (number of false positives for class k) / (total number of examples)
Inverse Label Noise(k) = (number of false negatives for class k) / (total number of examples)
Label Quality Score(k) = 1 - Label Noise(k)
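A minimal sketch of these per-class formulas follows, assuming the flagged indices for class k are already available from the issue-detection step. The function name `class_metrics` and the dictionary keys are hypothetical.

```python
def class_metrics(y_k, flagged):
    """Compute per-class metrics from binary labels and flagged indices.

    y_k: given binary labels for class k; flagged: indices with a label issue.
    A flagged example with y_k == 1 counts as a false positive (has the label,
    likely should not); one with y_k == 0 counts as a false negative.
    """
    n = len(y_k)
    false_pos = sum(1 for i in flagged if y_k[i] == 1)
    false_neg = sum(1 for i in flagged if y_k[i] == 0)
    label_noise = false_pos / n
    return {
        "label_noise": label_noise,
        "inverse_label_noise": false_neg / n,
        "label_quality_score": 1.0 - label_noise,
    }


# Toy class with 8 examples, one false positive (index 3) and one false negative (index 7).
y_k = [1, 1, 1, 1, 0, 0, 0, 0]
metrics = class_metrics(y_k, flagged=[3, 7])
print(metrics)
# {'label_noise': 0.125, 'inverse_label_noise': 0.125, 'label_quality_score': 0.875}
```

Note that both rates share the same denominator (all examples), so the label quality score reflects only the false-positive side.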
The overall health score aggregates at the example level:
Overall Health Score = 1 - (number of examples with any label issue) / N
where N is the total number of examples and an example is flagged if any of its K binary labels is estimated to be incorrect.
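The example-level aggregation can be sketched as below, assuming the per-class flagged indices have already been computed. The function name `overall_health` and the toy issue sets are illustrative only.

```python
def overall_health(issues_per_class, n):
    """Return 1 - (number of examples flagged for any class) / n.

    issues_per_class: mapping from class index to the set of flagged example indices.
    """
    flagged_any = set()
    for flagged in issues_per_class.values():
        flagged_any |= set(flagged)
    return 1.0 - len(flagged_any) / n


# Toy result for N = 8 examples and K = 3 classes.
# Example 3 is flagged for two classes but counts only once.
issues_per_class = {0: {3, 7}, 1: {3}, 2: set()}
score = overall_health(issues_per_class, n=8)
print(score)  # 0.75
```

Because an example with several flagged classes is counted once, the score measures the share of fully clean examples, not the share of clean individual labels.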