Principle: Cleanlab Dataset Health Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Comprehensive dataset-level diagnostic that summarizes overall label quality, identifies problematic classes, and detects potentially overlapping class definitions.
Description
Dataset health analysis provides a bird's-eye view of label quality across the entire dataset, going beyond individual example-level label issue detection. It computes an overall health score representing the estimated fraction of correctly labeled examples, ranks all classes by their label quality (highlighting which classes suffer from the most labeling errors), and identifies pairs of classes that are frequently confused with each other (potentially indicating overlapping or ambiguous class definitions).
This holistic perspective helps dataset curators understand systemic label quality issues rather than just individual errors. For example, a dataset might have an overall health score of 0.85 (15% of labels estimated to be wrong), with class "cat" being the worst-quality class due to frequent confusion with class "dog", suggesting these two classes need clearer annotation guidelines.
Usage
Use when you want a high-level summary of your dataset's label quality to understand the overall severity and distribution of label issues. This is typically used as a first step in a dataset quality audit before diving into individual label issues. It answers questions like: "How noisy is my dataset overall?", "Which classes are the most problematic?", and "Are there class pairs that annotators cannot reliably distinguish?"
Theoretical Basis
From the estimated joint distribution J of (given_label, true_label) — a K x K matrix whose rows index the given label and whose columns index the estimated true label — the following dataset-level metrics are computed:
Overall health score:
health_score = trace(J) / sum(J)
This is the fraction of examples estimated to be correctly labeled (where given_label == true_label), i.e., the sum of diagonal entries of J divided by the total count.
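A minimal sketch of this computation, assuming J is given as a square matrix of counts (the matrix values below are hypothetical, chosen only for illustration):

```python
def health_score(J):
    """Overall health score: trace(J) / sum(J).

    J is the estimated joint of (given_label, true_label) as a square
    matrix of counts (or probabilities); rows index the given label,
    columns the estimated true label.
    """
    total = sum(sum(row) for row in J)          # sum of all entries
    diagonal = sum(J[k][k] for k in range(len(J)))  # correctly labeled mass
    return diagonal / total

# Hypothetical 3-class joint (counts): 270 of 300 examples on-diagonal.
J = [[70, 5, 5],
     [3, 80, 7],
     [2, 8, 120]]
print(health_score(J))  # → 0.9
```

Because the score only depends on the ratio of diagonal to total mass, the same function works whether J holds raw counts or probabilities that sum to 1.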
Per-class label noise:
label_noise[k] = 1 - J[k][k] / sum(J[k, :])
For each class k, this is the fraction of examples labeled as k that are estimated to actually belong to a different class.
Per-class inverse label noise:
inverse_label_noise[k] = 1 - J[k][k] / sum(J[:, k])
For each class k, this is the fraction of examples whose true label is k but that received a different given label.
Per-class label quality score:
quality_score[k] = 1 - label_noise[k] = J[k][k] / sum(J[k, :])
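The three per-class metrics above can be computed together from row and column sums of J. A sketch, using the same hypothetical count matrix shape as before:

```python
def per_class_metrics(J):
    """Return (label_noise, inverse_label_noise, quality_score) per class.

    Rows of J index the given label, columns the estimated true label.
    """
    K = len(J)
    row_sums = [sum(J[k]) for k in range(K)]                    # examples labeled k
    col_sums = [sum(J[i][k] for i in range(K)) for k in range(K)]  # examples truly k
    label_noise = [1 - J[k][k] / row_sums[k] for k in range(K)]
    inverse_label_noise = [1 - J[k][k] / col_sums[k] for k in range(K)]
    quality_score = [1 - noise for noise in label_noise]
    return label_noise, inverse_label_noise, quality_score

# Hypothetical 3-class joint (counts).
J = [[70, 5, 5],
     [3, 80, 7],
     [2, 8, 120]]
noise, inv_noise, quality = per_class_metrics(J)
# Class 0: 10 of the 80 examples labeled 0 are estimated to belong elsewhere.
print(noise[0])    # → 0.125
print(quality[0])  # → 0.875
```

Ranking classes by ascending quality_score surfaces the worst-labeled classes first, which is the per-class ranking described above.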
Overlapping class pairs:
For each pair of classes (i, j), the degree of overlap is estimated from the off-diagonal entries J[i][j] + J[j][i], normalized by the total counts of both classes. Class pairs with high overlap are flagged as potentially having ambiguous or overlapping definitions.
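A sketch of this pairwise scoring, normalizing each pair's off-diagonal mass by the combined given-label counts of the two classes (one plausible reading of the normalization described above; the matrix values are hypothetical):

```python
from itertools import combinations

def overlapping_classes(J):
    """Score each class pair (i, j) by (J[i][j] + J[j][i]) normalized by
    the total counts of both classes, highest overlap first."""
    K = len(J)
    row_sums = [sum(J[k]) for k in range(K)]  # examples given label k
    scores = {}
    for i, j in combinations(range(K), 2):
        scores[(i, j)] = (J[i][j] + J[j][i]) / (row_sums[i] + row_sums[j])
    return sorted(scores.items(), key=lambda item: -item[1])

# Hypothetical 3-class joint (counts).
J = [[70, 5, 5],
     [3, 80, 7],
     [2, 8, 120]]
for pair, score in overlapping_classes(J):
    print(pair, round(score, 3))
# Pair (1, 2) ranks highest: classes 1 and 2 are most often confused,
# suggesting their definitions may overlap.
```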