Heuristic: Cleanlab Label Quality Scoring Method Selection
| Knowledge Sources | |
|---|---|
| Domains | Label_Quality, Confident_Learning |
| Last Updated | 2026-02-09 19:30 GMT |
Overview
Guide for choosing among the three label quality scoring methods (`self_confidence`, `normalized_margin`, `confidence_weighted_entropy`) based on the type of label error you want to detect.
Description
Cleanlab provides three distinct methods for scoring how likely each example's label is correct. Each method captures a different aspect of the model's predicted probability distribution, making them better suited for different error types. The default is `self_confidence`, but switching to `normalized_margin` can be more effective for class-conditional errors.
Usage
Use this heuristic when configuring `get_label_quality_scores` or `find_label_issues` with the `method` or `return_indices_ranked_by` parameters. Choose the method based on whether you are looking for class-conditional mislabeling or out-of-distribution anomalies.
The Insight (Rule of Thumb)
- Action: Select the scoring method based on error type:
  - Use `normalized_margin` for class-conditional label errors (an example labeled as class A that should be class B)
  - Use `self_confidence` for alternative label issues including out-of-distribution examples, ambiguous examples described by 2+ classes, and anomalous outliers
  - Use `confidence_weighted_entropy` for a combined signal that weights prediction uncertainty by confidence
- Value: Default is `self_confidence` (the model's predicted probability for the given label, i.e., `P[k]`)
- Trade-off: `normalized_margin` (= `P[k] - max(P[k' != k])`) ignores the absolute confidence level and focuses on the gap between top-two classes. `self_confidence` (= `P[k]`) captures absolute certainty but may miss cases where the model is slightly more confident in another class. `confidence_weighted_entropy` (= `entropy(P) / self_confidence`) penalizes high-entropy predictions.
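The three formulas in the trade-off above can be sketched in numpy. This is a minimal illustration of what each method computes, not the cleanlab implementation, which adds clipping, rescaling, and temperature adjustments on top of these raw quantities:

```python
import numpy as np

def label_quality_scores(labels, pred_probs, method="self_confidence"):
    """Score each example's given label (sketch of the three formulas).

    For self_confidence and normalized_margin, lower = more suspicious.
    For confidence_weighted_entropy, the raw ratio below is higher for
    more suspicious labels; cleanlab rescales it so lower = more suspicious.
    """
    n = len(labels)
    self_conf = pred_probs[np.arange(n), labels]  # P[k]
    if method == "self_confidence":
        return self_conf
    if method == "normalized_margin":
        # P[k] - max over other classes, shifted from [-1, 1] into [0, 1]
        masked = pred_probs.copy()
        masked[np.arange(n), labels] = -np.inf
        return (self_conf - masked.max(axis=1) + 1) / 2
    if method == "confidence_weighted_entropy":
        # entropy(P) / P[k]: high entropy or low confidence both raise the ratio
        entropy = -np.sum(pred_probs * np.log(pred_probs + 1e-12), axis=1)
        return entropy / np.clip(self_conf, 1e-12, None)
    raise ValueError(f"unknown method: {method}")
```

On a two-example batch where the second row's label disagrees with the model, all three methods rank that row as more suspicious, but for different reasons: low `P[k]`, negative margin, and a high entropy-to-confidence ratio, respectively.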
Reasoning
The distinction between these methods reflects a fundamental trade-off in label error detection:
`normalized_margin` computes the difference between the probability of the given label and the highest probability of any other class. This is sensitive to cases where the model "almost" prefers a different class, which is the hallmark of a mislabeled example that belongs to a specific other class.
`self_confidence` simply uses the model's probability for the given label. Examples where the model assigns low probability to the given label are flagged. This catches not only mislabeling but also outliers and ambiguous examples where no class fits well.
The choice of `self_confidence` as the default is deliberate: it is the most general-purpose method that catches the broadest range of issues.
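To make the distinction concrete, here is a toy three-class comparison in plain numpy (illustrative probabilities only, not cleanlab output): one clean example, one class-conditional error where the model strongly prefers another class, and one out-of-distribution example where no class fits well:

```python
import numpy as np

# Rows: clean, class-conditional error, out-of-distribution (3 classes).
pred_probs = np.array([
    [0.90, 0.05, 0.05],  # clean: model agrees with given label 0
    [0.10, 0.85, 0.05],  # mislabeled: model strongly prefers class 1
    [0.34, 0.33, 0.33],  # OOD: near-uniform, no class fits
])
labels = np.array([0, 0, 0])
idx = np.arange(len(labels))

self_confidence = pred_probs[idx, labels]                 # P[k]
other = pred_probs.copy()
other[idx, labels] = -np.inf
normalized_margin = self_confidence - other.max(axis=1)   # P[k] - max P[k' != k]

# normalized_margin isolates the class-conditional error (-0.75) far below
# the OOD row (0.01); self_confidence scores both rows as low (0.10 vs 0.34),
# so it flags the OOD example too.
```

The raw margin makes the class-conditional error stand out sharply, while `self_confidence` penalizes the OOD row nearly as much as the mislabeled one, matching the docstring guidance quoted below in Code Evidence.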
Code Evidence:
Method definitions from `cleanlab/rank.py:69-91`:

    method : {"self_confidence", "normalized_margin", "confidence_weighted_entropy"},
        default="self_confidence"
    - 'normalized_margin': P[k] - max_{k' != k}[ P[k'] ]
    - 'self_confidence': P[k]
    - 'confidence_weighted_entropy': entropy(P) / self_confidence

    The `normalized_margin` score works better for identifying class
    conditional label errors, i.e. examples for which another label
    in C is appropriate but the given label is not.

    The `self_confidence` score works better for identifying alternative
    label issues corresponding to bad examples that are: not from any
    of the classes in C, well-described by 2 or more labels in C,
    or generally just out-of-distribution (i.e. anomalous outliers).
Temperature search values for `confidence_weighted_entropy` from `cleanlab/rank.py:166`:

    log_loss_search_T_values: List[float] = [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 2e2]