Principle:Cleanlab Cleanlab Batched Label Issue Detection
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Scalability, Machine Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Batched label issue detection is a streaming, memory-efficient approach to identifying mislabeled examples in classification datasets that are too large to load into memory at once.
Description
Standard label issue detection methods in cleanlab require loading the entire predicted probability matrix and label array into memory simultaneously, which becomes infeasible for datasets with millions of examples or thousands of classes. Batched label issue detection addresses this by decomposing the detection pipeline into two sequential passes over the data, each processing mini-batches:
Pass 1 -- Threshold Estimation: Per-class confident thresholds are estimated incrementally. Each batch contributes to a running weighted average of thresholds, where the weight for each class is the number of examples of that class observed in the batch. This produces the same thresholds as would be computed on the full dataset in a single pass.
Pass 2 -- Scoring and Counting: Using the thresholds from Pass 1, each batch of examples is scored for label quality, and the running count of detected label issues is updated. An example is counted as a label issue when its predicted class differs from its given label and the predicted probability for that predicted class is at least the confident threshold for that class.
The final result closely approximates the standard non-batched approach, cleanlab.filter.find_label_issues called with filter_by="low_self_confidence" and return_indices_ranked_by="self_confidence".
Usage
Use batched label issue detection when your dataset contains hundreds of thousands to millions of examples and loading the full predicted probability matrix into memory is not feasible. It is also appropriate when your data is stored in memory-mapped formats (numpy mmap, Zarr, HDF5) or when you want to evaluate label quality in a streaming fashion without materializing the entire dataset.
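As a minimal sketch of the streaming access pattern described above, the helper below yields aligned slices of labels and predicted probabilities. The name iter_batches and the demo data are hypothetical and not part of cleanlab. Plain lists stand in for on-disk arrays here; the same slicing pattern should apply to any array-like object that supports len() and slice indexing.

```python
from typing import Iterator, Sequence, Tuple


def iter_batches(
    labels: Sequence, pred_probs: Sequence, batch_size: int
) -> Iterator[Tuple[Sequence, Sequence]]:
    """Yield aligned (labels, pred_probs) slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if len(labels) != len(pred_probs):
        raise ValueError("labels and pred_probs must have the same length")
    for start in range(0, len(labels), batch_size):
        stop = start + batch_size
        # With a memory-mapped or chunked array, a slice like this typically
        # reads only the requested window rather than the whole dataset.
        yield labels[start:stop], pred_probs[start:stop]


# Hypothetical demo data: 5 examples, 2 classes.
demo_labels = [0, 1, 1, 0, 1]
demo_pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
demo_batches = list(iter_batches(demo_labels, demo_pred_probs, batch_size=2))
```

Because the helper is a generator, it can be re-created for each of the two passes without holding more than one batch in memory at a time.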
Theoretical Basis
Incremental Threshold Estimation
The confident threshold for class k, denoted t_k, is the average predicted probability assigned to class k among examples labeled as class k. In the batched approach, this is maintained as a weighted running average:
t_k = (n_k_prev * t_k_prev + n_k_batch * t_k_batch) / (n_k_prev + n_k_batch)
where n_k_prev is the number of examples of class k seen so far, t_k_prev is the current threshold estimate, n_k_batch is the number of examples of class k in the current batch, and t_k_batch is the threshold computed from the current batch alone. After the update, n_k_prev is incremented by n_k_batch. This produces an exact (not approximate) threshold as long as every example is processed exactly once.
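The update rule can be sketched in a few lines of plain Python. This is an illustration of the formula only, not cleanlab's implementation; the class name RunningThresholds and the demo data are hypothetical.

```python
class RunningThresholds:
    """Per-class running mean of pred_probs[i][label_i], updated per batch."""

    def __init__(self, num_classes):
        self.counts = [0] * num_classes        # n_k_prev for each class k
        self.thresholds = [0.0] * num_classes  # t_k_prev for each class k

    def update(self, labels, pred_probs):
        num_classes = len(self.counts)
        batch_counts = [0] * num_classes
        batch_sums = [0.0] * num_classes
        for label, probs in zip(labels, pred_probs):
            batch_counts[label] += 1
            batch_sums[label] += probs[label]
        for k in range(num_classes):
            n_batch = batch_counts[k]
            if n_batch == 0:
                continue  # class k is absent from this batch: estimate unchanged
            t_batch = batch_sums[k] / n_batch
            n_prev = self.counts[k]
            self.thresholds[k] = (
                n_prev * self.thresholds[k] + n_batch * t_batch
            ) / (n_prev + n_batch)
            self.counts[k] = n_prev + n_batch


# Hypothetical demo data: 5 examples, 2 classes, processed in batches of 2.
demo_labels = [0, 1, 1, 0, 1]
demo_pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]

running = RunningThresholds(num_classes=2)
for start in range(0, len(demo_labels), 2):
    running.update(demo_labels[start:start + 2], demo_pred_probs[start:start + 2])
```

In this demo, the batched estimates agree with the full-data means up to floating-point rounding: 0.75 for class 0 (the mean of 0.9 and 0.6) and 2/3 for class 1 (the mean of 0.8, 0.3, and 0.9).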
Label Quality Scoring
Each example receives a label quality score between 0 and 1, where lower scores indicate a higher likelihood of being mislabeled. By default, the self-confidence score is used, which is the model's predicted probability for the given label, adjusted by the confident threshold.
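For illustration, the unadjusted self-confidence score is simply the predicted probability of the given label. The sketch below covers only that core quantity and deliberately leaves out the threshold adjustment mentioned above, so it should not be read as the library's exact scoring. The function names and demo data are hypothetical.

```python
def self_confidence_scores(labels, pred_probs):
    """Return pred_probs[i][labels[i]] for every example (lower = more suspect)."""
    return [probs[label] for label, probs in zip(labels, pred_probs)]


def rank_by_score(scores):
    """Return example indices ordered from most to least suspect."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical demo data: 5 examples, 2 classes.
demo_labels = [0, 1, 1, 0, 1]
demo_pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]

demo_scores = self_confidence_scores(demo_labels, demo_pred_probs)
demo_ranking = rank_by_score(demo_scores)
```

Here example 2 ranks first: it is labeled class 1, but the model assigns that class a probability of only 0.3.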
Issue Count Estimation
The number of label issues is estimated by counting examples that satisfy both conditions:
- The model's predicted class differs from the given label (pred_class != label).
- The predicted probability for the predicted class is at least the confident threshold for that class (pred_prob[pred_class] >= t[pred_class]).
An optional off-diagonal calibrated estimation method applies a per-class calibration factor based on the ratio of class counts to normalization terms, improving accuracy when class distributions are imbalanced.
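The two counting conditions can be sketched as follows. This covers only the basic count, not the optional calibrated estimate; the function name, thresholds, and demo data are hypothetical.

```python
def count_label_issues(labels, pred_probs, thresholds):
    """Count examples that are confidently predicted as a different class."""
    num_issues = 0
    for label, probs in zip(labels, pred_probs):
        # argmax over classes; ties resolve to the lowest class index
        pred_class = max(range(len(probs)), key=probs.__getitem__)
        if pred_class != label and probs[pred_class] >= thresholds[pred_class]:
            num_issues += 1
    return num_issues


# Hypothetical demo data: 4 examples, 2 classes, thresholds from a prior pass.
demo_labels = [0, 1, 1, 0]
demo_pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.45, 0.55]]
demo_thresholds = [0.75, 0.7]

# Accumulate the count batch by batch, as Pass 2 would.
running_total = 0
for start in range(0, len(demo_labels), 2):
    running_total += count_label_issues(
        demo_labels[start:start + 2],
        demo_pred_probs[start:start + 2],
        demo_thresholds,
    )
```

Example 2 is counted because it is labeled class 1 but predicted as class 0 with probability 0.8, which is at least the class 0 threshold of 0.75. Example 3 is also predicted as a different class, but its probability of 0.55 falls below the class 1 threshold of 0.7, so it is not counted.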
Multiprocessing
On Linux, the scoring and counting step can be parallelized across multiple processes using fork-based process pools. The shared data (thresholds, labels, pred_probs) is accessed via global variables that are inherited by child processes through fork semantics, avoiding data serialization overhead.
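The fork-based pattern can be sketched with the standard library alone. This is an assumption-laden illustration, not cleanlab's code: the global names, function names, and demo data are hypothetical, and the sketch falls back to serial execution when the fork start method is unavailable.

```python
import multiprocessing as mp

# Module-level state; forked workers inherit it without any pickling.
_LABELS = []
_PRED_PROBS = []
_THRESHOLDS = []


def _count_issues_in_range(bounds):
    """Count label issues among examples with index in [start, stop)."""
    start, stop = bounds
    count = 0
    for i in range(start, stop):
        probs = _PRED_PROBS[i]
        pred_class = max(range(len(probs)), key=probs.__getitem__)
        if pred_class != _LABELS[i] and probs[pred_class] >= _THRESHOLDS[pred_class]:
            count += 1
    return count


def count_issues_parallel(labels, pred_probs, thresholds, batch_size, n_jobs):
    """Sum per-batch issue counts, using forked workers when possible."""
    global _LABELS, _PRED_PROBS, _THRESHOLDS
    # Set the globals before the pool is created so the children inherit them.
    _LABELS, _PRED_PROBS, _THRESHOLDS = labels, pred_probs, thresholds
    ranges = [
        (start, min(start + batch_size, len(labels)))
        for start in range(0, len(labels), batch_size)
    ]
    if n_jobs > 1 and "fork" in mp.get_all_start_methods():
        # Only the small (start, stop) tuples and integer results are pickled.
        with mp.get_context("fork").Pool(n_jobs) as pool:
            return sum(pool.map(_count_issues_in_range, ranges))
    return sum(map(_count_issues_in_range, ranges))  # serial fallback


# Hypothetical demo data: 4 examples, 2 classes.
demo_total = count_issues_parallel(
    labels=[0, 1, 1, 0],
    pred_probs=[[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.45, 0.55]],
    thresholds=[0.75, 0.7],
    batch_size=2,
    n_jobs=2,
)
```

The design choice here is that workers receive only index ranges. The large arrays are never sent through the pool's task queue, which is the serialization cost that fork inheritance avoids.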