Implementation:Cleanlab Cleanlab Find Label Issues Batched
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Machine Learning, Scalability |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Memory-efficient mini-batch implementation of label issue detection for large-scale classification datasets.
Description
The label_issues_batched module provides a batched variant of cleanlab's core find_label_issues functionality. It consists of the find_label_issues_batched convenience function and the underlying LabelInspector class. Data is processed in configurable mini-batches by a two-pass streaming algorithm: the first pass incrementally estimates per-class confident thresholds via weighted averaging across batches, and the second pass uses those thresholds to score label quality and count the estimated label issues. With default settings, results closely approximate those of cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence"). The module supports memory-mapped data sources (numpy mmap, Zarr arrays), multiprocessing on Linux via fork-based pools, and both the standard and "off_diagonal_calibrated" estimation methods.
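The first pass described above can be illustrated with a small dependency-free sketch. It assumes that the confident threshold for class k is the mean predicted probability of class k over the examples labeled k, and it keeps a running per-class count so that each batch is folded in as a weighted average. The helper name update_thresholds and the toy data are hypothetical; this is not cleanlab's own code.

```python
from typing import List, Sequence


def update_thresholds(
    thresholds: List[float],
    counts: List[int],
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
) -> None:
    """Fold one mini-batch into the running per-class averages, in place."""
    num_classes = len(thresholds)
    batch_sums = [0.0] * num_classes
    batch_counts = [0] * num_classes
    for label, probs in zip(labels, pred_probs):
        batch_sums[label] += probs[label]  # probability assigned to the given label
        batch_counts[label] += 1
    for k in range(num_classes):
        total = counts[k] + batch_counts[k]
        if total > 0:  # classes not seen so far keep their initial value
            # Weighted average of the old estimate and this batch's contribution.
            thresholds[k] = (counts[k] * thresholds[k] + batch_sums[k]) / total
        counts[k] = total


# Toy data (hypothetical): 4 examples, 2 classes, processed as two batches of 2.
labels = [0, 1, 0, 1]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]

thresholds, counts = [0.0, 0.0], [0, 0]
for start in range(0, len(labels), 2):
    stop = start + 2
    update_thresholds(thresholds, counts, labels[start:stop], pred_probs[start:stop])
```

Because each update weights the old estimate by the number of examples already seen for that class, the final values equal the per-class means over the full dataset, regardless of how the data is split into batches.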
Usage
Use this module when your dataset is too large to fit in memory for standard cleanlab label issue detection. Pass labels and predicted probabilities either as memory-mapped arrays (labels, pred_probs) or as paths to .npy files (labels_file, pred_probs_file), and set batch_size to the largest value your RAM allows.
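If the predicted probabilities are produced chunk by chunk (for example during model inference), the .npy file can be written without ever holding the full array in RAM. The sketch below uses NumPy's np.lib.format.open_memmap for writing and np.load with mmap_mode="r" for reading. The array sizes, the temporary path, and the random data are assumptions made only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes, for illustration only.
num_examples, num_classes, chunk = 1000, 3, 250
pred_probs_path = os.path.join(tempfile.mkdtemp(), "pred_probs.npy")

# Create the .npy file on disk and fill it one chunk at a time.
out = np.lib.format.open_memmap(
    pred_probs_path, mode="w+", dtype=np.float32, shape=(num_examples, num_classes)
)
rng = np.random.default_rng(0)
for start in range(0, num_examples, chunk):
    stop = min(start + chunk, num_examples)
    raw = rng.random((stop - start, num_classes), dtype=np.float32)
    out[start:stop] = raw / raw.sum(axis=1, keepdims=True)  # rows sum to 1
out.flush()
del out  # release the writable mapping

# Reopen lazily: slicing reads only the requested rows from disk.
pred_probs = np.load(pred_probs_path, mmap_mode="r")
first_batch = np.asarray(pred_probs[:chunk])
```

The resulting path can then be supplied as pred_probs_file, or the memory-mapped array itself as pred_probs.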
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/experimental/label_issues_batched.py
- Lines: 1-761
Signature
def find_label_issues_batched(
    labels: Optional[LabelLike] = None,
    pred_probs: Optional[np.ndarray] = None,
    *,
    labels_file: Optional[str] = None,
    pred_probs_file: Optional[str] = None,
    batch_size: int = 10000,
    n_jobs: Optional[int] = 1,
    verbose: bool = True,
    quality_score_kwargs: Optional[dict] = None,
    num_issue_kwargs: Optional[dict] = None,
    return_mask: bool = False,
) -> np.ndarray

class LabelInspector:
    def __init__(
        self,
        *,
        num_class: int,
        store_results: bool = True,
        verbose: bool = True,
        quality_score_kwargs: Optional[dict] = None,
        num_issue_kwargs: Optional[dict] = None,
        n_jobs: Optional[int] = 1,
    )
Import
from cleanlab.experimental.label_issues_batched import find_label_issues_batched
from cleanlab.experimental.label_issues_batched import LabelInspector
I/O Contract
Inputs (find_label_issues_batched)
| Name | Type | Required | Description |
|---|---|---|---|
| labels | np.ndarray-like | No | 1D array of integer class labels in 0, 1, ..., K-1. Can be a memory-mapped object. Provide either this or labels_file. |
| pred_probs | np.ndarray-like | No | 2D array of model-predicted class probabilities, one row per example and one column per class. Can be a memory-mapped object. Provide either this or pred_probs_file. |
| labels_file | str | No | Path to a .npy file containing the labels array, loaded via mmap. Alternative to labels. |
| pred_probs_file | str | No | Path to a .npy file containing the pred_probs array, loaded via mmap. Alternative to pred_probs. |
| batch_size | int | No | Size of mini-batches. Default 10000. Use the largest value your memory allows. |
| n_jobs | int | No | Number of processes for multiprocessing (Linux only). Default 1. |
| verbose | bool | No | Whether to display progress bars and print statements. Default True. |
| quality_score_kwargs | dict | No | Keyword arguments passed to rank.get_label_quality_scores. |
| num_issue_kwargs | dict | No | Keyword arguments to control num_label_issues estimation (e.g., estimation_method). |
| return_mask | bool | No | If True, returns a boolean mask; if False, returns sorted indices. Default False. |
Outputs
| Name | Type | Description |
|---|---|---|
| label_issues | np.ndarray | If return_mask is False: array of indices of examples with label issues, sorted by label quality score (most severe first). If return_mask is True: boolean mask where True indicates a label issue. |
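The two output forms carry slightly different information: the index array preserves the severity ranking, while the boolean mask does not. The following sketch converts between them with plain NumPy. The dataset size and index values are hypothetical.

```python
import numpy as np

num_examples = 8  # hypothetical dataset size
issue_indices = np.array([5, 1, 6])  # hypothetical result, most severe first

# Indices -> boolean mask over all examples.
mask = np.zeros(num_examples, dtype=bool)
mask[issue_indices] = True

# Mask -> indices. These come back in ascending order, so the
# severity ranking of the original index array is not recoverable.
indices_from_mask = np.flatnonzero(mask)
```

Request sorted indices when the goal is to review the worst examples first, and a mask when the goal is to filter a dataset.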
Key Methods (LabelInspector)
| Method | Description |
|---|---|
| update_confident_thresholds(labels, pred_probs) | Incrementally updates per-class confident thresholds from a batch of data. |
| score_label_quality(labels, pred_probs) | Scores label quality for a batch and updates the running issue count. Returns per-example scores. |
| get_confident_thresholds() | Returns the current estimated confident thresholds array of shape (K,). |
| get_num_issues() | Returns the estimated total number of label issues seen so far. |
| get_quality_scores() | Returns all accumulated label quality scores as a 1D array. Requires store_results=True. |
| get_label_issues() | Returns indices of examples with label issues, sorted by quality score (most severe first). Requires store_results=True. |
Usage Examples
Basic Usage: From .npy Files
import numpy as np
from cleanlab.experimental.label_issues_batched import find_label_issues_batched
# Save your existing arrays to .npy files
np.save("labels.npy", labels_array)
np.save("pred_probs.npy", pred_probs_array)
# Find label issues with batched processing
issue_indices = find_label_issues_batched(
labels_file="labels.npy",
pred_probs_file="pred_probs.npy",
batch_size=10000,
)
print(f"Found {len(issue_indices)} label issues")
Advanced Usage: LabelInspector Class
import numpy as np
from cleanlab.experimental.label_issues_batched import LabelInspector
labels = np.load("labels.npy", mmap_mode="r")
pred_probs = np.load("pred_probs.npy", mmap_mode="r")
n = len(labels)
batch_size = 10000
lab = LabelInspector(num_class=pred_probs.shape[1])
# Pass 1: Estimate confident thresholds
i = 0
while i < n:
    end = i + batch_size
    lab.update_confident_thresholds(labels[i:end], pred_probs[i:end, :])
    i = end
# Pass 2: Score label quality
i = 0
while i < n:
    end = i + batch_size
    lab.score_label_quality(labels[i:end], pred_probs[i:end, :])
    i = end
# Retrieve results
issue_indices = lab.get_label_issues()
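Both passes above repeat the same index arithmetic. One option is to factor it into a small generator that yields slice objects. The helper batch_slices below is a hypothetical convenience, not part of cleanlab.

```python
from typing import Iterator


def batch_slices(n: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices that cover range(n) in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n, batch_size):
        # The last slice is clipped so it never runs past n.
        yield slice(start, min(start + batch_size, n))


# Example with a plain list standing in for a memory-mapped array.
data = list(range(25))
batches = [data[s] for s in batch_slices(len(data), 10)]
```

With this helper, each pass becomes a single for loop over batch_slices(n, batch_size), calling update_confident_thresholds or score_label_quality on labels[s] and pred_probs[s].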