Implementation:Cleanlab Cleanlab Find Label Issues Batched

Knowledge Sources	Cleanlab
Domains	Data Quality, Machine Learning, Scalability
Last Updated	2026-02-09 00:00 GMT

Overview

Memory-efficient mini-batch implementation of label issue detection for large-scale classification datasets.

Description

The label_issues_batched module provides a batched variant of cleanlab's core find_label_issues functionality. It consists of the find_label_issues_batched convenience function and the underlying LabelInspector class. The approach processes data in configurable mini-batches using a two-pass streaming algorithm: the first pass incrementally estimates per-class confident thresholds via weighted averaging, and the second pass scores label quality and counts estimated label issues using those thresholds. With default settings, results closely approximate those of cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence"). The module supports memory-mapped data sources (numpy mmap, Zarr arrays), multiprocessing on Linux via fork-based pools, and both standard and "off_diagonal_calibrated" estimation methods.
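The first pass can be pictured as a running per-class weighted average. The following is a minimal NumPy sketch under that assumption, not cleanlab's actual code: the helper name streaming_confident_thresholds is hypothetical, and it treats each class threshold as the mean predicted probability of that class over the examples labeled with it.

```python
import numpy as np


def streaming_confident_thresholds(labels, pred_probs, batch_size):
    """Hypothetical helper: per-class mean self-confidence, updated batch by batch."""
    num_class = pred_probs.shape[1]
    thresholds = np.zeros(num_class)
    counts = np.zeros(num_class)
    for start in range(0, len(labels), batch_size):
        y = np.asarray(labels[start:start + batch_size])
        p = np.asarray(pred_probs[start:start + batch_size])
        # Number of examples of each class in this batch
        batch_counts = np.bincount(y, minlength=num_class)
        # Sum of the probability assigned to the given label, grouped by class
        batch_sums = np.bincount(y, weights=p[np.arange(len(y)), y], minlength=num_class)
        batch_means = batch_sums / np.clip(batch_counts, 1, None)
        # Weighted average of the running value and the batch value
        total = counts + batch_counts
        thresholds = (counts * thresholds + batch_counts * batch_means) / np.clip(total, 1, None)
        counts = total
    return thresholds


# Toy data (illustrative only): 25 examples, 3 classes
rng = np.random.default_rng(0)
demo_labels = np.arange(25) % 3
demo_pred_probs = rng.dirichlet(np.ones(3), size=25)
demo_thresholds = streaming_confident_thresholds(demo_labels, demo_pred_probs, batch_size=10)
print(demo_thresholds)
```

Because each batch contributes in proportion to its per-class counts, the running value matches the full-data per-class mean regardless of how the data is split into batches.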

Usage

Use this module when the labels and predicted probabilities are too large to hold in memory for cleanlab.filter.find_label_issues. Pass them as memory-mapped arrays or as paths to .npy files, and set batch_size to the largest value your RAM allows.
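If the predicted probabilities are produced batch by batch, one possible way to avoid ever holding the full array in RAM is to write them straight into a .npy file and reopen it memory-mapped. The sketch below uses numpy.lib.format.open_memmap; the helper name write_pred_probs_in_batches and the randomly generated batches are stand-ins for a real model's output, not part of cleanlab.

```python
import os
import tempfile

import numpy as np


def write_pred_probs_in_batches(path, batches, n_examples, num_class):
    """Hypothetical helper: stream batches of probabilities into a .npy file on disk."""
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(n_examples, num_class)
    )
    start = 0
    for batch in batches:
        end = start + len(batch)
        out[start:end] = batch  # only this slice needs to be in memory
        start = end
    out.flush()
    del out  # release the writable memory map
    return path


# Stand-in for model output: 5 batches of 4 examples over 3 classes
rng = np.random.default_rng(0)
fake_batches = [rng.dirichlet(np.ones(3), size=4) for _ in range(5)]

tmp_dir = tempfile.mkdtemp()
pred_probs_path = write_pred_probs_in_batches(
    os.path.join(tmp_dir, "pred_probs.npy"), fake_batches, n_examples=20, num_class=3
)

# Reopen read-only and memory-mapped, as a batched consumer would
pred_probs = np.load(pred_probs_path, mmap_mode="r")
print(type(pred_probs).__name__, pred_probs.shape)
```

The resulting path can then be supplied as pred_probs_file, or the memory-mapped array as pred_probs.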

Code Reference

Source Location

Repository: Cleanlab
File: cleanlab/experimental/label_issues_batched.py
Lines: 1-761

Signature

def find_label_issues_batched(
    labels: Optional[LabelLike] = None,
    pred_probs: Optional[np.ndarray] = None,
    *,
    labels_file: Optional[str] = None,
    pred_probs_file: Optional[str] = None,
    batch_size: int = 10000,
    n_jobs: Optional[int] = 1,
    verbose: bool = True,
    quality_score_kwargs: Optional[dict] = None,
    num_issue_kwargs: Optional[dict] = None,
    return_mask: bool = False,
) -> np.ndarray

class LabelInspector:
    def __init__(
        self,
        *,
        num_class: int,
        store_results: bool = True,
        verbose: bool = True,
        quality_score_kwargs: Optional[dict] = None,
        num_issue_kwargs: Optional[dict] = None,
        n_jobs: Optional[int] = 1,
    )

Import

from cleanlab.experimental.label_issues_batched import find_label_issues_batched
from cleanlab.experimental.label_issues_batched import LabelInspector

I/O Contract

Inputs (find_label_issues_batched)

Name	Type	Required	Description
labels	np.ndarray-like	No	1D array of integer class labels in 0, 1, ..., K-1. Can be a memory-mapped object. Provide either this or labels_file.
pred_probs	np.ndarray-like	No	2D array of model-predicted class probabilities, with one row per example and one column per class. Can be a memory-mapped object. Provide either this or pred_probs_file.
labels_file	str	No	Path to .npy file containing the labels array, loaded via mmap.
pred_probs_file	str	No	Path to .npy file containing the pred_probs array, loaded via mmap.
batch_size	int	No	Size of mini-batches. Default 10000. Use the largest value your memory allows.
n_jobs	int	No	Number of processes for multiprocessing (Linux only). Default 1.
verbose	bool	No	Whether to display progress bars and print statements. Default True.
quality_score_kwargs	dict	No	Keyword arguments passed to rank.get_label_quality_scores.
num_issue_kwargs	dict	No	Keyword arguments to control num_label_issues estimation (e.g., estimation_method).
return_mask	bool	No	If True, returns a boolean mask; if False, returns sorted indices. Default False.

Outputs

Name	Type	Description
label_issues	np.ndarray	If return_mask is False: array of indices of examples with label issues, sorted by label quality score (most severe first). If return_mask is True: boolean mask where True indicates a label issue.
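The two output forms carry the same set of flagged examples, except that the index form also encodes a ranking. As a small illustration, assuming a hypothetical index output, the indices can be turned into a boolean mask to select the examples that were not flagged (issue_indices_to_mask is a made-up helper name):

```python
import numpy as np


def issue_indices_to_mask(issue_indices, n_examples):
    """Hypothetical helper: boolean mask that is True at each flagged index."""
    mask = np.zeros(n_examples, dtype=bool)
    mask[np.asarray(issue_indices, dtype=int)] = True
    return mask


# Hypothetical index output for a 10-example dataset, most severe first
example_indices = np.array([7, 2, 5])
example_mask = issue_indices_to_mask(example_indices, n_examples=10)

# Positions of the examples that were not flagged
clean_positions = np.flatnonzero(~example_mask)
print(clean_positions)
```

The ranking is lost in the mask form, so keep the index form when the most severe issues should be reviewed first.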

Key Methods (LabelInspector)

Method	Description
update_confident_thresholds(labels, pred_probs)	Incrementally updates per-class confident thresholds from a batch of data.
score_label_quality(labels, pred_probs)	Scores label quality for a batch and updates the running issue count. Returns per-example scores.
get_confident_thresholds()	Returns the current estimated confident thresholds array of shape (K,).
get_num_issues()	Returns the estimated total number of label issues seen so far.
get_quality_scores()	Returns all accumulated label quality scores as a 1D array.
get_label_issues()	Returns indices of examples with label issues, sorted by quality score.
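One way to read how the last three methods relate is that the flagged examples are the lowest-scoring ones, up to the estimated issue count. The sketch below illustrates that reading with plain NumPy; it is an assumption about the behavior rather than a copy of the library's code, and lowest_scoring_indices is a hypothetical name.

```python
import numpy as np


def lowest_scoring_indices(quality_scores, num_issues):
    """Hypothetical helper: indices of the num_issues lowest scores, worst first."""
    order = np.argsort(quality_scores, kind="stable")
    return order[:num_issues]


# Toy quality scores (lower means more likely mislabeled) and a toy issue count
toy_scores = np.array([0.90, 0.10, 0.50, 0.05])
toy_issues = lowest_scoring_indices(toy_scores, num_issues=2)
print(toy_issues)
```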

Usage Examples

Basic Usage: From .npy Files

import numpy as np
from cleanlab.experimental.label_issues_batched import find_label_issues_batched

# Save your existing arrays to .npy files
# (labels_array and pred_probs_array are assumed to be defined already)
np.save("labels.npy", labels_array)
np.save("pred_probs.npy", pred_probs_array)

# Find label issues with batched processing
issue_indices = find_label_issues_batched(
    labels_file="labels.npy",
    pred_probs_file="pred_probs.npy",
    batch_size=10000,
)
print(f"Found {len(issue_indices)} label issues")

Advanced Usage: LabelInspector Class

import numpy as np
from cleanlab.experimental.label_issues_batched import LabelInspector

labels = np.load("labels.npy", mmap_mode="r")
pred_probs = np.load("pred_probs.npy", mmap_mode="r")
n = len(labels)
batch_size = 10000

lab = LabelInspector(num_class=pred_probs.shape[1])

# Pass 1: Estimate confident thresholds
for start in range(0, n, batch_size):
    end = start + batch_size
    lab.update_confident_thresholds(labels[start:end], pred_probs[start:end, :])

# Pass 2: Score label quality using the thresholds from pass 1
for start in range(0, n, batch_size):
    end = start + batch_size
    lab.score_label_quality(labels[start:end], pred_probs[start:end, :])

# Retrieve results
issue_indices = lab.get_label_issues()

Related Pages

Principle:Cleanlab_Cleanlab_Batched_Label_Issue_Detection
