

Implementation:Cleanlab Cleanlab Find Label Issues Batched

From Leeroopedia


Knowledge Sources
Domains: Data Quality, Machine Learning, Scalability
Last Updated: 2026-02-09 00:00 GMT

Overview

Memory-efficient mini-batch implementation of label issue detection for large-scale classification datasets.

Description

The label_issues_batched module provides a batched variant of cleanlab's core find_label_issues functionality. It consists of the find_label_issues_batched convenience function and the underlying LabelInspector class. The approach processes data in configurable mini-batches using a two-pass streaming algorithm: the first pass incrementally estimates per-class confident thresholds via weighted averaging, and the second pass scores label quality and counts estimated label issues using those thresholds. With default settings, results closely approximate those of cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence"). The module supports memory-mapped data sources (numpy mmap, Zarr arrays), multiprocessing on Linux via fork-based pools, and both standard and "off_diagonal_calibrated" estimation methods.
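The first pass can be pictured as a running, count-weighted average. The sketch below is not cleanlab's implementation. It assumes that the confident threshold for class k is the mean predicted probability of class k over the examples labeled k, and the helper name `update_thresholds` is hypothetical.

```python
import numpy as np

def update_thresholds(thresholds, counts, labels, pred_probs):
    """Fold one batch into the running per-class thresholds (hypothetical helper)."""
    num_class = pred_probs.shape[1]
    # Number of examples per given label in this batch
    batch_counts = np.bincount(labels, minlength=num_class)
    # Each example's predicted probability for its own given label, summed per class
    own_label_probs = pred_probs[np.arange(len(labels)), labels]
    batch_sums = np.bincount(labels, weights=own_label_probs, minlength=num_class)
    new_counts = counts + batch_counts
    # Count-weighted average of the old thresholds and this batch's class means
    totals = thresholds * counts + batch_sums
    new_thresholds = np.divide(
        totals, new_counts, out=np.zeros(num_class), where=new_counts > 0
    )
    return new_thresholds, new_counts

# Demo on synthetic data: batched estimate vs. a single full pass
rng = np.random.default_rng(0)
num_examples, num_class, batch_size = 1000, 3, 128
labels = rng.integers(0, num_class, size=num_examples)
pred_probs = rng.dirichlet(np.ones(num_class), size=num_examples)

thresholds = np.zeros(num_class)
counts = np.zeros(num_class, dtype=np.int64)
for start in range(0, num_examples, batch_size):
    stop = start + batch_size
    thresholds, counts = update_thresholds(
        thresholds, counts, labels[start:stop], pred_probs[start:stop]
    )

# Reference: per-class means computed with all data in memory at once
full_pass = np.array([pred_probs[labels == k, k].mean() for k in range(num_class)])
```

Only the per-class thresholds and counts are carried between batches, so the memory needed for this pass does not grow with the number of examples.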

Usage

Use this module when the dataset is too large to fit in memory for standard cleanlab label issue detection. Pass labels and predicted probabilities either as memory-mapped arrays or as paths to .npy files, and set batch_size to the largest value your RAM allows.
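If the predicted probabilities are produced chunk by chunk (for example, during batched inference), they can be streamed straight into a .npy file instead of being assembled in RAM first. The following is a minimal sketch using NumPy only; the helper name `write_npy_in_chunks` and the synthetic chunks are illustrative assumptions, not part of cleanlab.

```python
import os
import tempfile
import numpy as np

def write_npy_in_chunks(path, chunks, num_examples, num_class, dtype=np.float32):
    """Stream row chunks into a .npy file without building the full array in RAM."""
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=dtype, shape=(num_examples, num_class)
    )
    row = 0
    for chunk in chunks:
        out[row:row + len(chunk)] = chunk
        row += len(chunk)
    out.flush()  # make sure the data reaches disk
    del out      # release the writable memory map
    return row

# Demo: three chunks of synthetic probabilities written to a temporary file
rng = np.random.default_rng(0)
chunks = [rng.dirichlet(np.ones(4), size=n) for n in (5, 5, 2)]
tmp_dir = tempfile.mkdtemp()
pred_probs_path = os.path.join(tmp_dir, "pred_probs.npy")
rows_written = write_npy_in_chunks(pred_probs_path, chunks, num_examples=12, num_class=4)

# Re-open read-only; rows are read from disk only when they are sliced
pred_probs = np.load(pred_probs_path, mmap_mode="r")
```

The resulting file path can then be given as pred_probs_file, or the memory-mapped array itself as pred_probs.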

Code Reference

Source Location

  • Repository: Cleanlab
  • File: cleanlab/experimental/label_issues_batched.py
  • Lines: 1-761

Signature

def find_label_issues_batched(
    labels: Optional[LabelLike] = None,
    pred_probs: Optional[np.ndarray] = None,
    *,
    labels_file: Optional[str] = None,
    pred_probs_file: Optional[str] = None,
    batch_size: int = 10000,
    n_jobs: Optional[int] = 1,
    verbose: bool = True,
    quality_score_kwargs: Optional[dict] = None,
    num_issue_kwargs: Optional[dict] = None,
    return_mask: bool = False,
) -> np.ndarray

class LabelInspector:
    def __init__(
        self,
        *,
        num_class: int,
        store_results: bool = True,
        verbose: bool = True,
        quality_score_kwargs: Optional[dict] = None,
        num_issue_kwargs: Optional[dict] = None,
        n_jobs: Optional[int] = 1,
    )

Import

from cleanlab.experimental.label_issues_batched import find_label_issues_batched
from cleanlab.experimental.label_issues_batched import LabelInspector

I/O Contract

Inputs (find_label_issues_batched)

Name | Type | Required | Description
labels | np.ndarray-like | No | 1D array of class labels (int) in 0, 1, ..., K-1. Can be a memory-mapped object. Must provide either this or labels_file.
pred_probs | np.ndarray-like | No | 2D array of model-predicted class probabilities. Can be a memory-mapped object. Must provide either this or pred_probs_file.
labels_file | str | No | Path to .npy file containing the labels array, loaded via mmap.
pred_probs_file | str | No | Path to .npy file containing the pred_probs array, loaded via mmap.
batch_size | int | No | Size of mini-batches. Default 10000. Use the largest value your memory allows.
n_jobs | int | No | Number of processes for multiprocessing (Linux only). Default 1.
verbose | bool | No | Whether to display progress bars and print statements. Default True.
quality_score_kwargs | dict | No | Keyword arguments passed to rank.get_label_quality_scores.
num_issue_kwargs | dict | No | Keyword arguments to control num_label_issues estimation (e.g., estimation_method).
return_mask | bool | No | If True, returns a boolean mask; if False, returns sorted indices. Default False.

Outputs

Name | Type | Description
label_issues | np.ndarray | If return_mask is False: array of indices of examples with label issues, sorted by label quality score (most severe first). If return_mask is True: boolean mask where True indicates a label issue.
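The two output forms carry slightly different information: ranked indices preserve severity order, while a mask is aligned with the dataset and convenient for filtering. A small NumPy sketch of converting between them follows; the helper name `indices_to_mask` and the example values are made up for illustration.

```python
import numpy as np

def indices_to_mask(issue_indices, num_examples):
    """Build a boolean mask that is True at every flagged position."""
    mask = np.zeros(num_examples, dtype=bool)
    mask[issue_indices] = True
    return mask

# Hypothetical result for a 10-example dataset, most severe issue first
issue_indices = np.array([7, 2, 5])
mask = indices_to_mask(issue_indices, num_examples=10)

# Going back from a mask gives the flagged positions in ascending order,
# not in severity order, so the ranking is lost
positions = np.flatnonzero(mask)

# A mask makes it easy to drop flagged examples from any aligned array
labels = np.arange(10) % 3
clean_labels = labels[~mask]
```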

Key Methods (LabelInspector)

Method | Description
update_confident_thresholds(labels, pred_probs) | Incrementally updates per-class confident thresholds from a batch of data.
score_label_quality(labels, pred_probs) | Scores label quality for a batch and updates the running issue count. Returns per-example scores.
get_confident_thresholds() | Returns the current estimated confident thresholds array of shape (K,).
get_num_issues() | Returns the estimated total number of label issues seen so far.
get_quality_scores() | Returns all accumulated label quality scores as a 1D array.
get_label_issues() | Returns indices of examples with label issues, sorted by quality score.
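One way to picture how the accumulated scores, the issue count, and the final ranking relate is sketched below. It assumes the default self-confidence score (the predicted probability of each example's given label) and takes the issue count as a given number, whereas the real class estimates that count from the confident thresholds. The function names are hypothetical and this is not cleanlab's code.

```python
import numpy as np

def self_confidence(labels, pred_probs):
    """Predicted probability of each example's given label (lower is more suspicious)."""
    return pred_probs[np.arange(len(labels)), labels]

def lowest_scoring(scores, num_issues):
    """Indices of the num_issues lowest scores, most severe first."""
    return np.argsort(scores, kind="stable")[:num_issues]

labels = np.array([0, 1, 1, 0])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])

# Score batch by batch, keeping only the 1D scores between batches
batch_size = 3
batch_scores = []
for start in range(0, len(labels), batch_size):
    stop = start + batch_size
    batch_scores.append(self_confidence(labels[start:stop], pred_probs[start:stop]))
scores = np.concatenate(batch_scores)

num_issues = 2  # assumed here for illustration
issue_indices = lowest_scoring(scores, num_issues)
```

Keeping only one float per example between batches is what lets the second pass scale: the full pred_probs matrix never has to be resident at once.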

Usage Examples

Basic Usage: From .npy Files

import numpy as np
from cleanlab.experimental.label_issues_batched import find_label_issues_batched

# Save your existing in-memory arrays to .npy files
# (labels_array: shape (N,), pred_probs_array: shape (N, K); both defined elsewhere)
np.save("labels.npy", labels_array)
np.save("pred_probs.npy", pred_probs_array)

# Find label issues with batched processing
issue_indices = find_label_issues_batched(
    labels_file="labels.npy",
    pred_probs_file="pred_probs.npy",
    batch_size=10000,
)
print(f"Found {len(issue_indices)} label issues")

Advanced Usage: LabelInspector Class

import numpy as np
from cleanlab.experimental.label_issues_batched import LabelInspector

labels = np.load("labels.npy", mmap_mode="r")
pred_probs = np.load("pred_probs.npy", mmap_mode="r")
n = len(labels)
batch_size = 10000

lab = LabelInspector(num_class=pred_probs.shape[1])

# Pass 1: Estimate confident thresholds
i = 0
while i < n:
    end = i + batch_size
    lab.update_confident_thresholds(labels[i:end], pred_probs[i:end, :])
    i = end

# Pass 2: Score label quality
i = 0
while i < n:
    end = i + batch_size
    lab.score_label_quality(labels[i:end], pred_probs[i:end, :])
    i = end

# Retrieve results
issue_indices = lab.get_label_issues()
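Both passes above walk the data with the same slicing pattern, which can be factored into a small generator. The name `batch_slices` is a hypothetical helper, not part of cleanlab.

```python
def batch_slices(num_examples, batch_size):
    """Yield consecutive slices that cover range(num_examples) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_examples, batch_size):
        yield slice(start, min(start + batch_size, num_examples))

# Demo: 25 examples in batches of 10 give two full batches and one partial batch
slices = list(batch_slices(25, 10))
```

With it, each pass reduces to a single loop, for example `for s in batch_slices(n, batch_size): lab.update_confident_thresholds(labels[s], pred_probs[s])`.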
