Principle: Cleanlab Automated Issue Detection
| Metadata | |
|---|---|
| Sources | Cleanlab, Cleanlab Docs |
| Domains | Data_Quality, Dataset_Auditing, Machine_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Automated pipeline that orchestrates multiple specialized issue detectors to find diverse quality problems in a dataset from a single API call.
Description
Automated issue detection runs a battery of specialized issue managers against the dataset. Each manager is responsible for detecting a specific type of data quality problem:
- Label issues: Mislabeled examples detected via confident learning on predicted probabilities.
- Outlier issues: Anomalous examples identified via distance to nearest neighbors in feature space.
- Duplicate issues: Near-duplicate examples found via nearest neighbor similarity.
- Non-IID issues: Statistical tests for whether the data ordering is non-random.
- Null issues: Missing or null values in the dataset.
- Class imbalance issues: Severely underrepresented classes.
- Underperforming group issues: Subsets of data where the model performs poorly.
- Data valuation issues: Examples that contribute negatively to model performance.
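The first of these, label issue detection, can be illustrated with a simplified confident-learning check (a toy sketch of the idea, not Cleanlab's actual implementation): an example is flagged when the predicted probability of its given label falls below that class's average self-confidence.

```python
def find_label_issues(labels, pred_probs):
    """Simplified confident learning: flag example i as a potential label
    issue when pred_probs[i][labels[i]] falls below the class threshold,
    defined as the mean self-confidence of examples annotated with that class."""
    n_classes = len(pred_probs[0])
    thresholds = []
    for k in range(n_classes):
        # Mean predicted probability of class k over examples labeled k.
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return [p[y] < thresholds[y] for y, p in zip(labels, pred_probs)]
```

Here the fourth example (labeled class 1 but predicted mostly class 0) would be flagged, while confidently predicted examples pass.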
The IssueFinder orchestrator determines which managers can run given the available inputs (pred_probs, features, knn_graph) and executes them in sequence. All results are stored in a unified DataIssues container attached to the Datalab instance.
Usage
Use automated issue detection after initializing a Datalab instance with your dataset. Provide pred_probs (out-of-sample predicted probabilities from your model) and/or features (embedding vectors) to enable different types of issue detection. The more inputs you provide, the more comprehensive the audit will be.
Theoretical Basis
The automated issue detection system is built on a modular audit architecture:
- Independent issue managers: Each issue type has an independent manager class that implements a `find_issues()` interface. This separation of concerns allows each detector to use specialized algorithms without coupling to other detectors.
- Factory-based instantiation: The orchestrator uses a factory pattern to instantiate the set of available managers based on which inputs (`pred_probs`, `features`, `knn_graph`) were provided and which task type is configured. Managers that require unavailable inputs are automatically skipped.
- Per-example scoring: Each manager independently computes two outputs per example: a boolean flag (`is_{type}_issue`) indicating whether the example has that issue, and a numeric severity score (`{type}_score`) between 0 and 1, where lower values indicate more severe issues.
- Signal routing: Different inputs enable different detectors. For example, `pred_probs` enables label issue detection, while `features` or `knn_graph` enables outlier and duplicate detection. Providing both enables the full suite of detectors.
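The routing idea can be sketched as a small lookup table (illustrative only; the actual registry in Cleanlab is larger and varies by task type):

```python
def available_detectors(pred_probs=None, features=None, knn_graph=None):
    """Map provided inputs to the detectors they enable.
    Hypothetical routing table for illustration, not Cleanlab's internal one."""
    has_neighbors = features is not None or knn_graph is not None
    routing = {
        "label": pred_probs is not None,
        "outlier": has_neighbors,
        "near_duplicate": has_neighbors,
        "non_iid": has_neighbors,
        "class_imbalance": True,  # needs only the labels themselves
    }
    return sorted(name for name, enabled in routing.items() if enabled)
```

With no optional inputs, only label-only checks like class imbalance run; each additional input unlocks more of the table.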
Pseudocode
```
def find_issues(datalab, pred_probs, features, knn_graph, issue_types):
    # Determine which issue managers to run
    if issue_types is None:
        managers = get_default_managers(datalab.task)
    else:
        managers = get_requested_managers(issue_types)

    # Filter to managers whose required inputs are available
    available_managers = filter_by_available_inputs(
        managers, pred_probs, features, knn_graph
    )

    # Run each manager in sequence
    for manager in available_managers:
        manager.find_issues(
            pred_probs=pred_probs,
            features=features,
            knn_graph=knn_graph,
        )

    # Results stored in datalab.data_issues
    report_total_issues_found(datalab.data_issues)
```
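A runnable miniature of this orchestration loop, using stub managers with deliberately toy detection rules (none of these class names or rules are Cleanlab internals):

```python
class OutlierManager:
    requires = {"features"}
    issue_type = "outlier"

    def find_issues(self, features=None, **_):
        # Toy rule: flag values far from the median of the feature column.
        median = sorted(features)[len(features) // 2]
        return [abs(x - median) > 2.0 for x in features]


class LabelManager:
    requires = {"pred_probs", "labels"}
    issue_type = "label"

    def find_issues(self, pred_probs=None, labels=None, **_):
        # Toy rule: flag examples whose given label has probability < 0.5.
        return [p[y] < 0.5 for y, p in zip(labels, pred_probs)]


def find_issues(managers, **inputs):
    """Run every manager whose required inputs were provided; pool results."""
    provided = {k for k, v in inputs.items() if v is not None}
    data_issues = {}
    for manager in managers:
        if manager.requires <= provided:  # skip if inputs are missing
            data_issues[manager.issue_type] = manager.find_issues(**inputs)
    return data_issues
```

Calling `find_issues` with only `features` runs just the outlier manager; adding `pred_probs` and `labels` unlocks the label manager as well, mirroring how the real orchestrator widens the audit as more inputs are supplied.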