Principle: Cleanlab Automated Issue Detection
| Metadata | |
|---|---|
| Sources | Cleanlab, Cleanlab Docs |
| Domains | Data_Quality, Dataset_Auditing, Machine_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Automated pipeline that orchestrates multiple specialized issue detectors to find diverse quality problems in a dataset from a single API call.
Description
Automated issue detection runs a battery of specialized issue managers against the dataset. Each manager is responsible for detecting a specific type of data quality problem:
- Label issues: Mislabeled examples detected via confident learning on predicted probabilities.
- Outlier issues: Anomalous examples identified via distance to nearest neighbors in feature space.
- Duplicate issues: Near-duplicate examples found via nearest neighbor similarity.
- Non-IID issues: Statistical tests for whether the data ordering is non-random.
- Null issues: Missing or null values in the dataset.
- Class imbalance issues: Severely underrepresented classes.
- Underperforming group issues: Subsets of data where the model performs poorly.
- Data valuation issues: Examples that contribute negatively to model performance.
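The first of these, label issue detection, can be illustrated with a simplified confident-learning check (a toy sketch of the idea, not Cleanlab's actual implementation): an example is flagged when the predicted probability of its given label falls below that class's average self-confidence.

```python
def find_label_issues(labels, pred_probs):
    """Simplified confident learning: flag example i as a potential label
    issue when pred_probs[i][labels[i]] falls below the class threshold,
    defined as the mean self-confidence of examples annotated with that class."""
    n_classes = len(pred_probs[0])
    thresholds = []
    for k in range(n_classes):
        # Mean predicted probability of class k over examples labeled k.
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return [p[y] < thresholds[y] for y, p in zip(labels, pred_probs)]
```

Here the fourth example (labeled class 1 but predicted mostly class 0) would be flagged, while confidently predicted examples pass.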
The IssueFinder orchestrator determines which managers can run given the available inputs (pred_probs, features, knn_graph) and executes them in sequence. All results are stored in a unified DataIssues container attached to the Datalab instance.
Usage
Use automated issue detection after initializing a Datalab instance with your dataset. Provide pred_probs (out-of-sample predicted probabilities from your model) and/or features (embedding vectors) to enable different types of issue detection. The more inputs you provide, the more comprehensive the audit will be.
Theoretical Basis
The automated issue detection system is built on a modular audit architecture:
- Independent issue managers: Each issue type has an independent manager class that implements a `find_issues()` interface. This separation of concerns allows each detector to use specialized algorithms without coupling to other detectors.
- Factory-based instantiation: The orchestrator uses a factory pattern to instantiate the set of available managers based on which inputs (`pred_probs`, `features`, `knn_graph`) were provided and which task type is configured. Managers that require unavailable inputs are automatically skipped.
- Per-example scoring: Each manager independently computes two outputs per example: a boolean flag (`is_{type}_issue`) indicating whether the example has that issue, and a numeric severity score (`{type}_score`) between 0 and 1, where lower values indicate more severe issues.
- Signal routing: Different inputs enable different detectors. For example, `pred_probs` enables label issue detection, while `features` or `knn_graph` enables outlier and duplicate detection. Providing both enables the full suite of detectors.
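The routing idea can be sketched as a small lookup table (illustrative only; the actual registry in Cleanlab is larger and varies by task type):

```python
def available_detectors(pred_probs=None, features=None, knn_graph=None):
    """Map provided inputs to the detectors they enable.
    Hypothetical routing table for illustration, not Cleanlab's internal one."""
    has_neighbors = features is not None or knn_graph is not None
    routing = {
        "label": pred_probs is not None,
        "outlier": has_neighbors,
        "near_duplicate": has_neighbors,
        "non_iid": has_neighbors,
        "class_imbalance": True,  # needs only the labels themselves
    }
    return sorted(name for name, enabled in routing.items() if enabled)
```

With no optional inputs, only label-only checks like class imbalance run; each additional input unlocks more of the table.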
Pseudocode
```
def find_issues(datalab, pred_probs, features, knn_graph, issue_types):
    # Determine which issue managers to run
    if issue_types is None:
        managers = get_default_managers(datalab.task)
    else:
        managers = get_requested_managers(issue_types)

    # Filter to managers whose required inputs are available
    available_managers = filter_by_available_inputs(
        managers, pred_probs, features, knn_graph
    )

    # Run each manager in sequence
    for manager in available_managers:
        manager.find_issues(
            pred_probs=pred_probs,
            features=features,
            knn_graph=knn_graph,
        )

    # Results stored in datalab.data_issues
    report_total_issues_found(datalab.data_issues)
```
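A runnable miniature of this orchestration loop, using stub managers with deliberately toy detection rules (none of these class names or rules are Cleanlab internals):

```python
class OutlierManager:
    requires = {"features"}
    issue_type = "outlier"

    def find_issues(self, features=None, **_):
        # Toy rule: flag values far from the median of the feature column.
        median = sorted(features)[len(features) // 2]
        return [abs(x - median) > 2.0 for x in features]


class LabelManager:
    requires = {"pred_probs", "labels"}
    issue_type = "label"

    def find_issues(self, pred_probs=None, labels=None, **_):
        # Toy rule: flag examples whose given label has probability < 0.5.
        return [p[y] < 0.5 for y, p in zip(labels, pred_probs)]


def find_issues(managers, **inputs):
    """Run every manager whose required inputs were provided; pool results."""
    provided = {k for k, v in inputs.items() if v is not None}
    data_issues = {}
    for manager in managers:
        if manager.requires <= provided:  # skip if inputs are missing
            data_issues[manager.issue_type] = manager.find_issues(**inputs)
    return data_issues
```

Calling `find_issues` with only `features` runs just the outlier manager; adding `pred_probs` and `labels` unlocks the label manager as well, mirroring how the real orchestrator widens the audit as more inputs are supplied.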