

Workflow:Cleanlab Datalab Dataset Audit

From Leeroopedia


Knowledge Sources
Domains Data_Centric_AI, Dataset_Quality, Automated_Auditing
Last Updated 2026-02-09 19:00 GMT

Overview

End-to-end process for automatically auditing a dataset for multiple types of quality issues using cleanlab's high-level Datalab API.

Description

This workflow uses cleanlab's Datalab class to perform a comprehensive, automated audit of dataset quality. Unlike the low-level API that focuses solely on label issues, Datalab detects multiple issue types simultaneously: label errors, outliers, near-duplicates, class imbalance, non-IID ordering, null values, underperforming data groups, and data valuation. It accepts feature embeddings and/or predicted probabilities, automatically determines which issue checks are possible given the available inputs, and produces a unified report with per-example issue scores and dataset-level statistics.

Usage

Execute this workflow when you want a comprehensive, one-stop audit of your dataset's quality. This is the recommended approach for most users because it automatically detects all applicable issue types without requiring manual configuration. It supports classification (binary and multi-class), multi-label classification, and regression tasks. The Datalab workflow is appropriate for text, image, tabular, or audio datasets, as long as you can provide feature embeddings and/or predicted probabilities from a trained model.

Execution Steps

Step 1: Prepare Dataset and Model Outputs

Load your dataset into a format accepted by Datalab (pandas DataFrame, HuggingFace Dataset, dict, or list of dicts). Ensure it includes a label column. Separately, obtain feature embeddings and/or out-of-sample predicted probabilities from a trained model. The more inputs you provide, the more issue types Datalab can detect.

Key considerations:

  • The dataset must include a label column; its name is passed to Datalab via the label_name argument
  • Feature embeddings enable detection of outliers, duplicates, non-IID issues, and underperforming groups
  • Predicted probabilities enable detection of label issues
  • Providing both features and pred_probs enables the most comprehensive audit
  • Supported task types: "classification", "regression", "multilabel"

Step 2: Initialize Datalab

Create a Datalab instance by passing your dataset and specifying the label column name and task type. Datalab wraps the dataset internally, validates the label format, and prepares the internal data structures for issue detection.

Key considerations:

  • The task parameter defaults to "classification" but should be set explicitly for regression or multilabel data
  • An optional image_key parameter enables integration with CleanVision for image quality checks
  • Datalab internally converts all dataset formats to a HuggingFace Dataset for uniform handling

Step 3: Find Issues

Call the find_issues method with your model outputs (pred_probs, features, or both). Datalab's internal IssueFinder orchestrator determines which issue types are applicable given the provided inputs, instantiates the corresponding IssueManager for each type, and runs all checks. Each IssueManager computes per-example scores and flags for its issue type.

Key considerations:

  • Issue types are automatically selected based on available inputs
  • You can explicitly specify which issue types to check via the issue_types parameter
  • A pre-computed KNN graph can be passed to avoid redundant computation across issue types
  • Cluster IDs can be provided for underperforming group detection
  • The method stores all results internally for later retrieval

Step 4: Review the Report

Call the report method to generate a human-readable summary of all detected issues. The report ranks issue types by severity, shows the number of examples flagged for each type, and provides dataset-level statistics. This gives a bird's-eye view of data quality.

Key considerations:

  • Issues are ranked by prevalence and severity
  • The report includes per-issue-type statistics and overall dataset health metrics
  • For large datasets, the report highlights the most impactful issues first
  • The verbosity of the report can be controlled

Step 5: Retrieve and Act on Issues

Use get_issues to retrieve per-example issue flags and scores as a DataFrame. Each row corresponds to a dataset example, and columns indicate whether the example was flagged for each issue type along with its quality score. Use these results to filter, fix, or remove problematic examples before retraining your model.

Key considerations:

  • Filter by specific issue type to focus on label errors, outliers, or duplicates
  • Sort by score columns to prioritize the worst examples for review
  • The get_issue_summary method provides aggregated statistics
  • Datalab objects can be saved and loaded for reproducibility via save/load methods

Execution Diagram

GitHub URL

Workflow Repository