Workflow: Cleanlab Datalab Dataset Audit
| Knowledge Sources | |
|---|---|
| Domains | Data_Centric_AI, Dataset_Quality, Automated_Auditing |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for automatically auditing a dataset for multiple types of quality issues using cleanlab's high-level Datalab API.
Description
This workflow uses cleanlab's Datalab class to perform a comprehensive, automated audit of dataset quality. Unlike the low-level API that focuses solely on label issues, Datalab detects multiple issue types simultaneously: label errors, outliers, near-duplicates, class imbalance, non-IID ordering, null values, underperforming data groups, and data valuation. It accepts feature embeddings and/or predicted probabilities, automatically determines which issue checks are possible given the available inputs, and produces a unified report with per-example issue scores and dataset-level statistics.
Usage
Execute this workflow when you want a comprehensive, one-stop audit of your dataset's quality. This is the recommended approach for most users because it automatically detects all applicable issue types without requiring manual configuration. It supports classification (binary and multi-class), multi-label classification, and regression tasks. The Datalab workflow is appropriate for text, image, tabular, or audio datasets, as long as you can provide feature embeddings and/or predicted probabilities from a trained model.
Execution Steps
Step 1: Prepare Dataset and Model Outputs
Load your dataset into a format accepted by Datalab (pandas DataFrame, HuggingFace Dataset, dict, or list of dicts). Ensure it includes a label column. Separately, obtain feature embeddings and/or out-of-sample predicted probabilities from a trained model. The more inputs you provide, the more issue types Datalab can detect.
Key considerations:
- The dataset must include a label column, identified via the label_name argument
- Feature embeddings enable detection of outliers, duplicates, non-IID issues, and underperforming groups
- Predicted probabilities enable detection of label issues
- Providing both features and pred_probs enables the most comprehensive audit
- Supported task types: "classification", "regression", "multilabel"
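The preparation step can be sketched without cleanlab itself; the essential point is that pred_probs are out-of-sample, i.e. each row is predicted by a model that never saw that example during training, typically via cross-validation. The toy features and labels below are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
features = rng.normal(size=(150, 8))       # e.g. model embeddings
labels = rng.integers(0, 3, size=150)      # 3-class toy labels
df = pd.DataFrame({"example_id": range(150), "label": labels})

# Out-of-sample predicted probabilities: one row per example,
# one column per class, rows summing to 1.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), features, labels,
    cv=5, method="predict_proba",
)
```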
Step 2: Initialize Datalab
Create a Datalab instance by passing your dataset and specifying the label column name and task type. Datalab wraps the dataset internally, validates the label format, and prepares the internal data structures for issue detection.
Key considerations:
- The task parameter defaults to "classification" but should be set explicitly for regression or multilabel data
- An optional image_key parameter enables integration with CleanVision for image quality checks
- Datalab internally converts all dataset formats to a HuggingFace Dataset for uniform handling
Step 3: Find Issues
Call the find_issues method with your model outputs (pred_probs, features, or both). Datalab's internal IssueFinder orchestrator determines which issue types are applicable given the provided inputs, instantiates the corresponding IssueManager for each type, and runs all checks. Each IssueManager computes per-example scores and flags for its issue type.
Key considerations:
- Issue types are automatically selected based on available inputs
- You can explicitly specify which issue types to check via the issue_types parameter
- A pre-computed KNN graph can be passed to avoid redundant computation across issue types
- Cluster IDs can be provided for underperforming group detection
- The method stores all results internally for later retrieval
Step 4: Review the Report
Call the report method to generate a human-readable summary of all detected issues. The report ranks issue types by severity, shows the number of examples flagged for each type, and provides dataset-level statistics. This gives a bird's-eye view of data quality.
Key considerations:
- Issues are ranked by prevalence and severity
- The report includes per-issue-type statistics and overall dataset health metrics
- For large datasets, the report highlights the most impactful issues first
- The verbosity of the report can be controlled
Step 5: Retrieve and Act on Issues
Use get_issues to retrieve per-example issue flags and scores as a DataFrame. Each row corresponds to a dataset example, and columns indicate whether the example was flagged for each issue type along with its quality score. Use these results to filter, fix, or remove problematic examples before retraining your model.
Key considerations:
- Filter by specific issue type to focus on label errors, outliers, or duplicates
- Sort by score columns to prioritize the worst examples for review
- The get_issue_summary method provides aggregated statistics
- Datalab objects can be saved and loaded for reproducibility via save/load methods