Principle: Cleanlab Issue Retrieval
| Metadata | |
|---|---|
| Sources | Cleanlab, Cleanlab Docs |
| Domains | Data_Quality, Dataset_Auditing, Data_Access |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Method for programmatically accessing per-example issue detection results as structured DataFrames for downstream analysis and action.
Description
Issue retrieval provides programmatic access to the raw results of automated issue detection. It returns a pandas DataFrame where each row corresponds to a dataset example and columns indicate whether each example has a particular type of issue and the corresponding severity score.
The returned DataFrame contains two columns per issue type:
- is_{type}_issue (bool): Whether this example is flagged as having the given issue type.
- {type}_score (float): A numeric quality score between 0 and 1, where lower values indicate more severe instances of the issue.
This structured output enables downstream programmatic operations:
- Filtering: Select only examples with a specific issue type for manual review.
- Sorting: Rank examples by severity score to prioritize the worst cases.
- Export: Save results to CSV or other formats for external tools.
- Conditional logic: Automatically exclude or relabel flagged examples in a data pipeline.
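These operations can be sketched with plain pandas, assuming an issues DataFrame in the format described above for a hypothetical "label" issue type (the data here is illustrative):

```python
import io

import pandas as pd

# Hypothetical issues DataFrame: one row per example, with the
# is_{type}_issue flag and {type}_score columns for a "label" issue type.
issues = pd.DataFrame({
    "is_label_issue": [False, True, True, False],
    "label_score": [0.92, 0.10, 0.35, 0.78],
})

# Filtering: keep only flagged examples for manual review.
flagged = issues[issues["is_label_issue"]]

# Sorting: lower scores are more severe, so sort ascending
# to rank the worst cases first.
worst_first = flagged.sort_values("label_score")

# Export: write results out for external tools (an in-memory
# buffer here; a file path works the same way).
buf = io.StringIO()
worst_first.to_csv(buf)
```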
Usage
Use issue retrieval after calling find_issues() to get the raw issue detection results for programmatic analysis, filtering, or export. This is the primary interface for integrating Datalab audit results into automated data cleaning pipelines.
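As a sketch of the pipeline-integration step, the conditional-logic case might drop flagged rows before training. This uses a mock issues DataFrame in place of real Datalab output; the column names follow the convention above, and the dataset is illustrative:

```python
import pandas as pd

# Original dataset and mock audit results, aligned by row index.
data = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})
issues = pd.DataFrame({
    "is_label_issue": [False, True, False],
    "label_score": [0.9, 0.1, 0.8],
})

# Conditional logic: automatically exclude flagged examples
# from the cleaned training set.
clean_data = data[~issues["is_label_issue"]]
```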
Theoretical Basis
Issue retrieval follows a structured data access pattern:
- Unified storage: The internal DataIssues container stores per-example boolean flags and numeric scores computed by each issue manager in a single DataFrame. This avoids fragmented results across multiple objects.
- Column naming convention: Standardized column names (is_{type}_issue and {type}_score) enable programmatic access without hardcoding issue-type-specific logic.
- Optional filtering: When an issue_name is specified, only columns relevant to that issue type are returned. When None is provided, the full DataFrame with all issue types is returned.
- Score comparability: Scores are comparable across examples within the same issue type (lower is worse), but are not comparable across different issue types because each uses a fundamentally different scoring methodology.
Pseudocode
```
def get_issues(data_issues, issue_name=None):
    if issue_name is None:
        return data_issues.issues  # Full DataFrame, all issue types
    else:
        validate(issue_name in known_issue_types)
        # Return only columns for the specified issue type
        cols = [f"is_{issue_name}_issue", f"{issue_name}_score"]
        return data_issues.issues[cols]
```
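A runnable sketch of this accessor, substituting a plain wrapper around a pandas DataFrame for the internal DataIssues container (the class and validation logic here are illustrative stand-ins, not the actual Cleanlab internals):

```python
import pandas as pd

class MockDataIssues:
    """Minimal stand-in for the internal issues container."""
    def __init__(self, issues: pd.DataFrame):
        self.issues = issues

def get_issues(data_issues, issue_name=None):
    if issue_name is None:
        # Full DataFrame, all issue types.
        return data_issues.issues
    # Return only the columns for the specified issue type,
    # validating against the columns actually present.
    cols = [f"is_{issue_name}_issue", f"{issue_name}_score"]
    missing = [c for c in cols if c not in data_issues.issues.columns]
    if missing:
        raise ValueError(f"Unknown issue type: {issue_name}")
    return data_issues.issues[cols]

di = MockDataIssues(pd.DataFrame({
    "is_label_issue": [True, False],
    "label_score": [0.2, 0.9],
    "is_outlier_issue": [False, True],
    "outlier_score": [0.8, 0.1],
}))
```

Calling get_issues(di) returns all four columns, while get_issues(di, "outlier") returns only the two outlier-related columns.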