
Principle:Cleanlab Issue Retrieval

From Leeroopedia


Metadata
Sources Cleanlab, Cleanlab Docs
Domains Data_Quality, Dataset_Auditing, Data_Access
Last Updated 2026-02-09 12:00 GMT

Overview

Method for programmatically accessing per-example issue detection results as structured DataFrames for downstream analysis and action.

Description

Issue retrieval provides programmatic access to the raw results of automated issue detection. It returns a pandas DataFrame where each row corresponds to a dataset example and columns indicate whether each example has a particular type of issue and the corresponding severity score.

The returned DataFrame contains two columns per issue type:

  • is_{type}_issue (bool): Whether this example is flagged as having the given issue type.
  • {type}_score (float): A numeric quality score between 0 and 1, where lower values indicate more severe instances of the issue.

This structured output enables downstream programmatic operations:

  • Filtering: Select only examples with a specific issue type for manual review.
  • Sorting: Rank examples by severity score to prioritize the worst cases.
  • Export: Save results to CSV or other formats for external tools.
  • Conditional logic: Automatically exclude or relabel flagged examples in a data pipeline.
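As a sketch of these operations (the DataFrame values below are fabricated for illustration, not real audit output), the `is_{type}_issue` / `{type}_score` convention lets all four operations be expressed directly in pandas:

```python
import pandas as pd

# Illustrative audit results following the is_{type}_issue / {type}_score
# column convention; values are invented for demonstration.
issues = pd.DataFrame({
    "is_label_issue":   [False, True, False, True],
    "label_score":      [0.92, 0.04, 0.75, 0.11],
    "is_outlier_issue": [False, False, True, False],
    "outlier_score":    [0.88, 0.63, 0.02, 0.70],
})

# Filtering: keep only examples flagged with a label issue.
label_issues = issues[issues["is_label_issue"]]

# Sorting: rank flagged examples by severity (lower score = worse).
worst_first = label_issues.sort_values("label_score")

# Export: serialize results for external review tools.
csv_text = worst_first.to_csv()

print(list(worst_first.index))  # → [1, 3]
```

Because the flag columns are plain booleans and the score columns plain floats, no issue-type-specific parsing is needed at any step.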

Usage

Use issue retrieval after calling find_issues() to get the raw issue detection results for programmatic analysis, filtering, or export. This is the primary interface for integrating Datalab audit results into automated data cleaning pipelines.
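A minimal sketch of the pipeline-integration pattern, assuming audit results in the standard column format (the data and flags here are fabricated for illustration):

```python
import pandas as pd

# Hypothetical training data and matching per-example audit results.
data = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
issues = pd.DataFrame({
    "is_label_issue":   [False, True, False, False],
    "is_outlier_issue": [False, False, True, False],
})

# Conditional logic: drop any example flagged with any issue type
# before the data enters the training pipeline.
flag_cols = [c for c in issues.columns
             if c.startswith("is_") and c.endswith("_issue")]
clean_mask = ~issues[flag_cols].any(axis=1)
clean_data = data[clean_mask]

print(len(clean_data))  # → 2
```

Selecting flag columns by the naming convention, rather than hardcoding issue types, keeps the pipeline working unchanged as new issue types are added to the audit.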

Theoretical Basis

Issue retrieval follows a structured data access pattern:

  1. Unified storage: The internal DataIssues container stores per-example boolean flags and numeric scores computed by each issue manager in a single DataFrame. This avoids fragmented results across multiple objects.
  2. Column naming convention: Standardized column names (is_{type}_issue and {type}_score) enable programmatic access without hardcoding issue-type-specific logic.
  3. Optional filtering: When an issue_name is specified, only columns relevant to that issue type are returned. When None is provided, the full DataFrame with all issue types is returned.
  4. Score comparability: Scores are comparable across examples within the same issue type (lower is worse), but are not comparable across different issue types because each uses a fundamentally different scoring methodology.

Pseudocode

def get_issues(data_issues, issue_name=None):
    if issue_name is None:
        return data_issues.issues  # Full DataFrame, all issue types
    else:
        validate(issue_name in known_issue_types)
        # Return only columns for the specified issue type
        cols = [f"is_{issue_name}_issue", f"{issue_name}_score"]
        return data_issues.issues[cols]
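The pseudocode above can be fleshed out into a runnable sketch; `DataIssues` here is a stand-in dataclass for illustration, not Cleanlab's internal class:

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class DataIssues:
    # Stand-in for the internal unified results container:
    # one DataFrame holding flags and scores for all issue types.
    issues: pd.DataFrame

def get_issues(data_issues: DataIssues, issue_name=None) -> pd.DataFrame:
    if issue_name is None:
        return data_issues.issues  # full DataFrame, all issue types
    flag_col = f"is_{issue_name}_issue"
    score_col = f"{issue_name}_score"
    if flag_col not in data_issues.issues.columns:
        raise ValueError(f"Unknown issue type: {issue_name!r}")
    # Return only the two columns for the specified issue type.
    return data_issues.issues[[flag_col, score_col]]

# Usage with fabricated results:
di = DataIssues(pd.DataFrame({
    "is_label_issue":   [True, False],
    "label_score":      [0.1, 0.9],
    "is_outlier_issue": [False, False],
    "outlier_score":    [0.8, 0.7],
}))
print(list(get_issues(di, "label").columns))
# → ['is_label_issue', 'label_score']
```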
