Principle: Cleanlab Issue Reporting
| Metadata | |
|---|---|
| Sources | Cleanlab, Cleanlab Docs |
| Domains | Data_Quality, Dataset_Auditing, Reporting |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Mechanism for generating human-readable summaries of detected dataset quality issues, with configurable verbosity and detail levels.
Description
Issue reporting transforms the raw numerical results from automated issue detection into an interpretable text report. The report is printed to stdout and is designed for dataset curators to quickly understand what quality problems exist in their data.
The report includes:
- Issue type counts: How many examples in the dataset are flagged for each type of issue.
- Severity ranking: Issue types are sorted by severity so the most critical problems appear first.
- Top-N problematic examples: For each issue type, the report displays the most severely affected examples with their scores.
- Issue descriptions: Optional plain-language explanations of what each issue type means and how to interpret it.
- Summary scores: Optional overall severity scores per issue type.
The level of detail shown is controlled by the verbosity parameter (0-4), allowing users to get a quick overview or a deep dive depending on their needs.
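The way verbosity might gate these report components can be sketched in a few lines. This is an illustrative toy, not Cleanlab's actual implementation; the function name `build_report` and the dict schema for `issue_summary` are invented for this example:

```python
def build_report(issue_summary, verbosity=1, num_examples=5):
    """Render an issue summary at the requested level of detail.

    issue_summary: list of dicts with keys "name", "num_issues",
    "description", "score", and "top_examples" (hypothetical schema,
    not Cleanlab's real data structure). Lower score = more severe.
    """
    lines = []
    # Severity ranking: most severe issue types (lowest score) first
    for issue in sorted(issue_summary, key=lambda i: i["score"]):
        # Issue type counts are always shown, even at verbosity 0
        lines.append(f'{issue["name"]}: {issue["num_issues"]} issues found')
        if verbosity >= 1:
            # Top-N most problematic examples with their scores
            for idx, score in issue["top_examples"][:num_examples]:
                lines.append(f"  example {idx} (score={score:.2f})")
        if verbosity >= 2:
            # Optional overall severity score for this issue type
            lines.append(f'  summary score: {issue["score"]:.2f}')
        if verbosity >= 3:
            # Optional plain-language description of the issue type
            lines.append(f'  {issue["description"]}')
    return "\n".join(lines)
```

At verbosity 0 only the per-type counts are emitted; each higher level layers on top examples, summary scores, and descriptions, matching the report components listed above.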
Usage
Use issue reporting after calling find_issues() to get a human-readable overview of all detected issues before diving into specific examples. This is typically the second step in a dataset audit workflow, providing an at-a-glance summary that guides subsequent investigation.
Theoretical Basis
Issue reporting applies standard report generation techniques to the audit results:
- Aggregation: Per-example issue flags are aggregated into counts and summaries per issue type, providing a dataset-level view of data quality.
- Severity sorting: Issue types are ranked by their overall severity scores (or number of flagged examples), so the most impactful problems are surfaced first.
- Top-N display: For each issue type, the top-N most problematic examples are shown with their severity scores. This gives curators concrete examples to inspect without overwhelming them with the full dataset.
- Verbosity levels: A configurable verbosity parameter controls the amount of detail shown. Lower verbosity shows only counts and top examples; higher verbosity adds descriptions, scores, and additional context.
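The aggregation, severity-sorting, and top-N steps above can be sketched end-to-end with plain Python. The record layout and the function name `summarize_issues` are made up for illustration and do not reflect Cleanlab's internal data structures:

```python
def summarize_issues(per_example, num_examples=3):
    """Aggregate per-example issue records into a ranked summary.

    per_example: list of (example_index, issue_type, is_issue, score)
    tuples, where lower score means more severe (toy schema).
    Returns a list of (issue_type, count, top_examples) ordered so the
    issue type with the most flagged examples comes first.
    """
    by_type = {}
    for idx, issue_type, is_issue, score in per_example:
        by_type.setdefault(issue_type, []).append((idx, is_issue, score))

    summary = []
    for issue_type, rows in by_type.items():
        # Aggregation: count how many examples are flagged for this type
        count = sum(1 for _, flagged, _ in rows if flagged)
        # Top-N display: lowest scores are the most problematic examples
        worst = sorted(rows, key=lambda r: r[2])[:num_examples]
        summary.append((issue_type, count, [(i, s) for i, _, s in worst]))

    # Severity sorting: issue types with the most flagged examples first
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary
```

Ranking here uses the flagged-example count as the severity proxy; as noted above, an overall severity score per issue type could serve as the sort key instead.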
Pseudocode
def report(data_issues, num_examples, verbosity, include_description):
    summary = data_issues.issue_summary
    summary = sort_by_severity(summary)
    for issue_type in summary:
        print(issue_type.name, issue_type.num_issues, "issues found")
        if include_description:
            print(issue_type.description)
        # Show the top-N most problematic examples for this issue type
        top_examples = get_top_issues(data_issues, issue_type, num_examples)
        print(top_examples)