Implementation:Datajuicer Data juicer OverallAnalysis Analyze

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Statistics, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for computing aggregate descriptive statistics over a dataset's quality metrics.

Description

The OverallAnalysis class takes a dataset whose per-sample statistics columns have already been computed and produces a pandas DataFrame of descriptive statistics (mean, std, min, max, quantiles) for each column. Unless skip_export is set, the results are written to overall.csv and overall.md in the analysis output directory. The analysis can run in parallel across multiple processes, and the percentile points are configurable.
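As a rough illustration of the kind of summary described above, the sketch below uses plain pandas rather than Data-Juicer itself. The column names and values are made up, and `summarize` is a hypothetical helper, not part of the library; how OverallAnalysis computes its table internally is not shown here.

```python
import pandas as pd


def summarize(stats_df, percentiles=(0.25, 0.5, 0.75)):
    """Return one row of descriptive statistics per numeric column."""
    # describe() yields count, mean, std, min, the requested percentiles
    # and max; transposing puts each statistics column on its own row.
    return stats_df.describe(percentiles=list(percentiles)).T


# Hypothetical per-sample statistics; real column names depend on the
# operators that produced them.
stats_df = pd.DataFrame({
    "text_len": [10, 20, 30, 40, 50],
    "lang_score": [0.5, 0.6, 0.7, 0.8, 0.9],
})

summary = summarize(stats_df, percentiles=[0.1, 0.5, 0.9])
print(summary[["mean", "min", "10%", "50%", "90%", "max"]])
```

Each requested percentile appears as its own column (for example `10%`), alongside the fixed statistics.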

Usage

Use this class after per-sample statistics have been computed in the analysis pipeline. Pass the dataset and output path to the constructor, then call analyze() to generate the summary.
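To picture the input, the sketch below assumes each sample carries a flat dictionary of metric values under a `__dj__stats__` field. The samples, metric names, and values are invented for illustration, and the flattening is done with plain pandas, not with the library's own code.

```python
import pandas as pd

# Hypothetical samples; metric names and values are illustrative only.
samples = [
    {"text": "first doc", "__dj__stats__": {"text_len": 9, "lang_score": 0.91}},
    {"text": "second doc", "__dj__stats__": {"text_len": 10, "lang_score": 0.87}},
    {"text": "third", "__dj__stats__": {"text_len": 5, "lang_score": 0.42}},
]

# One row per sample, one column per metric.
stats_df = pd.DataFrame([s["__dj__stats__"] for s in samples])
print(stats_df.columns.tolist())
print(len(stats_df))
```

A table of this shape, one numeric column per metric, is what a per-column summary would then be computed over.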

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/analysis/overall_analysis.py
  • Lines: L16-111

Signature

class OverallAnalysis:
    def __init__(self, dataset, output_path):
        """
        Args:
            dataset: Dataset with __dj__stats__ columns.
            output_path: Directory for output files.
        """

    def analyze(
        self,
        percentiles=[],
        num_proc=1,
        skip_export=False
    ) -> pd.DataFrame:
        """
        Compute descriptive statistics for all stat columns.

        Args:
            percentiles: Custom quantile points. If left empty, the default
                points [0.25, 0.5, 0.75] are used.
            num_proc: Number of parallel analysis processes.
            skip_export: If True, skip writing files.

        Returns:
            DataFrame with per-column descriptive statistics.
        """

Import

from data_juicer.analysis.overall_analysis import OverallAnalysis

I/O Contract

Inputs

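| Name | Type | Required | Description |
|------|------|----------|-------------|
| dataset | Dataset | Yes (init) | Dataset with __dj__stats__ columns |
| output_path | str | Yes (init) | Directory for output files |
| percentiles | list | No | Custom quantile points (default points: [0.25, 0.5, 0.75]) |
| num_proc | int | No | Parallel analysis workers (default: 1) |
| skip_export | bool | No | If True, skip writing files (default: False) |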
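Each value in `percentiles` is a fraction between 0 and 1, and the reported statistic is the value below which that fraction of samples falls. The stdlib-only sketch below uses linear interpolation between neighbouring sorted values, which is the pandas default. Whether OverallAnalysis applies exactly this rule is an assumption, and `quantile` is a hypothetical helper.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    # Interpolate between the two nearest order statistics.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


text_len = [50, 10, 40, 20, 30]  # made-up values
for q in (0.1, 0.5, 0.9):
    print(f"{q:.0%}: {quantile(text_len, q)}")
```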

Outputs

| Name | Type | Description |
|------|------|-------------|
| result | pd.DataFrame | Descriptive statistics (mean, std, min, max, quantiles) per column |
| overall.csv | File | CSV export of results |
| overall.md | File | Markdown export of results |

Usage Examples

Basic Overall Analysis

from data_juicer.analysis.overall_analysis import OverallAnalysis

# dataset_with_stats: a dataset whose __dj__stats__ columns are already computed
analysis = OverallAnalysis(dataset_with_stats, './analysis_output/')
result_df = analysis.analyze(
    percentiles=[0.1, 0.25, 0.5, 0.75, 0.9],
    num_proc=4
)
print(result_df)
# Illustrative output:
#                 mean     std     min     max     10%     25%     50%     75%     90%
# text_len       450.2   123.5      10    5000   120.0   280.0   420.0   590.0   780.0
# lang_score      0.92    0.08     0.1     1.0     0.8    0.89    0.95    0.98    0.99

Related Pages

Implements Principle
