Implementation:Datajuicer Data juicer OverallAnalysis Analyze
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Statistics, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by the Data-Juicer framework, for computing aggregate descriptive statistics over a dataset's quality metrics.
Description
The OverallAnalysis class takes a dataset whose per-sample statistics columns have already been computed and produces a pandas DataFrame of descriptive statistics (mean, std, min, max, quantiles) for each column. Unless skip_export is set, the results are written to overall.csv and overall.md in the analysis output directory. Analysis can run in parallel across num_proc processes, and the percentile points are configurable.
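As a rough illustration of the kind of per-column summary described above, the following standard-library sketch computes count, mean, std, min, max, and linearly interpolated quantiles for each column. It is not the Data-Juicer implementation, which returns a pandas DataFrame; the names `quantile`, `describe_columns`, and `sample_stats` are hypothetical and exist only for this example.

```python
import statistics


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def describe_columns(stats, percentiles=(0.25, 0.5, 0.75)):
    """Return {column: {statistic: value}} for a dict of numeric columns."""
    summary = {}
    for name, values in stats.items():
        ordered = sorted(values)
        if not ordered:
            continue  # nothing to summarize for an empty column
        row = {
            "count": len(ordered),
            "mean": statistics.fmean(ordered),
            # Sample standard deviation needs at least two values.
            "std": statistics.stdev(ordered) if len(ordered) > 1 else float("nan"),
            "min": ordered[0],
            "max": ordered[-1],
        }
        for q in percentiles:
            row[f"{q * 100:g}%"] = quantile(ordered, q)
        summary[name] = row
    return summary


# Hypothetical per-sample stats, one list of values per stats column.
sample_stats = {
    "text_len": [10, 20, 30, 40, 50],
    "lang_score": [0.5, 0.7, 0.9, 1.0],
}
summary = describe_columns(sample_stats)
```

Each entry of `summary` corresponds to one stats column, so `summary["text_len"]["50%"]` is the median text length of the toy data.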
Usage
Use this class after per-sample statistics have been computed in the analysis pipeline. Pass the dataset and the output path to the constructor, then call analyze() to generate the summary.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/analysis/overall_analysis.py
- Lines: L16-111
Signature
class OverallAnalysis:
    def __init__(self, dataset, output_path):
        """
        Args:
            dataset: Dataset with __dj__stats__ columns.
            output_path: Directory for output files.
        """

    def analyze(
        self,
        percentiles=[],
        num_proc=1,
        skip_export=False
    ) -> pd.DataFrame:
        """
        Compute descriptive statistics for all stat columns.

        Args:
            percentiles: Custom quantile points (default: [0.25, 0.5, 0.75]).
            num_proc: Number of parallel analysis processes.
            skip_export: If True, skip writing files.

        Returns:
            DataFrame with per-column descriptive statistics.
        """
Import
from data_juicer.analysis.overall_analysis import OverallAnalysis
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes (init) | Dataset with __dj__stats__ columns |
| output_path | str | Yes (init) | Directory for output files |
| percentiles | list | No | Custom quantile points (default: [0.25, 0.5, 0.75]) |
| num_proc | int | No | Parallel analysis workers (default: 1) |
| skip_export | bool | No | If True, skip writing files (default: False) |
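One plausible way to reconcile custom quantile points with the documented defaults is sketched below. It assumes, without confirmation from this page, that custom points are merged with the defaults rather than replacing them; `DEFAULT_PERCENTILES` and `resolve_percentiles` are hypothetical names.

```python
# Default quantile points as stated in the analyze() docstring.
DEFAULT_PERCENTILES = (0.25, 0.5, 0.75)


def resolve_percentiles(custom=None):
    """Merge custom quantile points with the defaults, validated and sorted.

    Assumption: merging (not replacing) is the intended behavior.
    """
    points = set(DEFAULT_PERCENTILES)
    for q in custom or ():
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"percentile {q!r} is outside [0, 1]")
        points.add(q)
    return sorted(points)


merged = resolve_percentiles([0.1, 0.9, 0.5])
```

The sketch uses `None` as its default rather than a list literal, which avoids sharing one mutable list object between calls.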
Outputs
| Name | Type | Description |
|---|---|---|
| result | pd.DataFrame | Descriptive statistics (mean, std, min, max, quantiles) per column |
| overall.csv | File | CSV export of results |
| overall.md | File | Markdown export of results |
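The two file outputs can be pictured with a small standard-library sketch that writes the same summary as CSV and as a markdown table. The exact layout Data-Juicer writes may differ; `export_summary` and the `demo` data are hypothetical, and only the file names overall.csv and overall.md come from this page.

```python
import csv
import os
import tempfile


def export_summary(summary, output_path):
    """Write {column: {statistic: value}} to overall.csv and overall.md."""
    os.makedirs(output_path, exist_ok=True)
    stat_names = list(next(iter(summary.values()))) if summary else []
    header = ["column"] + stat_names
    rows = [[name] + [row[s] for s in stat_names] for name, row in summary.items()]

    csv_path = os.path.join(output_path, "overall.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    md_path = os.path.join(output_path, "overall.md")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("| " + " | ".join(header) + " |\n")
        f.write("|" + "---|" * len(header) + "\n")
        for row in rows:
            f.write("| " + " | ".join(str(v) for v in row) + " |\n")
    return csv_path, md_path


# Hypothetical summary for one stats column, written to a throwaway directory.
demo = {"text_len": {"mean": 30.0, "max": 50}}
with tempfile.TemporaryDirectory() as tmp:
    csv_path, md_path = export_summary(demo, tmp)
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = list(csv.reader(f))
    with open(md_path, encoding="utf-8") as f:
        markdown_lines = f.read().splitlines()
```

A caller that only needs the in-memory result would skip this step, which is the role skip_export plays in the signature above.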
Usage Examples
Basic Overall Analysis
from data_juicer.analysis.overall_analysis import OverallAnalysis
analysis = OverallAnalysis(dataset_with_stats, './analysis_output/')
result_df = analysis.analyze(
    percentiles=[0.1, 0.25, 0.5, 0.75, 0.9],
    num_proc=4
)
print(result_df)
#             mean    std  min   max    10%    25%    50%    75%    90%
# text_len   450.2  123.5   10  5000  120.0  280.0  420.0  590.0  780.0
# lang_score  0.92   0.08  0.1   1.0    0.8   0.89   0.95   0.98   0.99