Principle:Datajuicer Data juicer Overall Statistical Analysis
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Statistics, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A descriptive statistics aggregation technique that computes summary metrics (mean, standard deviation, quantiles) across all samples in a dataset to profile data quality.
Description
Overall Statistical Analysis computes aggregate descriptive statistics for each quality-metric column in a dataset. Given a dataset with per-sample statistics (e.g., text_len, lang_score, perplexity), it calculates the mean, standard deviation, minimum, maximum, and configurable quantiles (default: 25th, 50th, and 75th percentiles) for each metric. The result is a summary table showing how each quality metric is distributed across the dataset, which helps users identify outliers, set filter thresholds, and compare dataset versions.
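The aggregation described here can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Data-Juicer's own implementation: the function names `overall_summary` and `quantile`, the `q0.25`-style result keys, and the sample values are all hypothetical, and the choice of sample standard deviation and linear-interpolation quantiles is an assumption.

```python
import math
import statistics


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of pre-sorted values, for 0 <= q <= 1."""
    if not sorted_vals:
        raise ValueError("quantile of empty data")
    pos = q * (len(sorted_vals) - 1)  # fractional index into the sorted list
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def overall_summary(rows, percentiles=(0.25, 0.5, 0.75)):
    """Summarise every numeric column found in a list of per-sample stat dicts."""
    # Collect the values of each numeric column across all samples.
    columns = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                columns.setdefault(name, []).append(float(value))

    summary = {}
    for name, values in columns.items():
        ordered = sorted(values)
        stats = {
            "count": len(ordered),
            "mean": statistics.fmean(ordered),
            # Assumption: sample standard deviation (N - 1 denominator).
            "std": statistics.stdev(ordered) if len(ordered) > 1 else float("nan"),
            "min": ordered[0],
            "max": ordered[-1],
        }
        for q in percentiles:
            stats[f"q{q:g}"] = quantile(ordered, q)
        summary[name] = stats
    return summary


# Hypothetical per-sample statistics, one dict per sample.
samples = [
    {"text_len": 10, "lang_score": 0.9},
    {"text_len": 20, "lang_score": 0.8},
    {"text_len": 30, "lang_score": 0.7},
    {"text_len": 40, "lang_score": 0.6},
    {"text_len": 100, "lang_score": 0.5},
]
summary = overall_summary(samples)
```

In this toy data the single long sample (text_len of 100) pulls the mean of text_len above its median, which is the kind of skew the summary table is meant to expose.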
Usage
Use this principle after computing per-sample statistics to get a summary overview of dataset quality. It is the first analysis step in the Dataset Quality Analysis workflow, providing the aggregate metrics that inform column-wise and correlation analyses.
Theoretical Basis
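For each statistic column s in the dataset, with per-sample values s_1, ..., s_N, the analysis computes the mean, standard deviation, minimum, and maximum of s.
Quantiles are computed at configurable points, given as fractions between 0 and 1 (default: [0.25, 0.5, 0.75], i.e. the 25th, 50th, and 75th percentiles).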
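The per-column statistics can be written out explicitly as follows. This is a sketch under two assumptions that the text does not state: the standard deviation uses the sample (N - 1) convention, and quantiles are obtained by linear interpolation between order statistics. Here s_(1) ≤ ... ≤ s_(N) denotes the values of column s sorted in ascending order, and p is a quantile point in [0, 1].

```latex
\begin{aligned}
\mu_s &= \frac{1}{N}\sum_{i=1}^{N} s_i \\
\sigma_s &= \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(s_i-\mu_s\right)^2} \\
\min\nolimits_s &= s_{(1)}, \qquad \max\nolimits_s = s_{(N)} \\
Q_s(p) &= s_{(\lfloor h\rfloor)} + \left(h-\lfloor h\rfloor\right)\left(s_{(\lceil h\rceil)} - s_{(\lfloor h\rfloor)}\right),
\qquad h = 1 + p\,(N-1)
\end{aligned}
```

For example, with N = 5 and p = 0.25, h = 2, so the first quartile is simply the second-smallest value s_(2).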