Workflow: Data-Juicer Dataset Quality Analysis

Knowledge Sources	Data-Juicer Data-Juicer Docs
Domains	Data_Engineering, Data_Analysis, Data_Quality
Last Updated	2026-02-14 16:00 GMT

Overview

End-to-end process for profiling and analyzing dataset quality by computing filter statistics, generating distribution visualizations, and producing correlation reports without modifying the original data.

Description

This workflow uses Data-Juicer's Analyzer to evaluate dataset quality before or after processing. Rather than filtering or transforming data, the Analyzer computes statistics for each filter and tagging operator in the configuration, then runs three layers of analysis: OverallAnalysis (aggregate descriptive statistics like mean, median, percentiles), ColumnWiseAnalysis (per-column distribution histograms and box plots), and CorrelationAnalysis (pairwise correlation between computed metrics). The results are exported as statistical tables and visualization figures, helping users understand data characteristics, identify quality issues, and tune operator thresholds.

Usage

Execute this workflow when you need to understand the quality profile of a dataset before applying processing, or to validate the results of a processing pipeline. Typical inputs are JSONL or Parquet datasets. The output includes statistical summaries, distribution plots, and correlation matrices saved to an analysis directory.

Execution Steps

Step 1: Define Analysis Configuration

Create a YAML configuration file specifying the dataset path, export path, and a list of filter or tagging operators whose statistics you want to compute. The operator list uses the same syntax as the processing pipeline, but only the compute_stats phase of each filter will be executed (no samples are removed).

Key considerations:

Use the same operator names and parameters as in processing configs
Only Filter operators and Tagging operators contribute statistics
The auto flag can limit analysis to a subset for faster profiling
The percentiles configuration controls which percentiles to compute
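
A minimal sketch of such a configuration follows, written out from Python so the example stays self-contained. The key names (dataset_path, export_path, process, percentiles) and the operator names and parameters follow Data-Juicer's config style as commonly documented, but they should be checked against the config reference for the installed version. The project name, paths, and thresholds are placeholders, and write_config is a hypothetical helper. The auto flag is left out because its exact spelling in a config file is not stated here.

```python
from pathlib import Path

# Illustrative analysis config. Paths and thresholds are placeholders,
# and key names should be verified against your Data-Juicer version.
ANALYSIS_CONFIG = """\
project_name: quality-profile
dataset_path: ./data/sample.jsonl            # placeholder input path
export_path: ./outputs/analysis/stats.jsonl  # placeholder output path
percentiles: [0.25, 0.5, 0.75]

process:
  - text_length_filter:
      min_len: 10
      max_len: 10000
  - language_id_score_filter:
      lang: en
      min_score: 0.8
"""


def write_config(path: str) -> Path:
    """Write the illustrative config to disk and return its path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(ANALYSIS_CONFIG, encoding="utf-8")
    return target
```

The thresholds (min_len, min_score) are still required by the operator syntax, but they do not remove any samples during analysis, since only the statistics are computed.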

Step 2: Load Dataset

The DatasetBuilder loads the dataset using the same infrastructure as the processing pipeline. If auto mode is enabled, only a subset of samples is loaded (its size is set by auto_num), which reduces computation time for large datasets.

Key considerations:

Supports the same input formats as the processing pipeline, including JSONL, CSV, Parquet, TSV, and plain text
Auto mode samples a representative subset for faster analysis
The full dataset is loaded when auto mode is disabled
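
The effect of auto mode can be sketched with a plain JSONL reader. This is not the DatasetBuilder itself: load_jsonl_subset is a hypothetical helper, and taking the first auto_num rows is an assumption made for simplicity, since how the subset is selected is not specified here.

```python
import itertools
import json
from typing import Optional


def load_jsonl_subset(path: str, auto_num: Optional[int] = None) -> list:
    """Load a JSONL file, stopping after `auto_num` samples if given."""
    with open(path, encoding="utf-8") as handle:
        # Lazily parse non-blank lines so a small subset stays cheap.
        rows = (json.loads(line) for line in handle if line.strip())
        # islice(rows, None) yields every row, so None means "full dataset".
        return list(itertools.islice(rows, auto_num))
```

Because the rows are parsed lazily, a small auto_num avoids reading the rest of a large file.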

Step 3: Compute Statistics

For each filter operator in the process list, the Analyzer calls only the compute_stats phase, which annotates each sample with computed metrics (e.g., text length, perplexity score, language ID score) stored in the __dj__stats__ field. The actual filtering decision is skipped. For tagging operators, the full process runs to generate tags and metadata.

Key considerations:

Statistics are computed per-sample and stored as additional columns
Operator fusion can be applied to speed up multi-filter computation
Non-stats filters (those without a compute_stats phase) are skipped
If no stats are collected, a warning is raised
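
The split between computing statistics and filtering can be illustrated with a toy operator. ToyTextLengthFilter and analyze are hypothetical stand-ins rather than Data-Juicer classes; only the __dj__stats__ field name and the compute_stats/process split come from the description above.

```python
STATS_FIELD = "__dj__stats__"  # field name taken from the description above


class ToyTextLengthFilter:
    """Hypothetical stand-in for a filter operator, not the real class."""

    def __init__(self, min_len: int = 10):
        self.min_len = min_len

    def compute_stats(self, sample: dict) -> dict:
        # Annotate the sample with a metric; no keep/drop decision is made.
        sample.setdefault(STATS_FIELD, {})["text_len"] = len(sample["text"])
        return sample

    def process(self, sample: dict) -> bool:
        # Keep/drop decision, used by the processing pipeline only.
        return sample[STATS_FIELD]["text_len"] >= self.min_len


def analyze(samples: list, ops: list) -> list:
    """Run only the stats phase of each operator; never drop a sample."""
    for op in ops:
        if not hasattr(op, "compute_stats"):
            continue  # operators without a stats phase are skipped
        # Shallow-copy each sample so the caller's dicts are left untouched.
        samples = [op.compute_stats(dict(s)) for s in samples]
    return samples
```

In this sketch a sample shorter than min_len still appears in the output with its text_len recorded, which is what makes the resulting distribution useful for choosing a threshold.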

Step 4: Export Statistics

The computed statistics are exported to disk. Unlike the processing pipeline, the Analyzer exports only the stats columns by default (not the full dataset text), though the original dataset can optionally be included via the export_original_dataset flag.

Key considerations:

Stats export uses the same Exporter infrastructure as processing
Cache compression can be applied to reduce storage footprint
The exported stats serve as input for the analysis phase
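
A stats-only export can be sketched as follows. export_stats is a hypothetical helper rather than the Exporter API, and its export_original_dataset parameter only mirrors the flag named above. Writing one JSON object per line is an assumption about the output layout.

```python
import json

STATS_FIELD = "__dj__stats__"


def export_stats(samples, path, export_original_dataset=False):
    """Write one JSON object per line; stats only unless asked otherwise."""
    with open(path, "w", encoding="utf-8") as handle:
        for sample in samples:
            if export_original_dataset:
                row = sample  # full sample, including the original text
            else:
                row = {STATS_FIELD: sample.get(STATS_FIELD, {})}
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
```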

Step 5: Run Overall Analysis

OverallAnalysis computes aggregate descriptive statistics across the entire dataset for each metric column: count, mean, standard deviation, min, max, and configurable percentiles. Results are saved as a summary table.

Key considerations:

Percentiles are configurable (default includes 25th, 50th, 75th)
Results are logged and saved to the analysis output directory
This step provides the statistical foundation for column-wise analysis
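
The summary computed for each metric column can be approximated with the standard library. This is a sketch of the idea, not the OverallAnalysis implementation: describe and percentile are hypothetical helpers, and linear interpolation between neighbouring values is an assumed percentile convention.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile for q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    weight = pos - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def describe(values, percentiles=(0.25, 0.5, 0.75)):
    """Return count, mean, std, min, percentiles, and max for one column."""
    ordered = sorted(values)  # assumes a non-empty numeric column
    summary = {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        # Sample standard deviation is undefined for a single value.
        "std": statistics.stdev(ordered) if len(ordered) > 1 else float("nan"),
        "min": ordered[0],
    }
    for q in percentiles:
        summary[f"{q:.0%}"] = percentile(ordered, q)
    summary["max"] = ordered[-1]
    return summary
```

Calling describe once per metric column yields a table with one row per metric, matching the shape of the summary table described above.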

Step 6: Run Column-Wise Analysis

ColumnWiseAnalysis generates per-column distribution visualizations. For numeric fields, it produces histograms and box plots. For string fields, it produces frequency histograms. Results can be saved as individual plot files or combined into a single file.

Key considerations:

Visualization output format depends on the data type of each column
The save_stats_in_one_file flag controls single vs. multi-file output
Overall analysis results are used to annotate plots with statistical markers
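
The type-dependent behaviour can be sketched without a plotting library by computing the data a plot would show. numeric_histogram and column_summary are hypothetical helpers, and the choice of equal-width bins is an assumption.

```python
from collections import Counter


def numeric_histogram(values, bins=10):
    """Count values into equal-width bins between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width for constant columns
    counts = [0] * bins
    for value in values:
        # The maximum value would land in bin `bins`, so clamp it to the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def column_summary(values, bins=10):
    """Pick the summary by column type: bin counts or value frequencies."""
    if all(isinstance(v, (int, float)) for v in values):
        return {"kind": "numeric", "histogram": numeric_histogram(values, bins)}
    return {"kind": "string", "frequencies": Counter(map(str, values))}
```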

Step 7: Run Correlation Analysis

CorrelationAnalysis computes pairwise Pearson, Spearman, or Kendall correlation coefficients between all numeric metric columns. Results are saved as a correlation matrix and heatmap visualization.

Key considerations:

Helps identify redundant filters (highly correlated metrics)
Useful for understanding relationships between quality dimensions
Output includes both numeric matrix and visual heatmap
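
A pairwise correlation matrix over metric columns can be sketched as below. Only the Pearson coefficient is shown, and pearson and correlation_matrix are hypothetical helpers, not the CorrelationAnalysis implementation. The metric names in the usage note are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

For example, if columns named text_len and num_words have a coefficient close to 1, the two filters measure nearly the same property, and one of them is a candidate for removal.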
