Workflow:Datajuicer Data juicer Dataset Quality Analysis
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Analysis, Data_Quality |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
End-to-end process for profiling and analyzing dataset quality by computing filter statistics, generating distribution visualizations, and producing correlation reports without modifying the original data.
Description
This workflow uses Data-Juicer's Analyzer to evaluate dataset quality before or after processing. Rather than filtering or transforming data, the Analyzer computes statistics for each filter and tagging operator in the configuration, then runs three layers of analysis: OverallAnalysis (aggregate descriptive statistics such as mean, median, and percentiles), ColumnWiseAnalysis (per-column distribution histograms and box plots), and CorrelationAnalysis (pairwise correlation between computed metrics). The results are exported as statistical tables and visualization figures. They help users understand data characteristics, identify quality issues, and tune operator thresholds.
Usage
Execute this workflow when you need to understand the quality profile of a dataset before applying processing, or to validate the results of a processing pipeline. Typical inputs are JSONL or Parquet datasets. The output includes statistical summaries, distribution plots, and correlation matrices saved to an analysis directory.
Execution Steps
Step 1: Define Analysis Configuration
Create a YAML configuration file specifying the dataset path, export path, and a list of filter or tagging operators whose statistics you want to compute. The operator list uses the same syntax as the processing pipeline, but only the compute_stats phase of each filter is executed, so no samples are removed.
Key considerations:
- Use the same operator names and parameters as in processing configs
- Only Filter operators and Tagging operators contribute statistics
- The auto flag can limit analysis to a subset of samples for faster profiling
- The percentiles configuration controls which percentiles to compute
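As a rough illustration of the configuration described in this step, the sketch below builds the settings as a Python dict and prints them. The paths are placeholders, and the operator names, operator parameters, and the fractional percentile format are assumptions chosen for illustration; check them against the Data-Juicer version in use. The helpers operator_names and check_config are hypothetical and not part of Data-Juicer.

```python
import json

# Hypothetical analysis configuration. Paths are placeholders; operator
# names and parameters are illustrative assumptions.
config = {
    "dataset_path": "data/demo.jsonl",
    "export_path": "outputs/analysis/demo-stats.jsonl",
    "percentiles": [0.25, 0.5, 0.75],   # assumed to be fractions in (0, 1)
    "export_original_dataset": False,
    "save_stats_in_one_file": False,
    "process": [
        {"text_length_filter": {"min_len": 10, "max_len": 10000}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    ],
}


def operator_names(cfg):
    """Return operator names in the order they appear in the process list."""
    return [name for entry in cfg["process"] for name in entry]


def check_config(cfg):
    """Return a list of problems found by this sketch's basic checks."""
    problems = []
    for key in ("dataset_path", "export_path", "process"):
        if not cfg.get(key):
            problems.append(f"missing or empty: {key}")
    for q in cfg.get("percentiles", []):
        if not 0 < q < 1:
            problems.append(f"percentile out of range: {q}")
    return problems


print(json.dumps(config, indent=2))
print(operator_names(config))
```

The same keys would normally be written in the YAML file; the dict form is used here only so the structure can be checked programmatically.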
Step 2: Load Dataset
The DatasetBuilder loads the dataset using the same infrastructure as the processing pipeline. If auto mode is enabled, only a subset of samples is loaded (its size is set by auto_num), which reduces computation time for large datasets.
Key considerations:
- Supports the formats handled by the processing pipeline, including JSONL, CSV, Parquet, TSV, and plain text
- Auto mode samples a representative subset for faster analysis
- The full dataset is loaded when auto mode is disabled
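The effect of auto mode on loading can be sketched with the standard library alone. This is not the DatasetBuilder: load_samples is a hypothetical helper, the default of 1000 for auto_num is a placeholder, and taking the first auto_num records is an assumption about how the subset is chosen.

```python
import io
import itertools
import json


def load_samples(lines, auto=False, auto_num=1000):
    """Parse JSONL lines; in auto mode keep at most auto_num samples.

    Assumption: the subset is simply the first auto_num records.
    """
    records = (json.loads(line) for line in lines if line.strip())
    if auto:
        records = itertools.islice(records, auto_num)
    return list(records)


# In-memory stand-in for a JSONL file with five samples.
jsonl = io.StringIO(
    "\n".join(json.dumps({"text": f"sample {i}"}) for i in range(5))
)

subset = load_samples(jsonl, auto=True, auto_num=3)
jsonl.seek(0)  # rewind before reading the same stream again
full = load_samples(jsonl)
print(len(subset), len(full))
```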
Step 3: Compute Statistics
For each filter operator in the process list, the Analyzer calls only the compute_stats phase, which annotates each sample with computed metrics (e.g., text length, perplexity score, language ID score) stored in the __dj__stats__ field. The actual filtering decision is skipped. For tagging operators, the full process runs to generate tags and metadata.
Key considerations:
- Statistics are computed per-sample and stored as additional columns
- Operator fusion can be applied to speed up multi-filter computation
- Non-stats filters (those without a compute_stats phase) are skipped
- If no stats are collected, a warning is raised
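The split between computing statistics and making the filtering decision can be sketched as follows. ToyTextLengthFilter and analyze are hypothetical stand-ins rather than Data-Juicer classes; only the __dj__stats__ field name comes from the description above, and the text_len metric name is illustrative.

```python
STATS_KEY = "__dj__stats__"


class ToyTextLengthFilter:
    """Hypothetical stand-in for a filter operator with two phases."""

    def __init__(self, min_len=10):
        self.min_len = min_len

    def compute_stats(self, sample):
        # Phase 1: annotate the sample with a metric.
        sample.setdefault(STATS_KEY, {})["text_len"] = len(sample["text"])
        return sample

    def process(self, sample):
        # Phase 2: the keep/drop decision, which the analysis loop never calls.
        return sample[STATS_KEY]["text_len"] >= self.min_len


def analyze(samples, ops):
    """Run only compute_stats for each operator; no sample is removed."""
    for op in ops:
        if not hasattr(op, "compute_stats"):
            continue  # operators without a stats phase are skipped
        samples = [op.compute_stats(dict(s)) for s in samples]
    return samples


samples = [{"text": "hi"}, {"text": "a longer sentence here"}]
annotated = analyze(samples, [ToyTextLengthFilter()])
print(annotated)
```

The short sample would fail the min_len threshold in a processing run, but it is still present in the annotated output because process is never invoked.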
Step 4: Export Statistics
The computed statistics are exported to disk. Unlike the processing pipeline, the Analyzer exports only the stats columns by default (not the full dataset text), though the original dataset can optionally be included via the export_original_dataset flag.
Key considerations:
- Stats export uses the same Exporter infrastructure as processing
- Cache compression can be applied to reduce storage footprint
- The exported stats serve as input for the analysis phase
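A minimal sketch of the stats-only export, assuming a JSONL output with one row per sample. export_stats is a hypothetical helper and does not reflect the actual Exporter interface; the flag name mirrors the export_original_dataset option mentioned above.

```python
import io
import json

STATS_KEY = "__dj__stats__"


def export_stats(samples, out, export_original_dataset=False):
    """Write one JSON line per sample to the file-like object out.

    By default only the stats field is written; the full sample is
    written when export_original_dataset is True.
    """
    for sample in samples:
        if export_original_dataset:
            row = dict(sample)
        else:
            row = {STATS_KEY: sample.get(STATS_KEY, {})}
        out.write(json.dumps(row) + "\n")


samples = [{"text": "hello", STATS_KEY: {"text_len": 5}}]

stats_buf = io.StringIO()
export_stats(samples, stats_buf)

full_buf = io.StringIO()
export_stats(samples, full_buf, export_original_dataset=True)

print(stats_buf.getvalue(), full_buf.getvalue(), sep="")
```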
Step 5: Run Overall Analysis
OverallAnalysis computes aggregate descriptive statistics across the entire dataset for each metric column: count, mean, standard deviation, min, max, and configurable percentiles. Results are saved as a summary table.
Key considerations:
- Percentiles are configurable (default includes 25th, 50th, 75th)
- Results are logged and saved to the analysis output directory
- This step provides the statistical foundation for column-wise analysis
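The aggregate statistics listed in this step can be reproduced for a single metric column with the standard library. This is a sketch, not the OverallAnalysis implementation: describe and percentile are hypothetical helpers, the percentile uses linear interpolation between neighboring values, and the standard deviation is the sample (n - 1) form; the real implementation may differ on both points.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation; q is a fraction in [0, 1]."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def describe(values, percentiles=(0.25, 0.5, 0.75)):
    """Summarize one metric column: count, mean, std, min, percentiles, max."""
    summary = {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
    }
    for q in percentiles:
        summary[f"{q:.0%}"] = percentile(values, q)
    summary["max"] = max(values)
    return summary


text_len = [1, 2, 3, 4, 5]  # illustrative metric values
summary = describe(text_len)
print(summary)
```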
Step 6: Run Column Wise Analysis
ColumnWiseAnalysis generates per-column distribution visualizations. For numeric fields, it produces histograms and box plots. For string fields, it produces frequency histograms. Results can be saved as individual plot files or combined into a single file.
Key considerations:
- Visualization output format depends on the data type of each column
- The save_stats_in_one_file flag controls single vs. multi-file output
- Overall analysis results are used to annotate plots with statistical markers
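The type-dependent behavior described in this step can be sketched without a plotting library by computing the data a plot would show: equal-width bin counts for numeric columns and value frequencies for string columns. column_summary and numeric_histogram are hypothetical helpers, and the bin count of 4 is an arbitrary choice for the example.

```python
from collections import Counter


def numeric_histogram(values, bins=4):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the upper edge; clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def column_summary(values, bins=4):
    """Dispatch on data type: histogram for numbers, frequencies for strings."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return {"kind": "numeric", "histogram": numeric_histogram(values, bins)}
    return {"kind": "string", "frequencies": dict(Counter(map(str, values)))}


print(column_summary([0, 1, 2, 3, 4, 5, 6, 8]))
print(column_summary(["en", "en", "zh"]))
```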
Step 7: Run Correlation Analysis
CorrelationAnalysis computes pairwise Pearson, Spearman, or Kendall correlation coefficients between all numeric metric columns. Results are saved as a correlation matrix and heatmap visualization.
Key considerations:
- Helps identify redundant filters (highly correlated metrics)
- Useful for understanding relationships between quality dimensions
- Output includes both numeric matrix and visual heatmap
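The pairwise computation can be sketched in plain Python for the Pearson and Spearman cases (Kendall is omitted for brevity). The function names and the two metric columns are illustrative assumptions, not the CorrelationAnalysis implementation. Spearman is computed here as Pearson on average ranks, and a constant column would cause a division by zero in this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # raises ZeroDivisionError for constant input


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_matrix(columns, method=pearson):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: method(columns[a], columns[b]) for b in names} for a in names}


columns = {"text_len": [1, 2, 3, 4], "num_words": [1, 4, 9, 16]}
pearson_matrix = correlation_matrix(columns)
spearman_matrix = correlation_matrix(columns, method=spearman)
print(pearson_matrix)
print(spearman_matrix)
```

In this example the two columns are monotonically but not linearly related, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower. A pair of metrics with coefficients near 1 under either method is a candidate for the redundancy check mentioned above.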