Implementation:Datajuicer Data juicer ColumnWiseAnalysis Analyze
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Statistics, Visualization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by the Data-Juicer framework, for generating per-column distribution plots of dataset quality metrics.
Description
The ColumnWiseAnalysis class generates a histogram and box plot figure for each numeric statistics column, and a word cloud for each text/categorical column. It supports combining all plots into a single image, overlaying percentile lines on the histograms, and using precomputed overall analysis results to annotate the distributions.
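To make the numeric versus text split and the percentile overlay concrete, here is a minimal standard-library sketch. It is not Data-Juicer's implementation: the helpers `split_columns` and `percentile_lines` and the sample rows are hypothetical, and the choice of quartiles as the overlaid percentiles is an assumption.

```python
import statistics
from numbers import Real


def split_columns(stats_rows):
    """Partition column names into numeric and text groups,
    judging each column by the type of its value in the first row."""
    numeric, text = [], []
    for name, value in stats_rows[0].items():
        # bool is a Real subclass, so exclude it explicitly
        if isinstance(value, Real) and not isinstance(value, bool):
            numeric.append(name)
        else:
            text.append(name)
    return numeric, text


def percentile_lines(values):
    """Return the quartile positions at which vertical lines could be drawn."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"25%": q1, "50%": q2, "75%": q3}


# Hypothetical per-sample statistics, made up for illustration only
rows = [
    {"text_len": 10, "lang": "en"},
    {"text_len": 20, "lang": "en"},
    {"text_len": 30, "lang": "zh"},
    {"text_len": 40, "lang": "en"},
    {"text_len": 50, "lang": "fr"},
]

numeric_cols, text_cols = split_columns(rows)
lines = percentile_lines([row["text_len"] for row in rows])
print(numeric_cols, text_cols, lines)
```

In this sketch a numeric column such as `text_len` would get a histogram and box plot with lines at the computed positions, while a text column such as `lang` would be routed to a word cloud.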
Usage
Use after OverallAnalysis.analyze() in the analysis pipeline. Pass the dataset, output path, and optionally the overall result DataFrame.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/analysis/column_wise_analysis.py
- Lines: L66-333
Signature
class ColumnWiseAnalysis:
    def __init__(
        self,
        dataset,
        output_path,
        overall_result=None,
        save_stats_in_one_file=True
    ):
        """
        Args:
            dataset: Dataset with __dj__stats__ columns.
            output_path: Directory for figure output.
            overall_result: Precomputed overall stats DataFrame.
            save_stats_in_one_file: Combine all plots into one image.
        """

    def analyze(
        self,
        show_percentiles=False,
        show=False,
        skip_export=False
    ) -> None:
        """
        Generate per-column distribution visualizations.

        Args:
            show_percentiles: Draw quantile lines on histograms.
            show: Display plots interactively.
            skip_export: Skip saving figure files.
        """
Import
from data_juicer.analysis.column_wise_analysis import ColumnWiseAnalysis
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
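| dataset | Dataset | Yes | Dataset with __dj__stats__ columns |
| output_path | str | Yes | Directory for figure files |
| overall_result | pd.DataFrame | No | Precomputed descriptive statistics (default: None) |
| save_stats_in_one_file | bool | No | Combine all plots into one image (default: True) |
| show_percentiles | bool | No | Overlay quantile lines on histograms (default: False) |
| show | bool | No | Display plots interactively (default: False) |
| skip_export | bool | No | Skip saving figure files (default: False) |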
Outputs
| Name | Type | Description |
|---|---|---|
| figures | PNG files | Per-column histogram and box plot images, plus word clouds for text/categorical columns, written to the output_path directory unless skip_export=True |
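After a run, a quick way to confirm what was exported is to list the PNG files in the output directory. The sketch below uses only the standard library; `list_figures` is a hypothetical helper, and the file names in the demo are made up rather than the names Data-Juicer actually writes.

```python
import tempfile
from pathlib import Path


def list_figures(output_path):
    """Return the sorted names of PNG files directly under output_path."""
    return sorted(p.name for p in Path(output_path).glob("*.png"))


# Demo on a throwaway directory; the file names are illustrative only
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_figure.png").touch()
    (Path(tmp) / "notes.txt").touch()
    found = list_figures(tmp)

print(found)
```

In practice the helper would be pointed at the same directory that was passed as output_path.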
Usage Examples
Column-Wise Visualization
from data_juicer.analysis.column_wise_analysis import ColumnWiseAnalysis
from data_juicer.analysis.overall_analysis import OverallAnalysis

# `dataset` is assumed to be already loaded, with its __dj__stats__ columns computed

# First compute the overall stats
overall = OverallAnalysis(dataset, './analysis/')
overall_result = overall.analyze()

# Then generate the per-column plots
column_analysis = ColumnWiseAnalysis(
    dataset,
    './analysis/',
    overall_result=overall_result,
    save_stats_in_one_file=True
)
column_analysis.analyze(show_percentiles=True)
# Writes the PNG figures to the ./analysis/ directory