

Implementation:Datajuicer Data juicer ColumnWiseAnalysis Analyze

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Statistics, Visualization
Last Updated 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by the Data-Juicer framework, for generating per-column distribution plots of dataset quality metrics.

Description

The ColumnWiseAnalysis class generates histogram + box plot figures for each numeric statistics column, and word clouds for text/categorical columns. It supports combining all plots into a single image, overlaying percentile lines, and using precomputed overall analysis results to annotate distributions.
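
To make the description above more concrete, the following standard-library sketch computes the kind of numbers such figures are built from: bin counts for a histogram, percentile positions for the overlay lines, and token frequencies for a word cloud. It is not Data-Juicer's implementation. The function names (summarize_numeric, word_frequencies), the equal-width binning, the "inclusive" percentile method, and whitespace tokenization are all assumptions chosen for illustration.

```python
import statistics
from collections import Counter


def summarize_numeric(values, bins=5, percentiles=(25, 50, 75)):
    """Hypothetical helper: numbers a histogram + percentile overlay would show.

    Needs at least two values, because statistics.quantiles requires that
    on older Python versions.
    """
    data = sorted(float(v) for v in values)
    lo, hi = data[0], data[-1]
    # Equal-width bins (an assumption); a constant column falls into bin 0.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in data:
        # The maximum value lands exactly on the upper edge, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # n=100 yields 99 cut points: the 1st through 99th percentiles.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    marks = {p: cuts[p - 1] for p in percentiles}
    return {"min": lo, "max": hi, "bin_counts": counts, "percentiles": marks}


def word_frequencies(texts, top_k=3):
    """Hypothetical helper: token counts of the kind a word cloud is sized by."""
    counter = Counter(word.lower() for text in texts for word in text.split())
    return counter.most_common(top_k)


if __name__ == "__main__":
    print(summarize_numeric(range(1, 10), bins=4))
    print(word_frequencies(["en en zh", "en zh"], top_k=2))
```

The real class delegates plotting to its own drawing code; this sketch only shows how a numeric column and a text column lead to different summaries.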

Usage

Run ColumnWiseAnalysis after OverallAnalysis.analyze() in the analysis pipeline. Pass the dataset and the output path to the constructor, and optionally pass the overall result DataFrame as overall_result.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/analysis/column_wise_analysis.py
  • Lines: L66-333

Signature

class ColumnWiseAnalysis:
    def __init__(
        self,
        dataset,
        output_path,
        overall_result=None,
        save_stats_in_one_file=True
    ):
        """
        Args:
            dataset: Dataset with __dj__stats__ columns.
            output_path: Directory for figure output.
            overall_result: Precomputed overall stats DataFrame.
            save_stats_in_one_file: Combine all plots into one image.
        """

    def analyze(
        self,
        show_percentiles=False,
        show=False,
        skip_export=False
    ) -> None:
        """
        Generate per-column distribution visualizations.

        Args:
            show_percentiles: Draw quantile lines on histograms.
            show: Display plots interactively.
            skip_export: Skip saving figure files.
        """

Import

from data_juicer.analysis.column_wise_analysis import ColumnWiseAnalysis

I/O Contract

Inputs

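Name                    Type          Required  Passed to   Description
dataset                 Dataset       Yes       __init__    Dataset with __dj__stats__ columns
output_path             str           Yes       __init__    Directory for figure files
overall_result          pd.DataFrame  No        __init__    Precomputed descriptive statistics (default: None)
save_stats_in_one_file  bool          No        __init__    Combine all plots into one image (default: True)
show_percentiles        bool          No        analyze()   Overlay quantile lines on histograms (default: False)
show                    bool          No        analyze()   Display plots interactively (default: False)
skip_export             bool          No        analyze()   Skip saving figure files (default: False)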

Outputs

Name     Type       Description
figures  PNG files  Per-column histogram + box plot images (word clouds for text/categorical columns), written to output_path

Usage Examples

Column-Wise Visualization

from data_juicer.analysis.column_wise_analysis import ColumnWiseAnalysis
from data_juicer.analysis.overall_analysis import OverallAnalysis

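# `dataset` is assumed to be an already-loaded dataset whose stats have been
# computed, so that it carries the __dj__stats__ columns.

# First compute overall stats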
overall = OverallAnalysis(dataset, './analysis/')
overall_result = overall.analyze()

# Then generate per-column plots
column_analysis = ColumnWiseAnalysis(
    dataset,
    './analysis/',
    overall_result=overall_result,
    save_stats_in_one_file=True
)
column_analysis.analyze(show_percentiles=True)
# Generates PNG figures in ./analysis/ directory
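
After a run like the one above, it can be useful to confirm that figures were actually written. The helper below is a minimal sketch using only pathlib. The name list_figures is hypothetical, and the sketch assumes the PNG files sit directly under the output directory; the source does not state the file names or whether subdirectories are used.

```python
from pathlib import Path


def list_figures(output_path):
    """Hypothetical helper: sorted names of PNG files directly under output_path."""
    root = Path(output_path)
    if not root.is_dir():
        # Nothing was exported (for example, when skip_export=True was used).
        return []
    return sorted(p.name for p in root.glob("*.png"))


if __name__ == "__main__":
    for name in list_figures("./analysis/"):
        print(name)
```

An empty list after analyze() would suggest that export was skipped or that the figures were written somewhere other than the top level of the directory.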

Related Pages

Implements Principle
