Implementation:Datajuicer Data juicer CorrelationAnalysis Analyze

From Leeroopedia
Knowledge Sources
Domains: Statistics, Data_Quality, Visualization
Last Updated: 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for computing pairwise correlation matrices of dataset quality metrics and rendering them as heatmaps.

Description

The CorrelationAnalysis class computes a pairwise correlation matrix across all numeric __dj__stats__ columns using pandas DataFrame.corr(). It supports the Pearson, Kendall, and Spearman methods. The matrix is rendered as a heatmap and saved as stats-corr-{method}.png in the output directory, unless skip_export is set.
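To make the underlying computation concrete, here is a minimal sketch of a pairwise correlation over numeric columns with pandas DataFrame.corr(). It is not the Data-Juicer source: the stats frame, its column names, and its values are hypothetical stand-ins for per-sample statistics, and filtering with select_dtypes is an assumption about how non-numeric columns might be excluded.

```python
import pandas as pd

# Hypothetical per-sample statistics; names and values are illustrative only.
stats = pd.DataFrame({
    "text_len": [10, 20, 30, 40, 50],
    "lang_score": [0.9, 0.8, 0.85, 0.6, 0.5],
    "source": ["a", "b", "c", "d", "e"],  # non-numeric, so it is excluded below
})

# Keep only numeric columns (an assumption about the filtering step).
numeric = stats.select_dtypes(include="number")

# DataFrame.corr returns a square, symmetric matrix indexed by column name.
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(pearson)
print(spearman)
```

Pearson measures linear association, while Spearman works on ranks and therefore captures any monotonic relationship; comparing the two can hint at non-linear dependencies between metrics.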

Usage

Use this class after per-sample statistics have been computed. Pass the dataset and output path to the constructor, then call analyze() with the desired correlation method.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/analysis/correlation_analysis.py
  • Lines: L146-194

Signature

class CorrelationAnalysis:
    def __init__(self, dataset, output_path):
        """
        Args:
            dataset: Dataset with numeric __dj__stats__ columns.
            output_path: Directory for figure output.
        """

    def analyze(
        self,
        method='pearson',
        show=False,
        skip_export=False
    ) -> pd.DataFrame:
        """
        Compute correlation matrix and generate heatmap.

        Args:
            method: Correlation method ('pearson', 'kendall', 'spearman').
            show: Display plot interactively.
            skip_export: Skip saving figure file.

        Returns:
            pd.DataFrame correlation matrix.
        """

Import

from data_juicer.analysis.correlation_analysis import CorrelationAnalysis

I/O Contract

Inputs

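Name         Type     Required    Description
dataset      Dataset  Yes (init)  Dataset with numeric stats columns
output_path  str      Yes (init)  Directory for figure output
method       str      No          Correlation method: 'pearson', 'kendall', or 'spearman' (default: 'pearson')
show         bool     No          Display the plot interactively (default: False)
skip_export  bool     No          Skip saving the figure file (default: False)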

Outputs

Name         Type          Description
corr_matrix  pd.DataFrame  Pairwise correlation matrix
heatmap      PNG file      Heatmap figure saved as stats-corr-{method}.png
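As an illustration of the heatmap output, the sketch below renders a correlation matrix with matplotlib and saves it under the documented file name pattern. The helper save_corr_heatmap, the colormap, the figure size, and the sample matrix are all hypothetical; the actual plotting code in Data-Juicer may differ.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd


def save_corr_heatmap(corr, output_path, method):
    """Hypothetical helper: draw `corr` as a heatmap and save it as a PNG."""
    os.makedirs(output_path, exist_ok=True)
    fig, ax = plt.subplots(figsize=(4, 4))
    # Fix the color scale to [-1, 1], the full range of a correlation value.
    image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    file_path = os.path.join(output_path, f"stats-corr-{method}.png")
    fig.savefig(file_path)
    plt.close(fig)
    return file_path


# Illustrative matrix; the values are made up for the example.
names = ["text_len", "lang_score"]
corr = pd.DataFrame([[1.0, -0.5], [-0.5, 1.0]], index=names, columns=names)

out_dir = tempfile.mkdtemp()
saved = save_corr_heatmap(corr, out_dir, "pearson")
print(saved)
```

Pinning the color scale to [-1, 1] keeps heatmaps comparable across methods and datasets, since the same color always maps to the same correlation strength.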

Usage Examples

Correlation Heatmap

from data_juicer.analysis.correlation_analysis import CorrelationAnalysis

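# dataset_with_stats is assumed to already contain per-sample stats columns.
corr = CorrelationAnalysis(dataset_with_stats, './analysis/')

# Compute the Spearman correlation matrix and save the heatmap under ./analysis/.
matrix = corr.analyze(method='spearman')
print(matrix)
# Illustrative output (column names and values depend on the dataset):
#                text_len  lang_score  perplexity
# text_len        1.000      0.123      -0.456
# lang_score      0.123      1.000      -0.789
# perplexity     -0.456     -0.789       1.000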

Related Pages

Implements Principle
