Implementation:Datajuicer Data juicer CorrelationAnalysis Analyze
| Knowledge Sources | |
|---|---|
| Domains | Statistics, Data_Quality, Visualization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by the Data-Juicer framework, for computing a pairwise correlation matrix of dataset quality metrics and rendering it as a heatmap.
Description
The CorrelationAnalysis class computes a pairwise correlation matrix across all numeric __dj__stats__ columns using pandas DataFrame.corr(). It supports the Pearson, Kendall, and Spearman methods. The matrix is rendered as a heatmap and, unless skip_export is set, saved as stats-corr-{method}.png in the output directory.
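The core computation described above can be illustrated with plain pandas. This is a minimal sketch, not Data-Juicer's own code: the stats table, its column names, and the `corr_by_method` dictionary are hypothetical, and it assumes the per-sample stats have already been collected into a DataFrame.

```python
import pandas as pd

# Hypothetical per-sample stats; the column names are illustrative only.
stats = pd.DataFrame({
    "text_len": [120, 340, 560, 80, 910],
    "lang_score": [0.91, 0.85, 0.97, 0.60, 0.88],
    "perplexity": [310.0, 250.5, 120.2, 640.8, 150.1],
    "lang": ["en", "en", "zh", "en", "zh"],  # non-numeric, dropped below
})

# Keep only numeric columns, mirroring the "all numeric stats columns" rule.
numeric = stats.select_dtypes(include="number")

# DataFrame.corr() accepts the three methods named in the description.
corr_by_method = {
    method: numeric.corr(method=method)
    for method in ("pearson", "kendall", "spearman")
}

print(corr_by_method["spearman"])
```

Each result is a square, symmetric DataFrame indexed by the numeric column names, with ones on the diagonal.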
Usage
Use after per-sample statistics have been computed. Pass the dataset and output path to the constructor, then call analyze() with the desired correlation method.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/analysis/correlation_analysis.py
- Lines: L146-194
Signature
class CorrelationAnalysis:
    def __init__(self, dataset, output_path):
        """
        Args:
            dataset: Dataset with numeric __dj__stats__ columns.
            output_path: Directory for figure output.
        """

    def analyze(
        self,
        method='pearson',
        show=False,
        skip_export=False
    ) -> pd.DataFrame:
        """
        Compute correlation matrix and generate heatmap.

        Args:
            method: Correlation method ('pearson', 'kendall', 'spearman').
            show: Display plot interactively.
            skip_export: Skip saving figure file.

        Returns:
            pd.DataFrame correlation matrix.
        """
Import
from data_juicer.analysis.correlation_analysis import CorrelationAnalysis
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
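| dataset | Dataset | Yes (init) | Dataset with numeric __dj__stats__ columns |
| output_path | str | Yes (init) | Directory for figure output |
| method | str | No | Correlation method: 'pearson', 'kendall', or 'spearman' (default: 'pearson') |
| show | bool | No | Display the plot interactively (default: False) |
| skip_export | bool | No | Skip saving the figure file (default: False) |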
Outputs
| Name | Type | Description |
|---|---|---|
| corr_matrix | pd.DataFrame | Pairwise correlation matrix |
| heatmap | PNG file | stats-corr-{method}.png heatmap figure, saved unless skip_export is True |
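To make the heatmap output concrete, the sketch below renders a correlation matrix with matplotlib and saves it under the stats-corr-{method}.png naming scheme. It is an assumption-laden illustration rather than the class's actual plotting code: `save_corr_heatmap` is a hypothetical helper, and the colormap, figure size, and sample matrix are arbitrary choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd


def save_corr_heatmap(corr, output_path, method):
    """Hypothetical helper: draw `corr` as a heatmap and save it as a PNG."""
    os.makedirs(output_path, exist_ok=True)
    fig, ax = plt.subplots(figsize=(6, 5))
    # Correlations lie in [-1, 1], so pin the color scale to that range.
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label=f"{method} correlation")
    fig.tight_layout()
    target = os.path.join(output_path, f"stats-corr-{method}.png")
    fig.savefig(target)
    plt.close(fig)  # release the figure's memory
    return target


# Illustrative 2x2 matrix; the values are made up for the sketch.
corr = pd.DataFrame(
    [[1.0, 0.2], [0.2, 1.0]],
    index=["text_len", "lang_score"],
    columns=["text_len", "lang_score"],
)
saved_path = save_corr_heatmap(corr, tempfile.mkdtemp(), "pearson")
print(saved_path)
```

Fixing the color range to [-1, 1] keeps heatmaps produced with different methods visually comparable.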
Usage Examples
Correlation Heatmap
from data_juicer.analysis.correlation_analysis import CorrelationAnalysis
corr = CorrelationAnalysis(dataset_with_stats, './analysis/')
matrix = corr.analyze(method='spearman')
print(matrix)
# text_len lang_score perplexity
# text_len 1.000 0.123 -0.456
# lang_score 0.123 1.000 -0.789
# perplexity -0.456 -0.789 1.000