Implementation:Gretelai Gretel synthetics Statistical Quality Evaluation

From Leeroopedia
Knowledge Sources
Domains Statistics, Synthetic_Data, Data_Quality
Last Updated 2026-02-14 20:00 GMT

Overview

Concrete tool for evaluating the statistical fidelity of synthetic datasets by comparing distributions, correlations, memorization, and principal components against original training data.

Description

The stats module provides a statistical evaluation toolkit for comparing training and synthetic DataFrames. It implements the following capabilities:

  • Memorization detection via count_memorized_lines, which identifies duplicate rows between training and synthetic data using inner joins after careful type alignment.
  • Distribution analysis via get_categorical_field_distribution, get_numeric_distribution_bins, get_numeric_field_distribution, and compute_distribution_distance, which build per-column distributions and compare them using the Jensen-Shannon distance.
  • Correlation matrix computation via calculate_correlation, which builds a full mixed-type correlation matrix using Pearson's r (numeric-numeric), correlation ratio (categorical-numeric), and Theil's U (categorical-categorical), with parallelized computation via joblib.
  • PCA analysis via normalize_dataset and compute_pca, which prepare data and compute principal components for visual comparison of training vs. synthetic distributions.
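The inner-join idea behind memorization detection can be illustrated with plain pandas. This is a minimal sketch, not the module's implementation: the helper name count_shared_rows is hypothetical, and casting every column to str is an assumed stand-in for the "careful type alignment" mentioned above.

```python
import pandas as pd


def count_shared_rows(train: pd.DataFrame, synth: pd.DataFrame) -> int:
    """Count distinct rows present in both frames (illustrative helper)."""
    cols = list(train.columns)
    # Assumed type alignment: cast to str so that 2 and "2" compare equal.
    left = train[cols].astype(str).drop_duplicates()
    right = synth[cols].astype(str).drop_duplicates()
    # An inner join on every column keeps only rows found in both frames.
    return len(left.merge(right, how="inner", on=cols))


train = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Rome", "Lima"]})
synth = pd.DataFrame({"id": ["2", "4"], "city": ["Rome", "Oslo"]})
shared = count_shared_rows(train, synth)
print(f"Shared rows: {shared}")
```

Without the cast, the integer id column in train and the string id column in synth would not be comparable, which is why some form of type alignment has to happen before the join.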

The correlation matrix output feeds directly into the header clustering algorithm used by DataFrameBatch.

Usage

Import this module when evaluating the quality of generated synthetic data. Use distribution distance functions to compare per-column fidelity, the correlation matrix to compare structural relationships between columns, memorization counting to detect privacy leaks, and PCA to visually compare dataset distributions.

Code Reference

Source Location

Signature

def count_memorized_lines(df1: pd.DataFrame, df2: pd.DataFrame) -> int:
    """Checks for overlap between training and synthesized data."""

def get_categorical_field_distribution(field: pd.Series) -> dict:
    """Calculates the normalized distribution of a categorical field."""

def get_numeric_distribution_bins(training: pd.Series, synthetic: pd.Series):
    """Computes shared bin edges for comparing two numeric series."""

def get_numeric_field_distribution(field: pd.Series, bins) -> dict:
    """Calculates the normalized distribution of a numeric field cut into bins."""

def compute_distribution_distance(d1: dict, d2: dict) -> float:
    """Calculates the Jensen-Shannon distance between two distributions."""

def calculate_pearsons_r(x, y, opt) -> Tuple[float, float]:
    """Calculate the Pearson correlation coefficient for a pair of numeric arrays."""

def calculate_correlation_ratio(x, y, opt):
    """Calculates the Correlation Ratio for categorical-continuous association."""

def calculate_theils_u(x, y):
    """Calculates Theil's U statistic for categorical-categorical association."""

def calculate_correlation(
    df: pd.DataFrame,
    nominal_columns: List[str] = None,
    job_count: int = 4,
    opt: bool = False,
) -> pd.DataFrame:
    """Builds a full mixed-type correlation matrix."""

def normalize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Prepares a DataFrame for PCA via encoding and standardization."""

def compute_pca(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Performs PCA dimensionality reduction on a DataFrame."""

Import

from gretel_synthetics.utils.stats import (
    count_memorized_lines,
    compute_distribution_distance,
    get_categorical_field_distribution,
    get_numeric_distribution_bins,
    get_numeric_field_distribution,
    calculate_correlation,
    normalize_dataset,
    compute_pca,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| df1 / training | pd.DataFrame | Yes | Training (original) dataset |
| df2 / synthetic | pd.DataFrame | Yes | Synthetic (generated) dataset |
| field | pd.Series | Yes | Single column extracted from a DataFrame (for distribution functions) |
| bins | np.ndarray | Yes (for numeric distribution) | Bin edges from get_numeric_distribution_bins |
| d1, d2 | dict | Yes (for distance) | Distribution dicts mapping values to probabilities |
| nominal_columns | List[str] | No | Columns to treat as categorical in correlation matrix |
| job_count | int | No | Number of parallel jobs for correlation (default: 4) |
| opt | bool | No | If True, globally replace NaN with 0.0 for faster but less precise computation |
| n_components | int | No | Number of PCA components to keep (default: 2) |

Outputs

| Function | Return Type | Description |
|---|---|---|
| count_memorized_lines | int | Number of overlapping rows between training and synthetic data |
| get_categorical_field_distribution | dict | Keys are unique values, values are percentages in [0, 100] |
| get_numeric_distribution_bins | np.ndarray | Array of bin edges for histogram binning |
| get_numeric_field_distribution | dict | Keys are bin intervals, values are proportions in [0, 1] |
| compute_distribution_distance | float | Jensen-Shannon distance in [0, 1] |
| calculate_pearsons_r | Tuple[float, float] | (Pearson coefficient, two-tailed p-value) |
| calculate_correlation_ratio | float | Correlation ratio in [0, 1] |
| calculate_theils_u | float | Theil's U in [0, 1] |
| calculate_correlation | pd.DataFrame | Square correlation matrix indexed by column names |
| normalize_dataset | pd.DataFrame | Standardized DataFrame ready for PCA |
| compute_pca | pd.DataFrame | DataFrame with columns pc1, pc2, ... pcN |
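The [0, 1] range listed for compute_distribution_distance matches the Jensen-Shannon distance computed with base-2 logarithms. The following is a rough standard-library sketch under that assumption. The function js_distance is hypothetical, and treating a key missing from one dict as probability 0 is also an assumption rather than documented behavior of the module.

```python
import math


def js_distance(d1: dict, d2: dict) -> float:
    """Jensen-Shannon distance (base 2) between two value -> probability dicts."""
    # Align both distributions on the union of their keys; missing keys get 0.
    keys = sorted(set(d1) | set(d2), key=str)
    p = [d1.get(k, 0.0) for k in keys]
    q = [d2.get(k, 0.0) for k in keys]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; terms with a_i == 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # The distance is the square root of the divergence.
    return math.sqrt(max(0.0, divergence))


same = js_distance({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = js_distance({"a": 1.0}, {"b": 1.0})
print(same, disjoint)
```

Identical distributions give the lower bound and distributions with no shared support give the upper bound, so values near 0 indicate that a synthetic column tracks its training counterpart closely.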

Usage Examples

Memorization Check

import pandas as pd
from gretel_synthetics.utils.stats import count_memorized_lines

# Load training and synthetic data
training_df = pd.read_csv("training_data.csv")
synthetic_df = pd.read_csv("synthetic_data.csv")

# Check for duplicated rows
overlap = count_memorized_lines(training_df, synthetic_df)
print(f"Memorized lines: {overlap}")

Per-Column Distribution Distance

from gretel_synthetics.utils.stats import (
    get_categorical_field_distribution,
    get_numeric_distribution_bins,
    get_numeric_field_distribution,
    compute_distribution_distance,
)

# For categorical columns
train_dist = get_categorical_field_distribution(training_df["category_col"])
synth_dist = get_categorical_field_distribution(synthetic_df["category_col"])

# Normalize to probability vectors (values sum to 100, so divide)
train_norm = {k: v / 100 for k, v in train_dist.items()}
synth_norm = {k: v / 100 for k, v in synth_dist.items()}
cat_distance = compute_distribution_distance(train_norm, synth_norm)

# For numeric columns
bins = get_numeric_distribution_bins(training_df["numeric_col"], synthetic_df["numeric_col"])
train_num_dist = get_numeric_field_distribution(training_df["numeric_col"], bins)
synth_num_dist = get_numeric_field_distribution(synthetic_df["numeric_col"], bins)
num_distance = compute_distribution_distance(train_num_dist, synth_num_dist)
print(f"Categorical JS distance: {cat_distance:.4f}")
print(f"Numeric JS distance: {num_distance:.4f}")

Correlation Matrix

from gretel_synthetics.utils.stats import calculate_correlation

# Build correlation matrix for a mixed-type DataFrame
nominal_cols = ["gender", "city", "product_type"]
corr_matrix = calculate_correlation(
    training_df,
    nominal_columns=nominal_cols,
    job_count=4,
    opt=False,
)
print(corr_matrix)
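For the categorical-numeric cells of such a matrix, one common definition of the correlation ratio (eta) is the square root of between-group variance over total variance. The sketch below follows that definition with the standard library. The name correlation_ratio is illustrative, and the module's calculate_correlation_ratio may handle NaN values and the opt flag differently.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta: how much of the spread in values is explained by the category."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Weighted squared deviation of each group mean from the overall mean.
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    total = sum((v - overall_mean) ** 2 for v in values)
    return 0.0 if total == 0 else math.sqrt(between / total)


plan = ["basic", "basic", "pro", "pro"]
spend_by_plan = [10.0, 10.0, 50.0, 50.0]    # spend depends entirely on plan
spend_unrelated = [10.0, 50.0, 10.0, 50.0]  # same group means for both plans
eta_strong = correlation_ratio(plan, spend_by_plan)
eta_none = correlation_ratio(plan, spend_unrelated)
print(eta_strong, eta_none)
```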

PCA Comparison

import matplotlib.pyplot as plt
from gretel_synthetics.utils.stats import compute_pca, normalize_dataset

# Encode and standardize each dataset, then reduce it to 2 components
train_pca = compute_pca(normalize_dataset(training_df), n_components=2)
synth_pca = compute_pca(normalize_dataset(synthetic_df), n_components=2)

# Overlay the two PCA projections to visually compare distributions
plt.scatter(train_pca["pc1"], train_pca["pc2"], alpha=0.3, label="Training")
plt.scatter(synth_pca["pc1"], synth_pca["pc2"], alpha=0.3, label="Synthetic")
plt.legend()
plt.show()
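The two preparation steps behind this comparison, standardization followed by projection onto principal components, can be sketched with NumPy alone. This is an illustration under assumptions, not the module's code: standardize and pca_project are hypothetical names, the input is assumed to be fully numeric, and the categorical encoding that normalize_dataset performs is omitted.

```python
import numpy as np


def standardize(matrix):
    """Scale each column to zero mean and unit variance."""
    matrix = np.asarray(matrix, dtype=float)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return (matrix - matrix.mean(axis=0)) / std


def pca_project(matrix, n_components=2):
    """Project rows onto the top principal directions via SVD."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
projected = pca_project(standardize(data), n_components=2)
print(projected.shape)
```

Standardizing first keeps columns with large numeric ranges from dominating the leading components.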
