Implementation:Gretelai Gretel synthetics Statistical Quality Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Statistics, Synthetic_Data, Data_Quality |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
A concrete tool for evaluating the statistical fidelity of synthetic datasets by comparing their distributions, correlations, memorized rows, and principal components against the original training data.
Description
The stats module provides a statistical evaluation toolkit for comparing training and synthetic DataFrames. It implements four key capabilities:
- Memorization detection via count_memorized_lines, which identifies duplicate rows between training and synthetic data using inner joins after careful type alignment.
- Distribution analysis via get_categorical_field_distribution, get_numeric_distribution_bins, get_numeric_field_distribution, and compute_distribution_distance, which compute per-column distribution distances using the Jensen-Shannon distance.
- Correlation matrix computation via calculate_correlation, which builds a full mixed-type correlation matrix using Pearson's r (numeric-numeric), correlation ratio (categorical-numeric), and Theil's U (categorical-categorical), with parallelized computation via joblib.
- PCA analysis via normalize_dataset and compute_pca, which prepare data and compute principal components for visual comparison of training vs. synthetic distributions.
The correlation matrix output feeds directly into the header clustering algorithm used by DataFrameBatch.
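The two non-Pearson measures named above can be sketched in plain Python to show what each cell of the mixed-type matrix represents. This is a minimal illustration, not the module's code: the helper names theils_u and correlation_ratio are hypothetical, the square-root (eta) form of the correlation ratio is assumed, and returning 1.0 for a constant column is one convention among several.

```python
import math
from collections import Counter


def conditional_entropy(x, y):
    """H(X | Y) in nats for two equal-length sequences of labels."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    entropy = 0.0
    for (_, y_val), joint in xy_counts.items():
        p_xy = joint / n
        p_y = y_counts[y_val] / n
        entropy += p_xy * math.log(p_y / p_xy)
    return entropy


def theils_u(x, y):
    """U(X | Y): fraction of the entropy of X explained by Y, in [0, 1]."""
    n = len(x)
    h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # assumed convention: a constant X is trivially predictable
    return (h_x - conditional_entropy(x, y)) / h_x


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group variation / total variation), in [0, 1]."""
    groups = {}
    for cat, val in zip(categories, values):
        groups.setdefault(cat, []).append(val)
    overall_mean = sum(values) / len(values)
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0


# Theil's U is asymmetric: here x determines y, but y only partly determines x.
x = ["a", "a", "b", "c"]
y = ["p", "p", "q", "q"]
u_y_given_x = theils_u(y, x)
u_x_given_y = theils_u(x, y)
```

Because Theil's U is asymmetric, a matrix built from it is generally not symmetric in its categorical-categorical cells, which is worth keeping in mind when comparing matrices cell by cell.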
Usage
Import this module when evaluating the quality of generated synthetic data. Use distribution distance functions to compare per-column fidelity, the correlation matrix to compare structural relationships between columns, memorization counting to detect privacy leaks, and PCA to visually compare dataset distributions.
Code Reference
Source Location
- Repository: Gretelai_Gretel_synthetics
- File: src/gretel_synthetics/utils/stats.py
- Lines: 1-539
Signature
def count_memorized_lines(df1: pd.DataFrame, df2: pd.DataFrame) -> int:
    """Checks for overlap between training and synthesized data."""

def get_categorical_field_distribution(field: pd.Series) -> dict:
    """Calculates the normalized distribution of a categorical field."""

def get_numeric_distribution_bins(training: pd.Series, synthetic: pd.Series):
    """Computes shared bin edges for comparing two numeric series."""

def get_numeric_field_distribution(field: pd.Series, bins) -> dict:
    """Calculates the normalized distribution of a numeric field cut into bins."""

def compute_distribution_distance(d1: dict, d2: dict) -> float:
    """Calculates the Jensen-Shannon distance between two distributions."""

def calculate_pearsons_r(x, y, opt) -> Tuple[float, float]:
    """Calculate the Pearson correlation coefficient for a pair of numeric arrays."""

def calculate_correlation_ratio(x, y, opt):
    """Calculates the Correlation Ratio for categorical-continuous association."""

def calculate_theils_u(x, y):
    """Calculates Theil's U statistic for categorical-categorical association."""

def calculate_correlation(
    df: pd.DataFrame,
    nominal_columns: List[str] = None,
    job_count: int = 4,
    opt: bool = False,
) -> pd.DataFrame:
    """Builds a full mixed-type correlation matrix."""

def normalize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Prepares a DataFrame for PCA via encoding and standardization."""

def compute_pca(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Performs PCA dimensionality reduction on a DataFrame."""
Import
from gretel_synthetics.utils.stats import (
    count_memorized_lines,
    compute_distribution_distance,
    get_categorical_field_distribution,
    get_numeric_distribution_bins,
    get_numeric_field_distribution,
    calculate_correlation,
    normalize_dataset,
    compute_pca,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df1 / training | pd.DataFrame | Yes | Training (original) dataset |
| df2 / synthetic | pd.DataFrame | Yes | Synthetic (generated) dataset |
| field | pd.Series | Yes | Single column extracted from a DataFrame (for distribution functions) |
| bins | np.ndarray | Yes (for numeric distribution) | Bin edges from get_numeric_distribution_bins |
| d1, d2 | dict | Yes (for distance) | Distribution dicts mapping values to probabilities |
| nominal_columns | List[str] | No | Columns to treat as categorical in correlation matrix |
| job_count | int | No | Number of parallel jobs for correlation (default: 4) |
| opt | bool | No | If True, globally replace NaN with 0.0 for faster but less precise computation |
| n_components | int | No | Number of PCA components to keep (default: 2) |
Outputs
| Function | Return Type | Description |
|---|---|---|
| count_memorized_lines | int | Number of overlapping rows between training and synthetic data |
| get_categorical_field_distribution | dict | Keys are unique values, values are percentages in [0, 100] |
| get_numeric_distribution_bins | np.ndarray | Array of bin edges for histogram binning |
| get_numeric_field_distribution | dict | Keys are bin intervals, values are proportions in [0, 1] |
| compute_distribution_distance | float | Jensen-Shannon distance in [0, 1] |
| calculate_pearsons_r | Tuple[float, float] | (Pearson coefficient, two-tailed p-value) |
| calculate_correlation_ratio | float | Correlation ratio in [0, 1] |
| calculate_theils_u | float | Theil's U in [0, 1] |
| calculate_correlation | pd.DataFrame | Square correlation matrix indexed by column names |
| normalize_dataset | pd.DataFrame | Standardized DataFrame ready for PCA |
| compute_pca | pd.DataFrame | DataFrame with columns pc1, pc2, ... pcN |
Usage Examples
Memorization Check
import pandas as pd
from gretel_synthetics.utils.stats import count_memorized_lines
# Load training and synthetic data
training_df = pd.read_csv("training_data.csv")
synthetic_df = pd.read_csv("synthetic_data.csv")
# Check for duplicated rows
overlap = count_memorized_lines(training_df, synthetic_df)
print(f"Memorized lines: {overlap}")
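To make the inner-join idea concrete, the following is a minimal sketch of counting shared rows with pandas. It is not the library's implementation: count_shared_rows is a hypothetical helper, the string fallback is only one simple way to align mismatched dtypes, and this version counts distinct shared rows, which may differ from how count_memorized_lines treats duplicates.

```python
import pandas as pd


def count_shared_rows(train: pd.DataFrame, synth: pd.DataFrame) -> int:
    """Count distinct rows present in both frames (hypothetical helper)."""
    shared_cols = [c for c in train.columns if c in synth.columns]
    left = train[shared_cols].copy()
    right = synth[shared_cols].copy()
    for col in shared_cols:
        if left[col].dtype != right[col].dtype:
            # Assumed alignment strategy: fall back to a common string form
            # so that, for example, 2 and "2" can match.
            left[col] = left[col].astype(str)
            right[col] = right[col].astype(str)
    # Drop duplicates first so the join cannot multiply repeated rows.
    left = left.drop_duplicates()
    right = right.drop_duplicates()
    return len(left.merge(right, how="inner", on=shared_cols))


train = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
synth = pd.DataFrame({"id": [2, 3, 4, 2], "name": ["b", "x", "d", "b"]})
shared = count_shared_rows(train, synth)  # only the row (2, "b") is in both
```

Without some form of type alignment, a column read as integers in one frame and as strings in the other would never match, and the count would understate memorization.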
Per-Column Distribution Distance
from gretel_synthetics.utils.stats import (
    get_categorical_field_distribution,
    get_numeric_distribution_bins,
    get_numeric_field_distribution,
    compute_distribution_distance,
)
# For categorical columns
train_dist = get_categorical_field_distribution(training_df["category_col"])
synth_dist = get_categorical_field_distribution(synthetic_df["category_col"])
# Normalize to probability vectors (values sum to 100, so divide)
train_norm = {k: v / 100 for k, v in train_dist.items()}
synth_norm = {k: v / 100 for k, v in synth_dist.items()}
cat_distance = compute_distribution_distance(train_norm, synth_norm)
# For numeric columns
bins = get_numeric_distribution_bins(training_df["numeric_col"], synthetic_df["numeric_col"])
train_num_dist = get_numeric_field_distribution(training_df["numeric_col"], bins)
synth_num_dist = get_numeric_field_distribution(synthetic_df["numeric_col"], bins)
num_distance = compute_distribution_distance(train_num_dist, synth_num_dist)
print(f"Categorical JS distance: {cat_distance:.4f}")
print(f"Numeric JS distance: {num_distance:.4f}")
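For intuition about what the distance measures, here is a self-contained sketch of the Jensen-Shannon distance over two value-to-weight dicts. It is an illustration under stated assumptions rather than the module's code: js_distance is a hypothetical name, base-2 logarithms are assumed (this is what bounds the result to [0, 1]), keys missing from one dict are treated as zero weight, and both inputs are renormalized so that percentages and proportions give the same answer.

```python
import math


def js_distance(d1: dict, d2: dict) -> float:
    """Jensen-Shannon distance (base 2) between two value -> weight dicts."""
    # Align both distributions on the union of their keys.
    keys = sorted(set(d1) | set(d2), key=str)
    p = [d1.get(k, 0.0) for k in keys]
    q = [d2.get(k, 0.0) for k in keys]
    # Renormalize so each side sums to 1.
    p_total, q_total = sum(p), sum(q)
    p = [v / p_total for v in p]
    q = [v / q_total for v in q]
    # Mixture distribution used by the Jensen-Shannon divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))


same = js_distance({"a": 50, "b": 50}, {"a": 0.5, "b": 0.5})
disjoint = js_distance({"a": 1.0}, {"b": 1.0})
partial = js_distance({"a": 0.5, "b": 0.5}, {"a": 1.0})
```

Identical distributions give 0, distributions with no shared support give 1, and everything else falls in between, so lower values indicate closer per-column fidelity.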
Correlation Matrix
from gretel_synthetics.utils.stats import calculate_correlation
# Build correlation matrix for a mixed-type DataFrame
nominal_cols = ["gender", "city", "product_type"]
corr_matrix = calculate_correlation(
    training_df,
    nominal_columns=nominal_cols,
    job_count=4,
    opt=False,
)
print(corr_matrix)
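To compare structural relationships, one option is to compute the matrix once for the training data and once for the synthetic data, then summarize how far apart the two are. The helper below is a hypothetical sketch of that summary (correlation_gap is not part of the module); it averages the absolute difference over all off-diagonal cells, since a matrix containing Theil's U need not be symmetric.

```python
import numpy as np
import pandas as pd


def correlation_gap(train_corr: pd.DataFrame, synth_corr: pd.DataFrame) -> float:
    """Mean absolute difference over off-diagonal cells (hypothetical helper)."""
    # Compare only the columns both matrices share, in the same order.
    cols = [c for c in train_corr.columns if c in synth_corr.columns]
    a = train_corr.loc[cols, cols].to_numpy(dtype=float)
    b = synth_corr.loc[cols, cols].to_numpy(dtype=float)
    # Ignore the diagonal, which is each column's association with itself.
    mask = ~np.eye(len(cols), dtype=bool)
    return float(np.nanmean(np.abs(a - b)[mask]))


# Hand-written matrices stand in for two calculate_correlation outputs.
labels = ["age", "income"]
train_corr = pd.DataFrame([[1.0, 0.8], [0.8, 1.0]], index=labels, columns=labels)
synth_corr = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=labels, columns=labels)
gap = correlation_gap(train_corr, synth_corr)
```

A gap near 0 suggests the synthetic data preserves the pairwise relationships seen in training; the per-cell difference matrix is more informative when a specific column pair needs investigating.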
PCA Comparison
import matplotlib.pyplot as plt
from gretel_synthetics.utils.stats import compute_pca
# Compute 2-component PCA for visual comparison
train_pca = compute_pca(training_df, n_components=2)
synth_pca = compute_pca(synthetic_df, n_components=2)
# Plot the two PCA projections to visually compare distributions
plt.scatter(train_pca["pc1"], train_pca["pc2"], alpha=0.3, label="Training")
plt.scatter(synth_pca["pc1"], synth_pca["pc2"], alpha=0.3, label="Synthetic")
plt.legend()
plt.show()
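The standardize-then-project idea behind a PCA comparison can be sketched with NumPy alone. This is a swapped-in variant, not the module's method: the hypothetical project_on_training_axes fits the scaling and the principal axes on the training array only and projects both arrays onto those shared axes, and it assumes the inputs are already fully numeric with no missing values.

```python
import numpy as np


def project_on_training_axes(train: np.ndarray, synth: np.ndarray, n_components: int = 2):
    """Fit standardization and PCA on train, then project both arrays."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    train_z = (train - mean) / std
    synth_z = (synth - mean) / std
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(train_z, full_matrices=False)
    axes = vt[:n_components].T
    return train_z @ axes, synth_z @ axes


# Random arrays stand in for encoded training and synthetic data.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
synth = rng.normal(size=(30, 4))
train_proj, synth_proj = project_on_training_axes(train, synth)
```

Projecting both datasets onto the same axes means that differences between the two scatter clouds reflect the data rather than differences between two separately fitted projections.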