Principle:Gretelai Gretel synthetics Synthetic Data Quality Evaluation

From Leeroopedia
Knowledge Sources
Domains Statistics, Synthetic_Data, Data_Quality
Last Updated 2026-02-14 20:00 GMT

Overview

Theoretical framework for measuring the statistical fidelity, privacy, and structural similarity of synthetic datasets compared to their original training data.

Description

Synthetic Data Quality Evaluation encompasses a set of statistical methods used to answer the question: "How faithfully does a synthetic dataset reproduce the properties of the training data, while avoiding memorization?" This involves four complementary dimensions:

1. Distribution Fidelity: Comparing per-column value distributions between training and synthetic data. For categorical columns, this means comparing frequency distributions. For numeric columns, data is first binned into shared histogram bins, then distributions are compared using Jensen-Shannon divergence, which provides a symmetric, bounded measure of similarity between two probability distributions.

2. Correlation Structure: Comparing the inter-column relationships. A mixed-type correlation matrix is built using the appropriate statistical measure for each column-pair type: Pearson's r for numeric-numeric, Correlation Ratio (eta) for categorical-numeric, and Theil's U (Uncertainty Coefficient) for categorical-categorical. Comparing these matrices between training and synthetic data reveals whether the synthetic data preserves multivariate dependencies.

3. Memorization Detection: Checking whether the synthetic data simply copied rows from the training data. This is a privacy concern; a good synthetic generator should produce novel rows that share statistical properties with, but are not identical to, the original data.

4. Dimensionality Reduction: Using Principal Component Analysis (PCA) to project both datasets into a low-dimensional space for visual inspection. If the synthetic data captures the training data's structure, the PCA projections should overlap.
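As an illustration of the memorization check in item 3, the following is a minimal sketch that counts synthetic rows matching a training row exactly. The function name `count_memorized_rows` and the toy rows are hypothetical, and counting every matching synthetic row (rather than each distinct row once) is an assumption; an actual implementation may deduplicate first or use fuzzy matching.

```python
def count_memorized_rows(train_rows, synth_rows):
    """Count synthetic rows that are exact copies of some training row.

    Rows are compared on all columns at once; both inputs are iterables
    of equal-width sequences. (Hypothetical helper for illustration.)
    """
    train_set = {tuple(row) for row in train_rows}
    return sum(1 for row in synth_rows if tuple(row) in train_set)


# Toy data: one synthetic row repeats a training row verbatim.
train = [("alice", 34, "NY"), ("bob", 51, "CA"), ("carol", 29, "TX")]
synth = [("dana", 33, "NY"), ("bob", 51, "CA"), ("erin", 47, "CA")]

memorized = count_memorized_rows(train, synth)
memorized_fraction = memorized / len(synth)
print(memorized, memorized_fraction)
```

A low count is what a privacy check hopes to see, but exact matching cannot detect near-copies, so it is best read as a lower bound on memorization.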

Usage

Apply this principle after generating synthetic data from any model (ACTGAN, DGAN, LSTM) to evaluate output quality. Use distribution distance for per-column fidelity, correlation comparison for structural accuracy, memorization counting for privacy validation, and PCA for holistic visual assessment.
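For the PCA step, one possible sketch using NumPy is shown below. It assumes the projection is fitted on the standardized training data only and then applied to both datasets; fitting on the training data alone, the choice of two components, and the helper names `fit_pca` and `project` are assumptions made for illustration, not a description of any particular library.

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit PCA on X (rows = samples) via SVD of the standardized data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components].T


def project(X, mean, std, components):
    """Apply the training-set standardization and projection to X."""
    return ((X - mean) / std) @ components


# Toy numeric data: both samples drawn from the same distribution.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]
train = rng.multivariate_normal([0, 0, 0], cov, size=500)
synth = rng.multivariate_normal([0, 0, 0], cov, size=500)

mean, std, comps = fit_pca(train)
train_2d = project(train, mean, std, comps)
synth_2d = project(synth, mean, std, comps)
```

Plotting `train_2d` and `synth_2d` as two scatter layers gives the visual overlap check described above. Categorical columns would need to be encoded numerically before this step.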

Theoretical Basis

The core statistical measures used are:

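Jensen-Shannon Divergence (JSD): $\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M)$ where $M = \tfrac{1}{2}(P + Q)$ and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. JSD is symmetric and bounded in [0, 1] when using base-2 logarithms.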
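The JSD definition can be sketched in plain Python as follows. The helper `shared_histograms` illustrates binning two numeric columns over a common range before comparison; its name, the equal-width binning, and the default of 10 bins are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def shared_histograms(train, synth, n_bins=10):
    """Bin both numeric samples on the same equal-width bin edges."""
    lo = min(min(train), min(synth))
    hi = max(max(train), max(synth))
    width = (hi - lo) / n_bins or 1.0  # fall back if all values are equal

    def hist(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return hist(train), hist(synth)


train_col = [1, 2, 2, 3, 3, 3, 4, 4, 5]
synth_col = [1, 2, 3, 3, 3, 4, 4, 5, 5]
p, q = shared_histograms(train_col, synth_col, n_bins=5)
jsd = js_divergence(p, q)
print(jsd)
```

Because the mixture `m` is positive wherever `p` or `q` is, the KL terms never divide by zero, which is one practical reason to prefer JSD over raw KL divergence for sparse histograms.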

Pearson's r: Measures linear correlation between two numeric variables: $r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$
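A direct transcription of the Pearson formula into Python might look like the sketch below. Returning 0.0 when either variable has zero variance is a convention chosen here for illustration; the coefficient is mathematically undefined in that case.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom > 0 else 0.0  # assumed convention for constants


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing
print(r_up, r_down)
```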

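Correlation Ratio (eta): Measures the association between a categorical variable and a numeric variable as the share of the numeric variable's total variation that is explained by differences between category means: $\eta = \sqrt{\dfrac{\sum_c n_c (\bar{y}_c - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}}$ where $n_c$ and $\bar{y}_c$ are the size and mean of category $c$. Equivalently, it compares the variance of the numeric variable within each category to its overall variance: eta is 0 when all category means coincide and 1 when the numeric variable is constant within every category.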
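The correlation ratio can be sketched as the square root of the between-category sum of squares divided by the total sum of squares. The function below is a minimal illustration; the name `correlation_ratio`, the argument order, and the 0.0 result for a constant numeric column are assumptions.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric column."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    overall_mean = sum(values) / len(values)
    ss_total = sum((v - overall_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # assumed convention: constant numeric column
    # Weighted squared deviation of each category mean from the overall mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


# Category fully determines the value, so eta is 1.
eta_strong = correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5])
# Category means are identical, so eta is 0.
eta_none = correlation_ratio(["a", "b", "a", "b"], [1, 1, 5, 5])
print(eta_strong, eta_none)
```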

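Theil's U (Uncertainty Coefficient): An asymmetric measure of association between two categorical variables, based on conditional entropy: $U(X \mid Y) = \dfrac{H(X) - H(X \mid Y)}{H(X)}$ where $H(X)$ is the entropy of $X$ and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$.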
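A small sketch of Theil's U built from empirical entropies is given below, mainly to make the asymmetry concrete. Returning 1.0 when X has zero entropy (a constant column is trivially predictable) is an assumed convention, not something the formula itself prescribes.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy H(X) in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Empirical conditional entropy H(X | Y) in bits."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) * log2(p(x, y) / p(y)), with the 1/n factors cancelling in the log
    return -sum(
        (c / n) * math.log2(c / y_counts[yv]) for (_, yv), c in xy_counts.items()
    )


def theils_u(x, y):
    """U(X | Y): fraction of the uncertainty in X removed by knowing Y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # assumed convention for a constant X
    return (h_x - conditional_entropy(x, y)) / h_x


x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "w"]
u_x_given_y = theils_u(x, y)  # Y determines X completely
u_y_given_x = theils_u(y, x)  # X only partly determines Y
print(u_x_given_y, u_y_given_x)
```

Here knowing `y` pins down `x` exactly, while knowing `x` leaves one bit of doubt about `y` in half the rows, so the two directions give different values.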

Pseudo-code for mixed correlation matrix:

# Abstract algorithm (NOT real implementation)
for each column pair (A, B):
    if both numeric:
        corr[A][B] = pearson_r(A, B)
    elif one categorical, one numeric:
        corr[A][B] = correlation_ratio(categorical, numeric)
    elif both categorical:
        corr[A][B] = theils_u(A, B)  # asymmetric
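The pseudo-code can be turned into a runnable sketch along the following lines. Everything here is illustrative: the table is a plain dict of column lists, `is_numeric` is a naive type check, the three measures are compact re-implementations, and `mean_abs_difference` is one hypothetical way to summarize how far two matrices are apart.

```python
import math
from collections import Counter, defaultdict


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom > 0 else 0.0


def correlation_ratio(cats, nums):
    groups = defaultdict(list)
    for c, v in zip(cats, nums):
        groups[c].append(v)
    mean = sum(nums) / len(nums)
    ss_total = sum((v - mean) ** 2 for v in nums)
    ss_between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def theils_u(x, y):
    n = len(x)
    h_x = -sum(c / n * math.log2(c / n) for c in Counter(x).values())
    y_counts = Counter(y)
    h_x_given_y = -sum(
        c / n * math.log2(c / y_counts[yv]) for (_, yv), c in Counter(zip(x, y)).items()
    )
    return (h_x - h_x_given_y) / h_x if h_x > 0 else 1.0


def is_numeric(column):
    # Naive check (assumption): ints and floats are numeric, everything else categorical.
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column)


def mixed_correlation(table):
    """Build a nested dict corr[a][b] choosing the measure by column types."""
    corr = {a: {} for a in table}
    for a, col_a in table.items():
        for b, col_b in table.items():
            num_a, num_b = is_numeric(col_a), is_numeric(col_b)
            if num_a and num_b:
                corr[a][b] = pearson_r(col_a, col_b)
            elif num_a:
                corr[a][b] = correlation_ratio(col_b, col_a)
            elif num_b:
                corr[a][b] = correlation_ratio(col_a, col_b)
            else:
                corr[a][b] = theils_u(col_a, col_b)  # asymmetric: U(a | b)
    return corr


def mean_abs_difference(corr_a, corr_b):
    """Average absolute cell-wise gap between two matrices with the same keys."""
    diffs = [abs(corr_a[a][b] - corr_b[a][b]) for a in corr_a for b in corr_a[a]]
    return sum(diffs) / len(diffs)


table = {
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30, 42, 70, 80, 55, 35],
    "region": ["N", "N", "S", "S", "N", "N"],
    "plan": ["basic", "basic", "pro", "pro", "basic", "basic"],
}
corr = mixed_correlation(table)
```

Calling `mixed_correlation` on the training table and on the synthetic table, then passing both results to `mean_abs_difference`, gives a single number for how well the dependency structure was preserved. Note that only the categorical-categorical cells are asymmetric.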
