
Principle:Gretelai Gretel synthetics Conditional Data Sampling

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, GAN, Tabular_Data
Last Updated 2026-02-14 19:00 GMT

Overview

Conditional data sampling is the process of generating synthetic tabular data from a trained GAN by feeding random noise and optional conditioning signals through the generator network, then reversing the data transformations to recover the original data representation.

Description

After a GAN has been trained on tabular data, the generator network can produce synthetic samples by accepting two inputs: a random noise vector (sampled from a standard normal distribution) and an optional conditional vector that steers the generated output toward specified column values. The generation pipeline consists of several stages:

  1. Conditional vector preparation: Depending on the conditional vector type and whether user-specified conditions are provided, a conditional vector is constructed. For SINGLE_DISCRETE mode, one discrete column's distribution is sampled to guide generation. For ANYWAY mode, either the user's conditions are encoded into a full-width conditional vector, or a zero vector is used for unconditional generation.
  2. Batch-wise generation: The required number of rows n is generated in batches of batch_size. For each batch, random noise is sampled, the generator produces encoded output, and activation functions (tanh, sigmoid, gumbel_softmax) are applied column-by-column.
  3. Inverse transformation: The encoded generator output is converted back to the original data representation through the DataTransformer.inverse_transform() method, which reverses the Bayesian GMM normalization for continuous columns and the one-hot/binary encoding for discrete columns.
  4. Force conditioning (optional): If force_conditioning=True, the conditioned column values are directly overwritten in the output DataFrame, bypassing the stochastic nature of GAN generation for those columns.

A key design choice is that the generator is switched to eval mode during sampling (disabling dropout and using running statistics for batch normalization) and switched back to train mode afterwards.
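The batch-wise generation in stage 2 can be sketched as follows. This is a minimal illustration, not the library's code: the generator here is a stand-in stub (ACTGAN's real generator is a trained neural network), and the function name `sample_batched` is an illustrative assumption.

```python
import numpy as np

def sample_batched(generator, n, batch_size, embedding_dim, rng=None):
    """Generate n rows in batches, trimming the overshoot of the last batch."""
    rng = rng or np.random.default_rng()
    batches = []
    steps = -(-n // batch_size)  # ceil(n / batch_size)
    for _ in range(steps):
        z = rng.standard_normal((batch_size, embedding_dim))  # z ~ N(0, I)
        batches.append(generator(z))
    return np.concatenate(batches, axis=0)[:n]  # drop rows beyond n

# Identity stub standing in for the trained generator network.
out = sample_batched(lambda z: z, n=10, batch_size=4, embedding_dim=3)
```

Because n is rarely a multiple of batch_size, the last batch overshoots and the concatenated result is trimmed back to exactly n rows.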

Usage

Use this principle when generating synthetic data after training an ACTGAN model. The sampling API supports:

  • Unconditional generation: model.sample(n) generates n rows without any conditions
  • Conditional generation: model.sample(n, conditions={"column_name": value}) generates rows where the specified columns are biased toward the given values
  • Force conditioning: When force_conditioning=True was set during model initialization, conditioned columns are directly set to the requested values in the output

Theoretical Basis

The sampling process follows the reverse path of the training data flow:

Sampling Pipeline:
    z ~ N(0, I)                     # Random noise of shape [batch_size, embedding_dim]
    cond_vec = encode(conditions)   # Conditional vector or zero vector
    input = concat(z, cond_vec)     # Generator input
    raw_output = G(input)           # Raw generator output
    activated = apply_activations(raw_output)  # Per-column activation functions
    original_data = inverse_transform(activated)  # Reverse data transformation

Conditional Vector Construction

For SINGLE_DISCRETE mode (no user conditions):

Sample from the original conditional vector distribution learned during training.
This selects one discrete column and one category within it, weighted by log-frequency.
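Log-frequency weighting flattens skewed category distributions so rare categories are still selected during sampling. A sketch of that weighting, assuming probabilities proportional to the log of each category's count (the exact smoothing used by the library may differ):

```python
import numpy as np

def sample_category(counts, rng=None):
    """Pick a category index with probability proportional to log(count)."""
    rng = rng or np.random.default_rng(0)
    logw = np.log(np.asarray(counts, dtype=float))
    probs = logw / logw.sum()
    return rng.choice(len(counts), p=probs)

# A 900/90/10 split: log-weighting keeps the rare category viable,
# where raw-frequency weighting would pick it only ~1% of the time.
idx = sample_category([900, 90, 10])
```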

For ANYWAY mode with user conditions:

For each column in the data:
    If column is in conditions:
        Encode the condition value using DataTransformer
    Else:
        Fill with zeros (unconditioned)
Repeat the conditional vector for every row in the batch.
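The ANYWAY-mode construction above can be sketched with a simple one-hot layout. The `column_categories` structure and the encoding here are illustrative stand-ins for the DataTransformer's actual metadata:

```python
import numpy as np

def build_cond_vec(conditions, column_categories, batch_size):
    """One-hot encode conditioned columns, zero-fill the rest, tile per row."""
    parts = []
    for column, categories in column_categories:
        block = np.zeros(len(categories))
        if column in conditions:
            block[categories.index(conditions[column])] = 1.0
        parts.append(block)
    row = np.concatenate(parts)
    return np.tile(row, (batch_size, 1))  # repeat for every row in the batch

cols = [("color", ["red", "green", "blue"]), ("size", ["S", "M"])]
cv = build_cond_vec({"color": "blue"}, cols, batch_size=4)
```

With no conditions supplied, the same function returns the all-zero vector described for unconditional ANYWAY-mode generation.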

For ANYWAY mode without conditions:

Use a zero vector of shape [batch_size, cond_vec_dim].

Inverse Transformation

The inverse transformation reverses the DataTransformer encoding:

For each column in column_transform_info_list:
    Extract the column's slice from the encoded data
    If continuous:
        Take the normalized value and component one-hot vector
        Select component via argmax
        Apply reverse BayesianGMM normalization
    If discrete:
        Apply reverse OneHotEncoder or BinaryEncoder transform
Reconstruct DataFrame with original column names and dtypes.
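A numpy sketch of the loop above for one continuous and one discrete column. The encoded layout `[alpha | mode one-hot | category one-hot]` and the reverse normalization `value = alpha * 4 * std[k] + mean[k]` follow the usual CTGAN-style mode-specific normalization; the fitted means/stds here are illustrative stand-ins for the Bayesian GMM parameters:

```python
import numpy as np

def inverse_transform(encoded, means, stds, categories):
    """Decode [alpha | mode one-hot | category one-hot] rows to raw values."""
    k = len(means)
    alpha = encoded[:, 0]
    mode = encoded[:, 1:1 + k].argmax(axis=1)       # select GMM component
    value = alpha * 4 * stds[mode] + means[mode]    # reverse normalization
    cat_idx = encoded[:, 1 + k:].argmax(axis=1)     # reverse one-hot encoding
    labels = [categories[i] for i in cat_idx]
    return value, labels

means = np.array([0.0, 10.0])
stds = np.array([1.0, 2.0])
# alpha=0.25, mode 1 selected, category one-hot picks "B".
enc = np.array([[0.25, 0.0, 1.0, 0.0, 1.0]])
vals, labs = inverse_transform(enc, means, stds, ["A", "B"])  # → 12.0, ["B"]
```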

Related Pages

Implemented By
