Principle:Gretelai Gretel synthetics Conditional Data Sampling
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, GAN, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Conditional data sampling is the process of generating synthetic tabular data from a trained GAN by feeding random noise and optional conditioning signals through the generator network, then reversing the data transformations to recover the original data representation.
Description
After a GAN has been trained on tabular data, the generator network can produce synthetic samples by accepting two inputs: a random noise vector (sampled from a standard normal distribution) and an optional conditional vector that steers the generated output toward specified column values. The generation pipeline consists of several stages:
- Conditional vector preparation: Depending on the conditional vector type and whether user-specified conditions are provided, a conditional vector is constructed. For SINGLE_DISCRETE mode, one discrete column's distribution is sampled to guide generation. For ANYWAY mode, either the user's conditions are encoded into a full-width conditional vector, or a zero vector is used for unconditional generation.
- Batch-wise generation: The required number of rows n is generated in batches of batch_size. For each batch, random noise is sampled, the generator produces encoded output, and activation functions (tanh, sigmoid, gumbel_softmax) are applied column-by-column.
- Inverse transformation: The encoded generator output is converted back to the original data representation through the DataTransformer.inverse_transform() method, which reverses the Bayesian GMM normalization for continuous columns and the one-hot/binary encoding for discrete columns.
- Force conditioning (optional): If force_conditioning=True, the conditioned column values are directly overwritten in the output DataFrame, bypassing the stochastic nature of GAN generation for those columns.
A key design choice is that the generator is switched to eval mode during sampling (disabling dropout and using running statistics for batch normalization) and switched back to train mode afterwards.
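The batch loop and the eval/train switch can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: StubGenerator, eval_mode, and sample_rows are hypothetical names, the generator is a stand-in that simply truncates its noise input, and generating whole batches and then trimming to n rows is an assumed strategy.

```python
import math
import random
from contextlib import contextmanager


class StubGenerator:
    """Hypothetical stand-in for a trained generator network."""

    def __init__(self, out_dim):
        self.out_dim = out_dim
        self.training = True  # mirrors the train/eval flag of a real network

    def __call__(self, noise_batch):
        # A real generator is a neural network; here each output row is a
        # truncated copy of its noise row so the example runs standalone.
        return [row[: self.out_dim] for row in noise_batch]


@contextmanager
def eval_mode(generator):
    """Use eval mode while sampling, then restore the previous mode."""
    previous = generator.training
    generator.training = False
    try:
        yield generator
    finally:
        generator.training = previous


def sample_rows(generator, n, batch_size, embedding_dim, seed=0):
    rng = random.Random(seed)
    steps = math.ceil(n / batch_size)  # whole batches needed to cover n rows
    rows = []
    with eval_mode(generator):
        for _ in range(steps):
            # z ~ N(0, I) with shape [batch_size, embedding_dim]
            noise = [
                [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
                for _ in range(batch_size)
            ]
            rows.extend(generator(noise))
    return rows[:n]  # drop surplus rows produced by the last batch


gen = StubGenerator(out_dim=3)
synthetic = sample_rows(gen, n=10, batch_size=4, embedding_dim=5)
```

Wrapping the mode switch in a try/finally block means the generator returns to its previous mode even if sampling raises an error partway through.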
Usage
Use this principle when generating synthetic data after training an ACTGAN model. The sampling API supports:
- Unconditional generation: model.sample(n) generates n rows without any conditions
- Conditional generation: model.sample(n, conditions={"column_name": value}) generates rows where the specified columns are biased toward the given values
- Force conditioning: When force_conditioning=True was set during model initialization, conditioned columns are directly set to the requested values in the output
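The effect of force conditioning on the output can be illustrated without the library. The sketch below assumes generated rows are plain dictionaries rather than a DataFrame; force_conditions is a hypothetical helper and the column names are made up for the example.

```python
def force_conditions(rows, conditions):
    """Return copies of the rows with conditioned columns overwritten."""
    forced = []
    for row in rows:
        unknown = set(conditions) - set(row)
        if unknown:
            raise KeyError(f"unknown condition columns: {sorted(unknown)}")
        # Later keys win, so the requested values replace the generated ones.
        forced.append({**row, **conditions})
    return forced


# Hypothetical generator output: the second row ignored the condition.
generated = [
    {"age": 41, "state": "CA", "plan": "basic"},
    {"age": 29, "state": "NY", "plan": "premium"},
]
forced = force_conditions(generated, {"state": "CA"})
```

Only the conditioned columns change. The remaining columns keep whatever the generator produced, so a forced row may combine values the generator would not have emitted together.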
Theoretical Basis
The sampling process follows the reverse path of the training data flow:
Sampling Pipeline:
z ~ N(0, I) # Random noise of shape [batch_size, embedding_dim]
cond_vec = encode(conditions) # Conditional vector or zero vector
input = concat(z, cond_vec) # Generator input
raw_output = G(input) # Raw generator output
activated = apply_activations(raw_output) # Per-column activation functions
original_data = inverse_transform(activated) # Reverse data transformation
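The apply_activations step can be sketched as a walk over column slices of one raw output row. This is a standard-library illustration under assumptions: the layout format of (width, activation) pairs, the temperature tau=0.2, and all function names here are hypothetical, and a real implementation would operate on tensors.

```python
import math
import random


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    # Add Gumbel(0, 1) noise to each logit, then apply a tempered softmax.
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    return softmax(noisy)


def apply_activations(raw_row, layout, rng, tau=0.2):
    """layout: list of (width, kind) pairs describing consecutive slices."""
    out, start = [], 0
    for width, kind in layout:
        chunk = raw_row[start:start + width]
        if kind == "tanh":
            out.extend(math.tanh(x) for x in chunk)
        elif kind == "sigmoid":
            # Numerically stable sigmoid expressed through tanh.
            out.extend(0.5 * (1.0 + math.tanh(x / 2.0)) for x in chunk)
        elif kind == "softmax":
            out.extend(gumbel_softmax(chunk, tau, rng))
        else:
            raise ValueError(f"unknown activation: {kind}")
        start += width
    return out


# One scalar slice followed by two categorical slices (3 and 2 categories).
layout = [(1, "tanh"), (3, "softmax"), (2, "softmax")]
raw = [0.3, 1.0, -2.0, 0.5, 2.0, 2.0]
activated = apply_activations(raw, layout, random.Random(0))
```

Each softmax slice sums to 1 and can be read as a distribution over categories, while tanh slices stay within [-1, 1].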
Conditional Vector Construction
For SINGLE_DISCRETE mode (no user conditions):
Sample from the original conditional vector distribution learned during training.
This selects one discrete column and one category within it, weighted by log-frequency.
For ANYWAY mode with user conditions:
For each column in the data:
If column is in conditions:
Encode the condition value using DataTransformer
Else:
Fill with zeros (unconditioned)
Repeat the conditional vector for every row in the batch.
For ANYWAY mode without conditions:
Use a zero vector of shape [batch_size, cond_vec_dim].
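The ANYWAY-mode construction above can be sketched for the simplest case. The example assumes every column is discrete and one-hot encoded; build_cond_vec and the column layout are hypothetical, and the real encoding goes through DataTransformer and also covers continuous columns.

```python
def build_cond_vec(columns, conditions, batch_size):
    """Build a full-width conditional vector and repeat it for each row.

    columns: dict mapping column name to its ordered list of categories
             (an assumed one-hot layout, in column order).
    conditions: dict mapping column name to the requested category.
    """
    unknown = set(conditions) - set(columns)
    if unknown:
        raise KeyError(f"unknown condition columns: {sorted(unknown)}")

    row = []
    for name, categories in columns.items():
        block = [0.0] * len(categories)  # zeros mean "unconditioned"
        if name in conditions:
            block[categories.index(conditions[name])] = 1.0
        row.extend(block)
    # Every row in the batch receives its own copy of the same vector.
    return [list(row) for _ in range(batch_size)]


columns = {"state": ["CA", "NY", "TX"], "plan": ["basic", "premium"]}
conditioned = build_cond_vec(columns, {"plan": "premium"}, batch_size=3)
unconditioned = build_cond_vec(columns, {}, batch_size=2)
```

With no conditions the function returns an all-zero vector of shape [batch_size, cond_vec_dim], matching the unconditional case.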
Inverse Transformation
The inverse transformation reverses the DataTransformer encoding:
For each column in column_transform_info_list:
Extract the column's slice from the encoded data
If continuous:
Take the normalized value and component one-hot vector
Select component via argmax
Apply reverse BayesianGMM normalization
If discrete:
Apply reverse OneHotEncoder or BinaryEncoder transform
Reconstruct DataFrame with original column names and dtypes.
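The per-column loop above can be sketched for a single encoded row. The sketch makes several assumptions that the text does not confirm: the continuous slice is laid out as [normalized value, component scores...], the reverse normalization is value * 4 * std + mean of the selected component (the multiplier 4 and the clipping to [-1, 1] are assumed), and discrete columns are one-hot encoded. All names are hypothetical, and the final DataFrame reconstruction is omitted.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def inverse_continuous(chunk, means, stds, std_multiplier=4.0):
    normalized, component_scores = chunk[0], chunk[1:]
    component = argmax(component_scores)  # select the mixture component
    normalized = max(-1.0, min(1.0, normalized))  # assumed clipping step
    return normalized * std_multiplier * stds[component] + means[component]


def inverse_discrete(chunk, categories):
    return categories[argmax(chunk)]  # reverse of one-hot encoding


def inverse_transform_row(encoded_row, column_info):
    """column_info: list of dicts, one per column, in encoding order."""
    result, start = {}, 0
    for info in column_info:
        chunk = encoded_row[start:start + info["width"]]
        if info["kind"] == "continuous":
            result[info["name"]] = inverse_continuous(
                chunk, info["means"], info["stds"]
            )
        else:
            result[info["name"]] = inverse_discrete(chunk, info["categories"])
        start += info["width"]
    return result


column_info = [
    {"name": "income", "kind": "continuous", "width": 3,
     "means": [10.0, 50.0], "stds": [2.0, 5.0]},
    {"name": "state", "kind": "discrete", "width": 3,
     "categories": ["CA", "NY", "TX"]},
]
encoded = [0.5, 0.1, 0.9, 0.2, 0.7, 0.1]
row = inverse_transform_row(encoded, column_info)
```

In this example the second mixture component wins the argmax, so the income value is recovered as 0.5 * 4 * 5.0 + 50.0, and the largest one-hot score selects the category "NY".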