Principle:Gretelai Gretel synthetics Tabular Data Transformation
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, GAN, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Tabular data transformation is the process of converting heterogeneous raw tabular data into a normalized numerical representation suitable for training a generative adversarial network.
Description
Real-world tabular datasets contain a mix of continuous numerical columns, categorical (discrete) columns, and datetime columns, each with different ranges, distributions, and cardinalities. GANs require homogeneous numerical tensors as input. Tabular data transformation bridges this gap through a multi-stage pipeline:
- Datetime detection and conversion: Columns that appear to be datetime values are identified (either through explicit metadata or auto-detection) and converted to Unix timestamps via SDV's metadata system.
- Empty column handling: Columns that are entirely NaN are detected and replaced with a constant value (0) during training, then restored to NaN during reverse transformation.
- SDV metadata transformation: The SDV framework applies field-level transformers (FrequencyEncoder, OneHotEncoder, LabelEncoder, BinaryEncoder, FloatFormatter, UnixTimestampEncoder) based on the configured metadata, converting the original DataFrame into a purely numerical intermediate form.
- DataTransformer fitting: Continuous columns are modeled using a Bayesian Gaussian Mixture Model (ClusterBasedNormalizer) that learns the multi-modal distribution and normalizes values to a scalar plus a one-hot component vector. Discrete columns are encoded using either OneHotEncoder (for columns with fewer unique values than a cutoff) or BinaryEncodingTransformer (for high-cardinality columns).
- Categorical column identification: After SDV metadata transformation, columns are classified as categorical or continuous by inspecting their dtype and value patterns (e.g., float columns containing only 0.0 and 1.0 are treated as boolean/categorical).
The result is a compact decoded representation (TrainData) where each column is stored as integer indices or normalized floats, ready to be encoded into the final tensor form consumed by the neural network.
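As a rough illustration of the empty-column and categorical-identification stages, the sketch below uses plain Python lists in place of a DataFrame. The helper names are hypothetical and not part of gretel-synthetics. The classification rule covers only the cases named above (string columns, and float columns holding only 0.0 and 1.0); treating every other column as continuous is an assumption.

```python
import math


def is_nan(value):
    """True only for float NaN; strings and other types are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def find_empty_columns(table):
    """Return names of columns whose values are all NaN."""
    return [name for name, values in table.items()
            if all(is_nan(v) for v in values)]


def fill_empty_columns(table, empty, fill_value=0.0):
    """Replace all-NaN columns with a constant so training can proceed."""
    return {name: [fill_value] * len(values) if name in empty else list(values)
            for name, values in table.items()}


def restore_empty_columns(table, empty):
    """Reverse step: put NaN back into the columns that were empty."""
    return {name: [math.nan] * len(values) if name in empty else list(values)
            for name, values in table.items()}


def is_categorical(values):
    """Assumed rule: strings, or floats limited to 0.0 and 1.0."""
    if all(isinstance(v, str) for v in values):
        return True
    if all(isinstance(v, float) for v in values):
        return set(values) <= {0.0, 1.0}
    return False


# Hypothetical table: column name -> list of values.
table = {
    "flag": [0.0, 1.0, 1.0],
    "price": [1.5, 2.0, 0.0],
    "city": ["Oslo", "Lima", "Oslo"],
    "empty": [math.nan, math.nan, math.nan],
}

empty = find_empty_columns(table)
filled = fill_empty_columns(table, empty)
categorical = [name for name, values in filled.items()
               if name not in empty and is_categorical(values)]
restored = restore_empty_columns(filled, empty)
```

Tracking the empty column names separately is what lets the reverse transformation restore NaN without having to guess which constant columns were originally empty.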
Usage
This transformation pipeline executes automatically when ACTGAN.fit(data) is called. It is the mandatory pre-processing step before GAN training. Understanding this pipeline is essential for:
- Configuring field_types and field_transformers to control how specific columns are handled
- Setting auto_transform_datetimes=True when the dataset contains date/time columns
- Tuning binary_encoder_cutoff to control the threshold between OHE and binary encoding for discrete columns
- Adjusting cbn_sample_size to balance between accuracy and speed of the Bayesian GMM fitting for continuous columns
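A configuration touching these settings might look like the sketch below. The column name and numeric values are illustrative assumptions, not recommended defaults. The commented-out lines assume that ACTGAN accepts these names as constructor keyword arguments, which should be checked against the installed version.

```python
# Hypothetical settings; the parameter names come from the list above.
actgan_kwargs = {
    # Force a numeric-looking column to be treated as categorical.
    "field_types": {"zip_code": {"type": "categorical"}},
    # Let the pipeline detect and convert date/time columns.
    "auto_transform_datetimes": True,
    # Discrete columns with at least this many unique values use binary encoding.
    "binary_encoder_cutoff": 150,
    # Smaller samples fit the Bayesian GMM faster at some cost in accuracy.
    "cbn_sample_size": 100_000,
}

# Assumed usage (requires gretel-synthetics and a pandas DataFrame `df`):
# from gretel_synthetics.actgan import ACTGAN
# model = ACTGAN(**actgan_kwargs)
# model.fit(df)
```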
Theoretical Basis
The transformation follows the mode-specific normalization approach from the CTGAN paper. For each continuous column:
1. Fit a Bayesian Gaussian Mixture Model with up to max_clusters components.
2. Discard components with weight below weight_threshold.
3. For each value x in the column:
   a. Compute the probability of x belonging to each remaining component.
   b. Select the most likely component k.
   c. Normalize x within the selected component, where mu_k and sigma_k are the mean and standard deviation of component k:
      x_normalized = (x - mu_k) / (4 * sigma_k)
   d. Encode the component assignment as a one-hot vector.
4. Output: [normalized_value, component_one_hot_vector]
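Steps 2 to 4 can be sketched in plain Python once the mixture has been fitted. The component parameters below are made-up values standing in for a fitted Bayesian GMM, and the function names are hypothetical. The sketch selects the most likely component, as the steps describe. The CTGAN paper instead samples the component from these probabilities.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def drop_light_components(components, weight_threshold):
    """Step 2: keep only components whose weight reaches the threshold."""
    return [c for c in components if c[0] >= weight_threshold]


def mode_specific_normalize(x, components):
    """Steps 3 and 4 for one value; components are (weight, mu, sigma)."""
    # 3a. Weighted densities, normalized into membership probabilities.
    scores = [w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components]
    total = sum(scores)
    probs = [s / total for s in scores] if total > 0 else scores
    # 3b. Most likely component.
    k = max(range(len(probs)), key=probs.__getitem__)
    # 3c. Normalize within the selected component.
    _, mu_k, sigma_k = components[k]
    normalized = (x - mu_k) / (4 * sigma_k)
    # 3d. One-hot vector for the component assignment.
    one_hot = [1 if i == k else 0 for i in range(len(components))]
    # 4. Scalar followed by the one-hot vector.
    return [normalized] + one_hot


# Hypothetical fitted components: (weight, mean, standard deviation).
components = [(0.6, 0.0, 1.0), (0.39, 10.0, 2.0), (0.001, 50.0, 1.0)]
kept = drop_light_components(components, weight_threshold=0.005)
encoded = mode_specific_normalize(12.0, kept)  # [0.25, 0, 1]
```

Dividing by 4 * sigma_k keeps most values belonging to the selected component within [-1, 1], since roughly all of a Gaussian's mass lies within four standard deviations of its mean.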
For discrete columns, the encoding depends on cardinality:
If num_unique_values < binary_encoder_cutoff:
    Apply OneHotEncoder: output is a vector of length num_unique_values
Else:
    Apply BinaryEncodingTransformer: output is a binary vector of length ceil(log2(num_unique_values))
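The cardinality rule can be illustrated with a small stand-alone function. This is a sketch, not the library's encoder classes: encode_discrete is a hypothetical name, and the zero-based category index used for the binary code is an assumption, so the exact bit patterns and widths produced by BinaryEncodingTransformer may differ.

```python
import math


def encode_discrete(values, binary_encoder_cutoff):
    """One-hot below the cutoff, binary code of the category index otherwise."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    n = len(categories)

    if n < binary_encoder_cutoff:
        # One output position per category.
        return [[1 if index[v] == i else 0 for i in range(n)] for v in values]

    # ceil(log2(n)) bits are enough for indices 0..n-1 (at least one bit).
    width = max(1, math.ceil(math.log2(n)))
    return [[(index[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]


values = ["a", "b", "c", "d", "e"]
one_hot = encode_discrete(values, binary_encoder_cutoff=10)  # 5 columns
binary = encode_discrete(values, binary_encoder_cutoff=3)    # 3 columns
```

With five categories the one-hot form needs five columns and the binary form three. The gap widens quickly: a column with 1,000 unique values needs 1,000 one-hot columns but only 10 binary ones.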
The datetime auto-detection workflow scans each column in the DataFrame, checks if values can be parsed as dates, and if so, creates a UnixTimestampEncoder with the inferred format string:
For each column in data:
    If column values match a datetime pattern:
        field_types[column] = {"type": "datetime", "format": inferred_format}
        field_transformers[column] = UnixTimestampEncoder(datetime_format=inferred_format)
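The detection loop can be approximated with the standard library alone. In the sketch below, the list of candidate formats and all function names are assumptions made for illustration. The actual detection logic may recognize more formats and tolerate partial matches. The timestamp helper treats values as UTC, which is also an assumption.

```python
from datetime import datetime, timezone

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y"]


def infer_datetime_format(values):
    """Return the first format that parses every value, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue  # wrong format, or a non-string value
        return fmt
    return None


def detect_datetime_fields(table):
    """Build field_types entries for columns that look like datetimes."""
    field_types = {}
    for column, values in table.items():
        present = [v for v in values if v is not None]
        if not present:
            continue
        fmt = infer_datetime_format(present)
        if fmt is not None:
            field_types[column] = {"type": "datetime", "format": fmt}
    return field_types


def to_unix_timestamp(value, fmt):
    """Convert a datetime string to seconds since the epoch, assuming UTC."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


# Hypothetical table: column name -> list of values.
table = {
    "signup": ["2024-01-15", "2023-12-31"],
    "age": [31, 45],
    "note": ["hello", "2024-01-01"],
}
detected = detect_datetime_fields(table)
```

Here only signup is detected. The note column is rejected because one of its values does not parse, which shows why a column should only be converted when all of its non-null values match the same format.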