Principle:Gretelai Gretel synthetics Tabular Data Transformation

Knowledge Sources	gretel-synthetics CTGAN
Domains	Synthetic_Data, GAN, Tabular_Data
Last Updated	2026-02-14 19:00 GMT

Overview

Tabular data transformation is the process of converting heterogeneous raw tabular data into a normalized numerical representation suitable for training a generative adversarial network.

Description

Real-world tabular datasets contain a mix of continuous numerical columns, categorical (discrete) columns, and datetime columns, each with different ranges, distributions, and cardinalities. GANs require homogeneous numerical tensors as input. Tabular data transformation bridges this gap through a multi-stage pipeline:

Datetime detection and conversion: Columns that appear to be datetime values are identified (either through explicit metadata or auto-detection) and converted to Unix timestamps via SDV's metadata system.
Empty column handling: Columns that are entirely NaN are detected and replaced with a constant value (0) during training, then restored to NaN during reverse transformation.
SDV metadata transformation: The SDV framework applies field-level transformers (FrequencyEncoder, OneHotEncoder, LabelEncoder, BinaryEncoder, FloatFormatter, UnixTimestampEncoder) based on the configured metadata, converting the original DataFrame into a purely numerical intermediate form.
DataTransformer fitting: Each continuous column is modeled with a Bayesian Gaussian Mixture Model (ClusterBasedNormalizer) that learns its multi-modal distribution and represents every value as a normalized scalar plus a one-hot vector identifying the mixture component. Each discrete column is encoded with OneHotEncoder when it has fewer unique values than binary_encoder_cutoff, or with BinaryEncodingTransformer otherwise (high-cardinality columns).
Categorical column identification: After SDV metadata transformation, columns are classified as categorical or continuous by inspecting their dtype and value patterns (e.g., float columns containing only 0.0 and 1.0 are treated as boolean/categorical).
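
The empty-column and categorical-identification stages can be illustrated with a small standard-library sketch. It is not the gretel-synthetics implementation: the table is a plain dict of lists rather than a DataFrame, and the helper names (is_empty_column, looks_boolean, forward, reverse) are hypothetical.

```python
import math

def is_empty_column(values):
    """True when every value in the column is None or NaN."""
    return all(v is None or (isinstance(v, float) and math.isnan(v)) for v in values)

def looks_boolean(values):
    """True when the column contains only 0.0 and 1.0 (an assumed simplification)."""
    return bool(values) and all(v in (0.0, 1.0) for v in values)

def forward(table):
    """Replace all-NaN columns with the constant 0 and remember their names."""
    empty = [name for name, col in table.items() if is_empty_column(col)]
    filled = {
        name: ([0] * len(col) if name in empty else list(col))
        for name, col in table.items()
    }
    return filled, empty

def reverse(table, empty):
    """Restore the remembered columns to NaN after sampling."""
    return {
        name: ([math.nan] * len(col) if name in empty else list(col))
        for name, col in table.items()
    }

# Hypothetical three-column table.
table = {
    "age": [31.0, 45.5, 27.0],
    "notes": [math.nan, math.nan, math.nan],
    "is_member": [0.0, 1.0, 1.0],
}
filled, empty_columns = forward(table)
restored = reverse(filled, empty_columns)
categorical = [
    name for name, col in filled.items()
    if name not in empty_columns and looks_boolean(col)
]
```

Here "notes" is filled with zeros for training and comes back as NaN, while "is_member" is flagged as categorical because its float values are limited to 0.0 and 1.0.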

The result is a compact decoded representation (TrainData) where each column is stored as integer indices or normalized floats, ready to be encoded into the final tensor form consumed by the neural network.

Usage

This transformation pipeline runs automatically when ACTGAN.fit(data) is called and always precedes GAN training. Understanding it is useful for:

Configuring field_types and field_transformers to control how specific columns are handled
Setting auto_transform_datetimes=True when the dataset contains date/time columns
Tuning binary_encoder_cutoff to control the threshold between OHE and binary encoding for discrete columns
Adjusting cbn_sample_size to balance between accuracy and speed of the Bayesian GMM fitting for continuous columns
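
The options above might be collected as keyword arguments along the following lines. This is only a sketch: the column name and all numeric values are illustrative assumptions rather than library defaults, and the import path in the comments is assumed, so the constructor call is left commented out.

```python
# Hypothetical column name and illustrative values; none are library defaults.
actgan_kwargs = {
    # Explicit metadata for one column, in the field_types shape used on this page.
    "field_types": {"signup_date": {"type": "datetime", "format": "%Y-%m-%d"}},
    # Let the pipeline auto-detect any remaining datetime columns.
    "auto_transform_datetimes": True,
    # Discrete columns with at least this many unique values use binary encoding.
    "binary_encoder_cutoff": 150,
    # Number of rows sampled when fitting the Bayesian GMM per continuous column.
    "cbn_sample_size": 250_000,
}

# Assumed usage (requires gretel-synthetics and a pandas DataFrame named data):
# from gretel_synthetics.actgan import ACTGAN
# model = ACTGAN(**actgan_kwargs)
# model.fit(data)
```

A lower binary_encoder_cutoff shrinks the encoded width of high-cardinality columns, and a smaller cbn_sample_size speeds up mixture fitting at some cost in accuracy.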

Theoretical Basis

The transformation follows the mode-specific normalization approach from the CTGAN paper. For each continuous column:

1. Fit a Bayesian Gaussian Mixture Model with up to max_clusters components.
2. Discard components with weight below weight_threshold.
3. For each value x in the column:
   a. Compute the probability of x belonging to each remaining component.
   b. Sample a component k according to those probabilities (the CTGAN paper samples rather than always taking the most likely component).
   c. Normalize x within the selected component:
      x_normalized = (x - mu_k) / (4 * sigma_k)
   d. Encode the component assignment as a one-hot vector.
4. Output: [normalized_value, component_one_hot_vector]
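
Steps 2 to 4 can be sketched for a single value once the mixture has been fitted. The sketch assumes the component weights, means, and standard deviations are already known (fitting the Bayesian GMM itself is omitted), uses an illustrative weight_threshold, and draws the component at random in proportion to its probability, as the CTGAN paper describes. The function and variable names are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normalize_value(x, components, weight_threshold=0.005, rng=None):
    """Return [normalized_value, *component_one_hot] for one continuous value."""
    rng = rng or random.Random(0)
    # Step 2: discard components whose weight is below the threshold.
    kept = [c for c in components if c["weight"] >= weight_threshold]
    # Step 3a: probability of x under each remaining component.
    # The small constant guards against all densities underflowing to zero.
    densities = [c["weight"] * gaussian_pdf(x, c["mu"], c["sigma"]) + 1e-12 for c in kept]
    total = sum(densities)
    probs = [d / total for d in densities]
    # Step 3b: pick a component in proportion to its probability.
    k = rng.choices(range(len(kept)), weights=probs)[0]
    # Step 3c: normalize within the selected component.
    scalar = (x - kept[k]["mu"]) / (4 * kept[k]["sigma"])
    # Step 3d: one-hot encoding of the component assignment.
    one_hot = [1 if i == k else 0 for i in range(len(kept))]
    return [scalar] + one_hot

# Hypothetical fitted mixture; the third component falls below the threshold.
COMPONENTS = [
    {"weight": 0.399, "mu": 10.0, "sigma": 2.0},
    {"weight": 0.600, "mu": 50.0, "sigma": 5.0},
    {"weight": 0.001, "mu": 90.0, "sigma": 1.0},
]
result = normalize_value(52.0, COMPONENTS, rng=random.Random(0))
```

For x = 52.0 the second component dominates, so the normalized scalar is (52 - 50) / (4 * 5) = 0.1 and the one-hot vector has two entries because the low-weight component was discarded.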

For discrete columns, the encoding depends on cardinality:

If num_unique_values < binary_encoder_cutoff:
    Apply OneHotEncoder: output is a vector of length num_unique_values
Else:
    Apply BinaryEncodingTransformer: output is a binary vector of length approximately ceil(log2(num_unique_values))
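
The cardinality rule can be sketched in plain Python as follows. This is an illustration of the two encodings, not the library's OneHotEncoder or BinaryEncodingTransformer: the real binary transformer may reserve extra codes (for example for missing values), so its output width may differ from this sketch. All function names are hypothetical.

```python
import math

def one_hot(index, n):
    """Vector of length n with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(n)]

def binary_code(index, n):
    """Big-endian bits of index, using ceil(log2(n)) bits (at least one)."""
    width = max(1, math.ceil(math.log2(n)))
    return [(index >> bit) & 1 for bit in reversed(range(width))]

def encode_column(values, binary_encoder_cutoff):
    """One-hot encode low-cardinality columns, binary encode the rest."""
    categories = sorted(set(values))
    index_of = {category: i for i, category in enumerate(categories)}
    n = len(categories)
    if n < binary_encoder_cutoff:
        return [one_hot(index_of[v], n) for v in values]
    return [binary_code(index_of[v], n) for v in values]

colors = ["red", "green", "blue", "green"]
as_one_hot = encode_column(colors, binary_encoder_cutoff=10)  # 3 < 10
as_binary = encode_column(colors, binary_encoder_cutoff=3)    # 3 is not < 3
```

With three categories the one-hot rows have length 3 and the binary rows have length 2; the saving grows with cardinality, since 1,000 categories need 1,000 one-hot positions but only 10 bits.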

The datetime auto-detection workflow scans each column in the DataFrame and checks whether its values can be parsed as dates. If they can, it records a datetime field type and creates a UnixTimestampEncoder, both using the inferred format string:

For each column in data:
    If column values match a datetime pattern:
        field_types[column] = {"type": "datetime", "format": inferred_format}
        field_transformers[column] = UnixTimestampEncoder(datetime_format=inferred_format)
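
A minimal stand-in for this workflow, using only the standard library, is shown below. The candidate format list, the helper names, and the dict-of-lists table are assumptions made for illustration; the actual detection logic and the UnixTimestampEncoder are not reproduced here, and to_unix only mimics the conversion to a Unix timestamp (interpreting values as UTC).

```python
from datetime import datetime, timezone

# Assumed list of formats to try; real detection may cover many more.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S"]

def infer_format(values):
    """Return the first format that parses every value, or None."""
    if not values:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            for v in values:
                datetime.strptime(v, fmt)
        except (ValueError, TypeError):
            continue  # wrong format, or a non-string value
        return fmt
    return None

def to_unix(value, fmt):
    """Convert a datetime string to seconds since the Unix epoch (UTC assumed)."""
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).timestamp()

def detect_datetimes(table):
    """Build field_types entries for every column that parses as datetime."""
    field_types = {}
    for name, col in table.items():
        fmt = infer_format(col)
        if fmt is not None:
            field_types[name] = {"type": "datetime", "format": fmt}
    return field_types

# Hypothetical table with one datetime-like column.
table = {
    "signup": ["2024-01-15", "2024-02-01"],
    "city": ["Oslo", "Lima"],
    "amount": [1.5, 2.0],
}
detected = detect_datetimes(table)
one_day = to_unix("1970-01-02", "%Y-%m-%d")
```

Only "signup" is detected, and one day after the epoch converts to 86400.0 seconds.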

Related Pages

Implemented By

Implementation:Gretelai_Gretel_synthetics_ACTGAN_Fit
