Heuristic:Sdv dev SDV CTGAN Column Performance
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Synthetic_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Performance optimization: preprocess high-cardinality discrete columns before using CTGANSynthesizer to avoid generating more than 1000 one-hot encoded columns.
Description
CTGANSynthesizer internally one-hot encodes discrete (categorical) columns. When a discrete column has many unique values (e.g., ZIP codes, product IDs), the one-hot encoding generates one column per unique value. If the total number of generated columns exceeds 1000, training becomes extremely slow and memory-intensive. SDV emits a PerformanceAlert when this threshold is exceeded, listing the column-to-generated-column mapping.
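To make the column arithmetic concrete, here is a minimal sketch that counts how many one-hot columns each discrete column would contribute and compares the total with the 1000-column threshold. The function names and `ALERT_THRESHOLD` constant are hypothetical, not part of SDV, and the count covers discrete columns only, so it approximates rather than reproduces SDV's internal estimate.

```python
ALERT_THRESHOLD = 1000  # threshold stated in this article


def estimate_one_hot_columns(data, discrete_columns):
    """Return {column: number of distinct values} for each discrete column.

    `data` can be any mapping from column name to an iterable of values
    (a dict of lists, or a pandas DataFrame). With pandas, missing values
    may be counted more reliably via `data[name].nunique(dropna=False)`.
    """
    return {name: len(set(data[name])) for name in discrete_columns}


def exceeds_threshold(widths, threshold=ALERT_THRESHOLD):
    """True when the summed one-hot width is strictly above the threshold."""
    return sum(widths.values()) > threshold


if __name__ == "__main__":
    # Toy table: 1200 distinct ZIP codes and 2 distinct states.
    table = {
        "zip": [str(i).zfill(5) for i in range(1200)],
        "state": ["CA", "NY"] * 600,
    }
    widths = estimate_one_hot_columns(table, ["zip", "state"])
    print(widths, exceeds_threshold(widths))
```

Running a check like this before fitting shows which columns dominate the total and are the best candidates for re-encoding or dropping.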
Usage
Apply this heuristic when using CTGANSynthesizer or CopulaGANSynthesizer on datasets that contain high-cardinality categorical columns. If SDV prints a PerformanceAlert during `preprocess()` or `fit()`, follow the recommendations below.
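Because the alert is printed rather than raised, a script cannot catch it as an exception. One possible way to detect it programmatically is to capture standard output around the call. The helper below is a hypothetical sketch using only the standard library; it assumes the alert text still starts with `PerformanceAlert`, as in the excerpt under Code Evidence.

```python
import contextlib
import io


def run_and_detect_alert(func, *args, **kwargs):
    """Call `func` and report whether it printed a PerformanceAlert.

    Returns (result, alert_seen, captured_text). Everything written to
    stdout during the call is captured, so re-print `captured_text`
    afterwards if the output should still be visible.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    text = buffer.getvalue()
    return result, "PerformanceAlert" in text, text
```

With an SDV synthesizer this might be used as `_, alerted, text = run_and_detect_alert(synthesizer.preprocess, data)`, followed by re-encoding or dropping columns when `alerted` is true.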
The Insight (Rule of Thumb)
- Action: Preprocess high-cardinality discrete columns using `update_transformers` to assign a different encoding (e.g., `LabelEncoder` instead of one-hot), or drop columns that do not need to be modeled.
- Threshold: The alert triggers when total generated columns exceed 1000.
- Trade-off: Dropping or re-encoding columns may reduce fidelity for those specific columns, but dramatically improves training speed and memory usage.
- Alternative: Consider using GaussianCopulaSynthesizer instead, which does not one-hot encode and handles high-cardinality columns natively.
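The re-encoding action above could be sketched as follows. This is an illustration under assumptions, not the only way to apply the heuristic: the helper names and the `max_unique=100` cutoff are hypothetical, and the caller is assumed to supply a metadata object that matches the data. The SDV and RDT imports are done lazily so the selection helper can be used without either library installed.

```python
def high_cardinality_columns(data, discrete_columns, max_unique=100):
    """Names of discrete columns with more than `max_unique` distinct values.

    `max_unique` is an illustrative cutoff, not an SDV setting.
    """
    return [name for name in discrete_columns if len(set(data[name])) > max_unique]


def build_ctgan_with_label_encoding(data, metadata, columns_to_reencode):
    """Create a CTGANSynthesizer whose listed columns use LabelEncoder.

    `data` is expected to be a pandas DataFrame and `metadata` the SDV
    metadata describing it.
    """
    # Lazy imports: only needed when a synthesizer is actually built.
    from rdt.transformers import LabelEncoder
    from sdv.single_table import CTGANSynthesizer

    synthesizer = CTGANSynthesizer(metadata)
    # Transformers must be assigned before they can be updated.
    synthesizer.auto_assign_transformers(data)
    synthesizer.update_transformers(
        column_name_to_transformer={
            name: LabelEncoder() for name in columns_to_reencode
        }
    )
    return synthesizer
```

A typical flow would be to call `high_cardinality_columns` first, pass its result to `build_ctgan_with_label_encoding`, and then call `fit(data)` on the returned synthesizer. Label-encoded columns are modeled as numbers, which is the fidelity trade-off noted above.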
Reasoning
CTGAN uses a conditional generator architecture that requires one-hot encoding of all discrete columns. A column with N unique values therefore becomes N columns in the transformed data. Training a GAN on a matrix with thousands of columns is computationally expensive, because the generator's output layer and the discriminator's input layer both grow with the width of the transformed data, and it can lead to mode collapse. The 1000-column threshold is an empirically chosen cutoff where training time becomes impractical on typical hardware.
Code Evidence
PerformanceAlert from `sdv/single_table/ctgan.py:277-286`:
    print(
        'PerformanceAlert: Using the CTGANSynthesizer on this data is not recommended. '
        'To model this data, CTGAN will generate a large number of columns.'
        '\n\n'
        f'{generated_columns_str}'
        '\n\n'
        'We recommend preprocessing discrete columns that can have many values, '
        "using 'update_transformers'. Or you may drop columns that are not necessary "
        'to model. (Exit this script using ctrl-C)'
    )
OneHotEncoder warning from `sdv/single_table/copulas.py:173-176`:
    if isinstance(transformer, OneHotEncoder):
        warnings.warn(
            f"Using a OneHotEncoder transformer for column '{column}' "
            'may slow down the preprocessing and modeling times.'
        )