Heuristic:Sdv dev SDV CTGAN Column Performance

From Leeroopedia
Knowledge Sources
Domains Optimization, Synthetic_Data
Last Updated 2026-02-14 19:00 GMT

Overview

Performance optimization: preprocess high-cardinality discrete columns before using CTGANSynthesizer to avoid generating more than 1000 one-hot encoded columns.

Description

CTGANSynthesizer internally one-hot encodes discrete (categorical) columns. When a discrete column has many unique values (e.g., ZIP codes, product IDs), the one-hot encoding generates one column per unique value. If the total number of generated columns exceeds 1000, training becomes very slow and memory-intensive. SDV prints a PerformanceAlert message when this threshold is exceeded, listing each original column alongside the number of columns it is expected to generate.
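The column count described above can be approximated before handing data to SDV. The following is a minimal sketch in plain Python, not SDV's own estimator: it counts distinct values per discrete column and compares the total with the 1000-column threshold. The helper names and the toy data are hypothetical, and continuous columns are ignored here for simplicity.

```python
from typing import Dict, Iterable, Mapping

ALERT_THRESHOLD = 1000  # total generated columns, per the alert described above


def estimate_one_hot_columns(columns: Mapping[str, Iterable]) -> Dict[str, int]:
    """Return the number of distinct values per discrete column.

    Under plain one-hot encoding, each distinct value becomes one column.
    """
    return {name: len(set(values)) for name, values in columns.items()}


def exceeds_threshold(estimates: Mapping[str, int],
                      threshold: int = ALERT_THRESHOLD) -> bool:
    """Return True when the total estimated column count is over the threshold."""
    return sum(estimates.values()) > threshold


# Hypothetical toy data: one low-cardinality and one high-cardinality column.
discrete_columns = {
    'plan': ['free', 'pro', 'enterprise'] * 500,
    'zip_code': [f'{i:05d}' for i in range(1500)],
}

estimates = estimate_one_hot_columns(discrete_columns)

# List the widest columns first, since they are the candidates for re-encoding.
for name, count in sorted(estimates.items(), key=lambda item: -item[1]):
    print(f'{name:<10} {count}')
print('Over threshold:', exceeds_threshold(estimates))
```

With a pandas DataFrame, the same per-column counts would normally come from `nunique()` on the discrete columns.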

Usage

Apply this heuristic when using CTGANSynthesizer or CopulaGANSynthesizer on datasets that contain high-cardinality categorical columns (columns with many unique values). If SDV prints a PerformanceAlert during `preprocess()` or `fit()`, follow the recommendations below.

The Insight (Rule of Thumb)

  • Action: Preprocess high-cardinality discrete columns using `update_transformers` to use a different encoding (e.g., LabelEncoder instead of one-hot), or drop columns that are not necessary to model.
  • Threshold: The alert triggers when total generated columns exceed 1000.
  • Trade-off: Dropping or re-encoding columns may reduce fidelity for those specific columns, but dramatically improves training speed and memory usage.
  • Alternative: Consider using GaussianCopulaSynthesizer instead, which does not one-hot encode categorical columns by default and is therefore less sensitive to high-cardinality columns.
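The re-encoding action in the list above might look like the sketch below. It assumes a recent SDV 1.x install together with RDT, an already-built metadata object, and a caller-supplied list of high-cardinality column names; the helper name is hypothetical. The SDV and RDT imports sit inside the function so that the helper can be defined even where those packages are not installed.

```python
def build_ctgan_with_label_encoding(data, metadata, high_cardinality_columns):
    """Return an unfitted CTGANSynthesizer whose listed columns use LabelEncoder.

    data: pandas DataFrame with the training data.
    metadata: SDV metadata object describing `data` (assumed to exist already).
    high_cardinality_columns: names of the columns to re-encode (caller's choice).
    """
    # Imported lazily; both packages are assumed to be installed at call time.
    from rdt.transformers import LabelEncoder
    from sdv.single_table import CTGANSynthesizer

    synthesizer = CTGANSynthesizer(metadata)

    # Transformers must be assigned before they can be updated.
    synthesizer.auto_assign_transformers(data)

    # One fresh LabelEncoder per column, so each column maps to a single
    # integer-coded column instead of one column per category.
    synthesizer.update_transformers({
        column: LabelEncoder() for column in high_cardinality_columns
    })
    return synthesizer
```

A caller would then run `synthesizer.fit(data)` as usual. Integer codes impose an artificial ordering on the categories, which is one source of the fidelity trade-off noted above.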

Reasoning

CTGAN uses a conditional generator architecture that one-hot encodes every discrete column. A column with N unique values therefore becomes N columns in the transformed data. Training a GAN on a matrix with thousands of columns is computationally expensive, because the generator's output layer and the discriminator's input layer both grow with the number of columns, and it can also make training less stable (for example, through mode collapse). The 1000-column threshold is the cutoff SDV uses for the alert; it marks roughly where training time becomes impractical on typical hardware.

Code Evidence

PerformanceAlert from `sdv/single_table/ctgan.py:277-286`:

print(
    'PerformanceAlert: Using the CTGANSynthesizer on this data is not recommended. '
    'To model this data, CTGAN will generate a large number of columns.'
    '\n\n'
    f'{generated_columns_str}'
    '\n\n'
    'We recommend preprocessing discrete columns that can have many values, '
    "using 'update_transformers'. Or you may drop columns that are not necessary "
    'to model. (Exit this script using ctrl-C)'
)

OneHotEncoder warning from `sdv/single_table/copulas.py:173-176`:

if isinstance(transformer, OneHotEncoder):
    warnings.warn(
        f"Using a OneHotEncoder transformer for column '{column}' "
        'may slow down the preprocessing and modeling times.'
    )
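The two excerpts above use different mechanisms: the PerformanceAlert is a plain `print`, while the OneHotEncoder message goes through `warnings.warn`. The second can therefore be captured with the standard `warnings` module, as in this minimal sketch. The `preprocess_with_one_hot` function is a hypothetical stand-in that only reproduces the warning text and is not SDV code.

```python
import warnings


def preprocess_with_one_hot(column):
    """Hypothetical stand-in that emits a warning like the one quoted above."""
    warnings.warn(
        f"Using a OneHotEncoder transformer for column '{column}' "
        'may slow down the preprocessing and modeling times.'
    )


def collect_one_hot_warnings(func, *args):
    """Run func and return the messages of any OneHotEncoder warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # do not let repeated warnings be suppressed
        func(*args)
    return [str(w.message) for w in caught if 'OneHotEncoder' in str(w.message)]


messages = collect_one_hot_warnings(preprocess_with_one_hot, 'zip_code')
print(messages)
```

Because the PerformanceAlert is printed rather than issued as a warning, detecting it programmatically would require capturing standard output instead.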
