Heuristic: gretelai/gretel-synthetics Binary Encoder Cutoff
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Tabular_Data, Time_Series |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Memory optimization technique that switches from one-hot encoding to binary encoding for discrete columns exceeding a cardinality threshold, reducing the encoded dimension from N to ceil(log2(N)).
Description
When encoding discrete (categorical) columns for GAN training, one-hot encoding creates a vector whose length equals the number of unique values, N. For high-cardinality columns (e.g., ZIP codes, product IDs), this can cause massive memory usage and slow training. Binary encoding represents the same information in ceil(log2(N)) bits instead of N, dramatically reducing the encoded dimension. The tradeoff is a slight loss in expressiveness: binary encoding introduces an implicit ordinal relationship between categories that does not exist in the original data.
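To make the encoding concrete, here is a minimal illustrative sketch of binary encoding for a discrete column (this is not Gretel's internal implementation; the function name and approach are for illustration only):

```python
import math

def binary_encode(values):
    """Binary-encode a discrete column: map each unique value to an
    integer index, then emit that index as a fixed-width bit vector.
    Illustrative sketch only, not Gretel's internal implementation."""
    categories = sorted(set(values))
    width = max(1, math.ceil(math.log2(len(categories))))
    index = {cat: i for i, cat in enumerate(categories)}
    return [[(index[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]

# A column with 5 unique values needs ceil(log2(5)) = 3 bits per row,
# versus a 5-dimensional one-hot vector.
encoded = binary_encode(["a", "b", "c", "d", "e", "a"])
```

Note how rows sharing bit patterns (e.g., "a" and "e" both start with their index's high bits) acquire the implicit structure mentioned above, which one-hot encoding avoids.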
Usage
Use this heuristic when working with tabular datasets containing high-cardinality discrete columns. Both ACTGAN and DGAN support a `binary_encoder_cutoff` parameter. If a discrete column has more unique values than this cutoff, it will automatically use binary encoding instead of one-hot encoding. Adjust the cutoff based on your memory constraints and the cardinality of your columns.
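As a hedged sketch of how the cutoff might be set for DGAN: the module path and the other constructor arguments below are assumptions based on the repository layout implied by the file paths in the Code Evidence section, so verify against your installed gretel-synthetics version.

```python
# Assumed import path and constructor arguments; check your installed
# gretel-synthetics version's API before relying on this.
from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=20,        # assumed required sequence arguments
    sample_len=5,
    binary_encoder_cutoff=100,  # binary-encode columns with >100 unique values
)
```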
The Insight (Rule of Thumb)
- Action: Set `binary_encoder_cutoff` parameter when initializing ACTGAN or DGAN models.
- Value: ACTGAN default = 500, DGAN default = 150. Lower values save more memory but may reduce fidelity for moderate-cardinality columns.
- Trade-off: Binary encoding reduces dimensionality from N to ceil(log2(N)), saving memory and speeding up training, but introduces an implicit ordinal structure that does not exist in the original categorical data.
- Guideline: If a column has more than 500 unique values, binary encoding is almost always beneficial. Between 150 and 500, weigh dataset size against available memory.
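The rule of thumb above reduces to a one-line decision; `choose_encoder` is a hypothetical helper for illustration, not part of Gretel's API:

```python
def choose_encoder(n_unique: int, cutoff: int = 500) -> str:
    """Mirror the cutoff rule: one-hot at or below the cutoff, binary
    encoding above it. The default cutoff follows ACTGAN (500); pass
    cutoff=150 for the DGAN default."""
    return "binary" if n_unique > cutoff else "one-hot"
```

For example, a 300-category column stays one-hot under the ACTGAN default but switches to binary encoding under the DGAN default.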
Reasoning
One-hot encoding a column with 1000 unique values creates a 1000-dimensional vector per row. Binary encoding represents the same column in ceil(log2(1000)) = 10 dimensions. For a dataset with multiple high-cardinality columns, the difference in memory usage and training speed is dramatic. The ACTGAN model defaults to a higher cutoff (500) because its architecture uses conditional vectors that work best with one-hot encoded discrete columns. The DGAN model uses a lower cutoff (150) because time series data often has fewer high-cardinality columns and the LSTM-based architecture is more memory-sensitive.
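The arithmetic behind the memory claim can be checked directly (`encoded_dims` is a hypothetical helper for illustration):

```python
import math

def encoded_dims(n_unique: int) -> tuple[int, int]:
    """Per-row encoded width: one-hot uses n_unique dimensions,
    binary encoding uses ceil(log2(n_unique))."""
    return n_unique, max(1, math.ceil(math.log2(n_unique)))

# A 1000-category column: 1000 one-hot dims vs 10 binary dims per row,
# a 100x reduction before even multiplying by the number of rows.
one_hot, binary = encoded_dims(1000)
```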
Code Evidence
ACTGAN `binary_encoder_cutoff` default, from `actgan/actgan_wrapper.py:357`:

```python
binary_encoder_cutoff: int = 500,
```

DGAN `binary_encoder_cutoff` default, from `timeseries_dgan/config.py:110`:

```python
binary_encoder_cutoff: int = 150
```

DGAN config docstring explaining the purpose, from `timeseries_dgan/config.py:60-62`:

```text
binary_encoder_cutoff: use binary encoder (instead of one hot encoder) for
any column with more than this many unique values. This helps reduce memory
consumption for datasets with a lot of unique values.
```