Heuristic: gretelai/gretel-synthetics Binary Encoder Cutoff
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Tabular_Data, Time_Series |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Memory optimization technique that switches from one-hot encoding to binary encoding for discrete columns exceeding a cardinality threshold, reducing the encoded dimension from N to ceil(log2(N)).
Description
When encoding discrete (categorical) columns for GAN training, one-hot encoding creates a vector whose length equals the number of unique values, N. For high-cardinality columns (e.g., ZIP codes, product IDs), this can cause massive memory usage and slow training. Binary encoding represents the same information in ceil(log2(N)) bits instead of N, dramatically reducing the encoded dimension. The tradeoff is a slight loss in expressiveness: binary encoding introduces an implicit ordinal relationship between categories that does not exist in the original data.
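To make the encoding concrete, here is a minimal illustrative sketch of binary encoding for a discrete column (this is not Gretel's internal implementation; the function name and approach are for illustration only):

```python
import math

def binary_encode(values):
    """Binary-encode a discrete column: map each unique value to an
    integer index, then emit that index as a fixed-width bit vector.
    Illustrative sketch only, not Gretel's internal implementation."""
    categories = sorted(set(values))
    width = max(1, math.ceil(math.log2(len(categories))))
    index = {cat: i for i, cat in enumerate(categories)}
    return [[(index[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]

# A column with 5 unique values needs ceil(log2(5)) = 3 bits per row,
# versus a 5-dimensional one-hot vector.
encoded = binary_encode(["a", "b", "c", "d", "e", "a"])
```

Note how rows sharing bit patterns (e.g., "a" and "e" both start with their index's high bits) acquire the implicit structure mentioned above, which one-hot encoding avoids.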
Usage
Use this heuristic when working with tabular datasets containing high-cardinality discrete columns. Both ACTGAN and DGAN support a `binary_encoder_cutoff` parameter. If a discrete column has more unique values than this cutoff, it will automatically use binary encoding instead of one-hot encoding. Adjust the cutoff based on your memory constraints and the cardinality of your columns.
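As a hedged sketch of how the cutoff might be set for DGAN: the module path and the other constructor arguments below are assumptions based on the repository layout implied by the file paths in the Code Evidence section, so verify against your installed gretel-synthetics version.

```python
# Assumed import path and constructor arguments; check your installed
# gretel-synthetics version's API before relying on this.
from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=20,        # assumed required sequence arguments
    sample_len=5,
    binary_encoder_cutoff=100,  # binary-encode columns with >100 unique values
)
```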
The Insight (Rule of Thumb)
- Action: Set `binary_encoder_cutoff` parameter when initializing ACTGAN or DGAN models.
- Value: ACTGAN default = 500, DGAN default = 150. Lower values save more memory but may reduce fidelity for moderate-cardinality columns.
- Trade-off: Binary encoding reduces dimensionality from N to ceil(log2(N)), saving memory and speeding up training, but introduces an implicit ordinal structure that does not exist in the original categorical data.
- Guideline: If a column has more than 500 unique values, binary encoding is almost always beneficial. Between 150 and 500, weigh dataset size against available memory.
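The rule of thumb above reduces to a one-line decision; `choose_encoder` is a hypothetical helper for illustration, not part of Gretel's API:

```python
def choose_encoder(n_unique: int, cutoff: int = 500) -> str:
    """Mirror the cutoff rule: one-hot at or below the cutoff, binary
    encoding above it. The default cutoff follows ACTGAN (500); pass
    cutoff=150 for the DGAN default."""
    return "binary" if n_unique > cutoff else "one-hot"
```

For example, a 300-category column stays one-hot under the ACTGAN default but switches to binary encoding under the DGAN default.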
Reasoning
One-hot encoding a column with 1000 unique values creates a 1000-dimensional vector per row. Binary encoding represents the same column in ceil(log2(1000)) = 10 dimensions. For a dataset with multiple high-cardinality columns, the difference in memory usage and training speed is dramatic. The ACTGAN model defaults to a higher cutoff (500) because its architecture uses conditional vectors that work best with one-hot encoded discrete columns. The DGAN model uses a lower cutoff (150) because time series data often has fewer high-cardinality columns and the LSTM-based architecture is more memory-sensitive.
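The arithmetic behind the memory claim can be checked directly (`encoded_dims` is a hypothetical helper for illustration):

```python
import math

def encoded_dims(n_unique: int) -> tuple[int, int]:
    """Per-row encoded width: one-hot uses n_unique dimensions,
    binary encoding uses ceil(log2(n_unique))."""
    return n_unique, max(1, math.ceil(math.log2(n_unique)))

# A 1000-category column: 1000 one-hot dims vs 10 binary dims per row,
# a 100x reduction before even multiplying by the number of rows.
one_hot, binary = encoded_dims(1000)
```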
Code Evidence
ACTGAN `binary_encoder_cutoff` default, from `actgan/actgan_wrapper.py:357`:

```python
binary_encoder_cutoff: int = 500,
```

DGAN `binary_encoder_cutoff` default, from `timeseries_dgan/config.py:110`:

```python
binary_encoder_cutoff: int = 150
```

DGAN config docstring explaining the purpose, from `timeseries_dgan/config.py:60-62`:

```text
binary_encoder_cutoff: use binary encoder (instead of one hot encoder) for
any column with more than this many unique values. This helps reduce memory
consumption for datasets with a lot of unique values.
```