Heuristic: gretelai/gretel-synthetics Memory Chunking for Normalization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Memory optimization that chunks data into blocks of 131,072 rows during cluster-based normalization to avoid 30x memory amplification from probability prediction.
Description
The `ClusterBasedNormalizer` in ACTGAN preprocesses continuous columns using a Bayesian Gaussian Mixture model. During the transform step, probability prediction temporarily requires roughly 30x the memory of the input data column (for 10 clusters). Processing an entire large column at once can cause out-of-memory errors or severe memory pressure. The solution is to split the input into chunks of `_MAX_CHUNK = 131,072` rows and process each chunk individually. Additionally, sample selection uses vectorized NumPy expressions instead of a Python for loop for better performance.
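The chunking pattern described above can be sketched as follows. This is a minimal illustration, not the actual ACTGAN code: `transform_chunked` and the `predict_proba` callable are hypothetical stand-ins for the library's internals, but the constant matches the quoted source.

```python
import numpy as np

_MAX_CHUNK = 131072  # rows processed per block, as in actgan/transformers.py


def transform_chunked(data: np.ndarray, predict_proba) -> np.ndarray:
    """Apply a memory-hungry probability prediction in fixed-size chunks.

    `predict_proba` maps an (n,) array to an (n, n_clusters) probability
    matrix. Chunking bounds the size of that temporary matrix: only
    _MAX_CHUNK rows' worth of probabilities exist at any one time.
    """
    outputs = []
    for start in range(0, len(data), _MAX_CHUNK):
        chunk = data[start:start + _MAX_CHUNK]
        outputs.append(predict_proba(chunk))
    # Stitch the per-chunk results back into one array.
    return np.concatenate(outputs, axis=0)
```

The result is identical to calling `predict_proba` on the whole column at once; only the peak size of the intermediate probability matrix changes.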
Usage
This optimization is applied automatically within the ACTGAN data transformation pipeline. It is most impactful when working with large datasets (> 100K rows) with continuous columns. No user configuration is needed; the chunk size is hardcoded.
The Insight (Rule of Thumb)
- Action: Data is automatically chunked into blocks of 131,072 rows during cluster-based normalization.
- Value: `_MAX_CHUNK = 131072` rows per chunk. Probability prediction uses ~30x memory of input column.
- Trade-off: Slightly more NumPy function call overhead due to chunking, but the impact is negligible compared to the memory savings. Vectorized sampling replaces per-element for loops.
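The vectorized sample selection mentioned above can be sketched with inverse-CDF sampling. This is an illustrative stand-in, not the library's exact expression: `sample_clusters` is a hypothetical helper that replaces a per-row `np.random.choice` loop with one batched NumPy operation.

```python
import numpy as np


def sample_clusters(probs: np.ndarray, rng=None) -> np.ndarray:
    """Vectorized per-row categorical sampling.

    probs: (n_rows, n_clusters) array of row-normalized probabilities.
    Returns one sampled cluster index per row. Equivalent to calling
    np.random.choice(n_clusters, p=probs[i]) in a Python for loop,
    but executed as a handful of array operations.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Inverse-CDF sampling: compare one uniform draw per row against
    # that row's cumulative distribution; argmax finds the first True.
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    return (u < cdf).argmax(axis=1)
```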
Reasoning
For a dataset with 1 million rows and a float64 column (8 bytes/value), the raw data is ~8MB. Probability prediction for 10 Gaussian mixture clusters temporarily creates a (1M x 10) probability matrix plus intermediate arrays, totaling roughly 240MB (30x amplification). By processing 131,072 rows at a time, peak memory for the probability matrix drops to ~31MB. The additional overhead from multiple NumPy calls is negligible compared to the memory savings, especially on memory-constrained systems.
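The arithmetic above can be checked with a back-of-envelope calculation, assuming float64 values, decimal megabytes, and the ~30x amplification factor stated in the docstring:

```python
BYTES_PER_FLOAT64 = 8
AMPLIFICATION = 30  # ~30x temporary memory for probability prediction (10 clusters)


def peak_mb(n_rows: int) -> float:
    """Approximate peak memory (MB) of probability prediction for a column."""
    return n_rows * BYTES_PER_FLOAT64 * AMPLIFICATION / 1e6


full = peak_mb(1_000_000)   # whole 1M-row column at once: ~240 MB
chunked = peak_mb(131_072)  # one _MAX_CHUNK block: ~31 MB
```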
Code Evidence
Chunk size constant from `actgan/transformers.py:27`:

```python
_MAX_CHUNK = 131072
```
Class docstring explaining the optimization from `actgan/transformers.py:30-41`:

```python
class ClusterBasedNormalizer(RDTClusterBasedNormalizer):
    """A version of the ClusterBasedNormalizer with improved performance.

    This makes two changes to RDT's version of the ClusterBasedNormalizer:

    - To reduce memory pressure, input is split into chunks of size `_MAX_CHUNK_SIZE`,
      which are then processed individually. This is because probability prediction
      temporarily requires roughly 30x the memory of the data column (for 10 clusters),
      which can add significant memory pressure. This also means more NumPy calls, but
      the impact of this is negligible.
    - Instead of a for loop for sample selection, sample selection is done via a vectorized
      NumPy expression.
    """
```