Heuristic: gretelai/gretel-synthetics Memory Chunking for Normalization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Memory optimization that chunks data into blocks of 131,072 rows during cluster-based normalization to avoid 30x memory amplification from probability prediction.
Description
The `ClusterBasedNormalizer` in ACTGAN preprocesses continuous columns using a Bayesian Gaussian Mixture model. During the transform step, probability prediction temporarily requires roughly 30x the memory of the input data column (for 10 clusters). Processing an entire large column at once can cause out-of-memory errors or severe memory pressure. The solution is to split the input into chunks of `_MAX_CHUNK = 131,072` rows and process each chunk individually. Additionally, sample selection uses vectorized NumPy expressions instead of a Python for loop for better performance.
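The chunking pattern described above can be sketched as follows. This is a minimal illustration, not the actual ACTGAN code: `transform_chunked` and the `predict_proba` callable are hypothetical stand-ins for the library's internals, but the constant matches the quoted source.

```python
import numpy as np

_MAX_CHUNK = 131072  # rows processed per block, as in actgan/transformers.py


def transform_chunked(data: np.ndarray, predict_proba) -> np.ndarray:
    """Apply a memory-hungry probability prediction in fixed-size chunks.

    `predict_proba` maps an (n,) array to an (n, n_clusters) probability
    matrix. Chunking bounds the size of that temporary matrix: only
    _MAX_CHUNK rows' worth of probabilities exist at any one time.
    """
    outputs = []
    for start in range(0, len(data), _MAX_CHUNK):
        chunk = data[start:start + _MAX_CHUNK]
        outputs.append(predict_proba(chunk))
    # Stitch the per-chunk results back into one array.
    return np.concatenate(outputs, axis=0)
```

The result is identical to calling `predict_proba` on the whole column at once; only the peak size of the intermediate probability matrix changes.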
Usage
This optimization is applied automatically within the ACTGAN data transformation pipeline. It is most impactful when working with large datasets (> 100K rows) with continuous columns. No user configuration is needed; the chunk size is hardcoded.
The Insight (Rule of Thumb)
- Action: Data is automatically chunked into blocks of 131,072 rows during cluster-based normalization.
- Value: `_MAX_CHUNK = 131072` rows per chunk. Probability prediction uses ~30x memory of input column.
- Trade-off: Slightly more NumPy function call overhead due to chunking, but the impact is negligible compared to the memory savings. Vectorized sampling replaces per-element for loops.
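The vectorized sample selection mentioned above can be sketched with inverse-CDF sampling. This is an illustrative stand-in, not the library's exact expression: `sample_clusters` is a hypothetical helper that replaces a per-row `np.random.choice` loop with one batched NumPy operation.

```python
import numpy as np


def sample_clusters(probs: np.ndarray, rng=None) -> np.ndarray:
    """Vectorized per-row categorical sampling.

    probs: (n_rows, n_clusters) array of row-normalized probabilities.
    Returns one sampled cluster index per row. Equivalent to calling
    np.random.choice(n_clusters, p=probs[i]) in a Python for loop,
    but executed as a handful of array operations.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Inverse-CDF sampling: compare one uniform draw per row against
    # that row's cumulative distribution; argmax finds the first True.
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    return (u < cdf).argmax(axis=1)
```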
Reasoning
For a dataset with 1 million rows and a float64 column (8 bytes/value), the raw data is ~8MB. Probability prediction for 10 Gaussian mixture clusters temporarily creates a (1M x 10) probability matrix plus intermediate arrays, totaling roughly 240MB (30x amplification). By processing 131,072 rows at a time, peak memory for the probability matrix drops to ~31MB. The additional overhead from multiple NumPy calls is negligible compared to the memory savings, especially on memory-constrained systems.
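The arithmetic above can be checked with a back-of-envelope calculation, assuming float64 values, decimal megabytes, and the ~30x amplification factor stated in the docstring:

```python
BYTES_PER_FLOAT64 = 8
AMPLIFICATION = 30  # ~30x temporary memory for probability prediction (10 clusters)


def peak_mb(n_rows: int) -> float:
    """Approximate peak memory (MB) of probability prediction for a column."""
    return n_rows * BYTES_PER_FLOAT64 * AMPLIFICATION / 1e6


full = peak_mb(1_000_000)   # whole 1M-row column at once: ~240 MB
chunked = peak_mb(131_072)  # one _MAX_CHUNK block: ~31 MB
```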
Code Evidence
Chunk size constant from `actgan/transformers.py:27`:

```python
_MAX_CHUNK = 131072
```
Class docstring explaining the optimization from `actgan/transformers.py:30-41`:

```python
class ClusterBasedNormalizer(RDTClusterBasedNormalizer):
    """A version of the ClusterBasedNormalizer with improved performance.

    This makes two changes to RDT's version of the ClusterBasedNormalizer:

    - To reduce memory pressure, input is split into chunks of size `_MAX_CHUNK_SIZE`,
      which are then processed individually. This is because probability prediction
      temporarily requires roughly 30x the memory of the data column (for 10 clusters),
      which can add significant memory pressure. This also means more NumPy calls, but
      the impact of this is negligible.
    - Instead of a for loop for sample selection, sample selection is done via a vectorized
      NumPy expression.
    """
```