Principle:Gretelai Gretel synthetics DataFrame Upsampling
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, Data_Augmentation |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
Data augmentation technique that ensures a training dataset meets a minimum size by repeating existing records.
Description
DataFrame Upsampling addresses the common problem of small training datasets. Many generative models require a minimum number of training examples to converge effectively. When the available data falls below this threshold, upsampling provides a simple strategy: repeat the entire dataset as many whole times as fit within the target size, then randomly sample additional rows to reach the exact target size. The repetition step is deterministic; only the final sampling step is random.
Unlike more sophisticated data augmentation techniques (e.g., SMOTE for tabular data or text augmentation), this approach makes no assumptions about the data distribution and introduces no synthetic variation. It is roughly equivalent to training on the same data for multiple epochs, but operates at the data level rather than the training loop level. This makes it model-agnostic and applicable to any downstream training pipeline.
The key trade-off is that upsampling by repetition does not increase the diversity of training examples. It ensures the model sees each example roughly equally often, but does not help with generalization beyond what the original data provides.
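To make the trade-off concrete, the following minimal sketch uses plain pandas on a made-up three-row DataFrame (the column name `text` and its values are illustrative assumptions, not taken from gretel-synthetics). It shows that repeating a dataset increases the row count while the number of distinct rows stays the same.

```python
import pandas as pd

# Hypothetical toy dataset used only for illustration.
original = pd.DataFrame({"text": ["alpha", "beta", "gamma"]})

# Three full copies of the same data: more rows, same content.
repeated = pd.concat([original] * 3, ignore_index=True)

n_rows = len(repeated)                        # total rows after repetition
n_distinct = len(repeated.drop_duplicates())  # distinct rows are unchanged

print(n_rows, n_distinct)
```

The row count triples, but dropping duplicates recovers exactly the original three rows, which is why repetition alone cannot improve generalization.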
Usage
Apply this principle when a training dataset has fewer rows than required by the downstream model or pipeline. It is particularly relevant for the LSTM text generation pipeline in gretel-synthetics, where small training sets may not produce stable models.
Theoretical Basis
The upsampling algorithm is straightforward:
Pseudo-code:
    # Abstract algorithm (NOT real implementation)
    def upsample(data, target_size):
        if len(data) >= target_size:
            return data  # no action needed

        # Repeat full copies
        repeat_count = target_size // len(data)
        result = concat([data] * repeat_count)

        # Sample remainder
        remainder = target_size - len(result)
        if remainder > 0:
            result = concat([result, sample(result, remainder)])
        return result
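The pseudo-code above can be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the gretel-synthetics implementation: the function name `upsample_to_size`, its `seed` parameter, and the empty-input check are hypothetical additions, and the remainder is drawn with `DataFrame.sample` without replacement.

```python
from typing import Optional

import pandas as pd


def upsample_to_size(
    df: pd.DataFrame, target_size: int, seed: Optional[int] = None
) -> pd.DataFrame:
    """Hypothetical helper: repeat rows of df until it has target_size rows."""
    if len(df) == 0:
        # Assumption: an empty input cannot be upsampled, so fail loudly.
        raise ValueError("cannot upsample an empty DataFrame")
    if len(df) >= target_size:
        return df  # already large enough; returned unchanged

    # Repeat whole copies of the dataset.
    repeat_count = target_size // len(df)
    result = pd.concat([df] * repeat_count, ignore_index=True)

    # Top up with randomly sampled rows to reach the target exactly.
    # The remainder is always smaller than len(df), so sampling without
    # replacement from the repeated data cannot run out of rows.
    remainder = target_size - len(result)
    if remainder > 0:
        extra = result.sample(n=remainder, random_state=seed)
        result = pd.concat([result, extra], ignore_index=True)
    return result


# Toy usage: 3 rows upsampled to 8 gives 2 full copies plus 2 sampled rows.
small = pd.DataFrame({"text": ["a", "b", "c"]})
big = upsample_to_size(small, target_size=8, seed=0)
counts = big["text"].value_counts()
```

Passing a fixed `seed` makes the sampled remainder reproducible; leaving it as `None` gives a different top-up on each call.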
When the input has fewer rows than the target, the algorithm guarantees:
- Every original row appears at least `floor(target_size / original_size)` times
- The remainder rows are drawn uniformly at random from the repeated data
- The output has exactly `target_size` rows
- No new data is synthesized; all rows are copies of originals

If the input already has `target_size` rows or more, it is returned unchanged.