Principle:Gretelai Gretel synthetics DataFrame Upsampling

From Leeroopedia
Knowledge Sources
Domains Data_Preprocessing, Data_Augmentation
Last Updated 2026-02-14 20:00 GMT

Overview

Data augmentation technique that ensures a training dataset meets a minimum size by repeating existing records.

Description

DataFrame Upsampling addresses the common problem of small training datasets. Many generative models require a minimum number of training examples to converge effectively. When the available data falls below this threshold, upsampling provides a simple strategy: repeat the entire dataset as many whole times as fit within the target size, then randomly sample the remaining rows needed to reach the exact target size. The full copies are deterministic; only the remainder involves randomness.
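As a worked example of the arithmetic, the split between full copies and sampled rows is a quotient and a remainder. The sketch below is illustrative only; the helper name `upsample_plan` is hypothetical and is not part of gretel-synthetics.

```python
def upsample_plan(original_size: int, target_size: int) -> tuple[int, int]:
    """Return (full_copies, sampled_rows) needed to reach target_size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    if original_size >= target_size:
        # Dataset is already large enough: keep the single original copy.
        return 1, 0
    # Quotient = whole copies of the dataset, remainder = rows to sample.
    return divmod(target_size, original_size)


full_copies, sampled_rows = upsample_plan(300, 1000)
print(full_copies, sampled_rows)  # 3 100
```

With 300 original rows and a target of 1000, three full copies give 900 rows and the remaining 100 are sampled.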

Unlike more sophisticated data augmentation techniques (e.g., SMOTE for tabular data or text augmentation), this approach makes no assumptions about the data distribution and introduces no synthetic variation. It is roughly equivalent to training on the same data for multiple epochs (the match is closest when the target size is a multiple of the original size), but operates at the data level rather than the training loop level. This makes it model-agnostic and applicable to any downstream training pipeline.

The key trade-off is that upsampling by repetition does not increase the diversity of training examples. It ensures the model sees each example roughly equally often, but does not help with generalization beyond what the original data provides.

Usage

Apply this principle when a training dataset has fewer rows than required by the downstream model or pipeline. It is particularly relevant for the LSTM text generation pipeline in gretel-synthetics, where small training sets may not produce stable models.

Theoretical Basis

The upsampling algorithm is straightforward:

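Sketch in Python (assuming pandas):

# Abstract algorithm (NOT real implementation)
import pandas as pd

def upsample(data: pd.DataFrame, target_size: int) -> pd.DataFrame:
    # Nothing to do if the data is already large enough;
    # an empty frame cannot be repeated (avoids division by zero).
    if len(data) == 0 or len(data) >= target_size:
        return data

    # Repeat full copies
    repeat_count = target_size // len(data)
    result = pd.concat([data] * repeat_count, ignore_index=True)

    # Sample remainder (without replacement) from the repeated data
    remainder = target_size - len(result)
    if remainder > 0:
        extra = result.sample(n=remainder)
        result = pd.concat([result, extra], ignore_index=True)

    return result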

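When upsampling is applied (the input is non-empty and has fewer than target_size rows), the algorithm guarantees:

  • Every original row appears at least floor(target_size / original_size) times
  • The remainder rows are drawn uniformly at random from the repeated data
  • The output has exactly target_size rows
  • No new data is synthesized; all rows are copies of originals

If the input already has target_size rows or more, it is returned unchanged, so the output may be larger than target_size.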
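The guarantees listed above can be checked on a toy input. The following is a minimal, dependency-free sketch that uses a plain list of tuples as a stand-in for DataFrame rows; the function name `upsample_rows` and its `seed` parameter are assumptions made for illustration, not the gretel-synthetics API.

```python
import random
from collections import Counter


def upsample_rows(rows, target_size, seed=None):
    """Repeat `rows` in full, then sample the remainder without replacement."""
    if not rows or len(rows) >= target_size:
        return list(rows)  # already large enough, or nothing to repeat
    rng = random.Random(seed)
    repeated = list(rows) * (target_size // len(rows))
    remainder = target_size - len(repeated)
    # remainder < len(rows) <= len(repeated), so sampling is always valid
    return repeated + rng.sample(repeated, remainder)


rows = [("a", 1), ("b", 2), ("c", 3)]
out = upsample_rows(rows, target_size=10, seed=0)
counts = Counter(out)

assert len(out) == 10                            # exact target size
assert set(out) == set(rows)                     # no new rows synthesized
assert min(counts.values()) >= 10 // len(rows)   # each row at least 3 times
```

Because the remainder is drawn from the repeated data, exactly which rows receive the extra copies depends on the random seed; the size and minimum-count checks hold regardless.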
