Implementation:Gretelai Gretel synthetics Upsample Df

From Leeroopedia
Knowledge Sources
Domains Data_Preprocessing, Utilities
Last Updated 2026-02-14 20:00 GMT

Overview

Concrete tool for ensuring a Pandas DataFrame meets a minimum row count by repeating and sampling existing data.

Description

The upsample_df function and its companion UpsampledDataFrame dataclass provide a simple mechanism for inflating a small DataFrame to a target size. When a DataFrame has fewer rows than target_size, the function concatenates full copies of the data and then samples additional rows to reach exactly target_size. If the DataFrame already has enough rows, its data is left unchanged. In both cases the result is wrapped in an UpsampledDataFrame, which records the original size and the number of records added (0 when no upsampling was needed).

This is used internally by the gretel-synthetics library to handle small training datasets, ensuring models receive a sufficient volume of training records.
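The repeat-and-sample behaviour described above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the library's source: the function name upsample_sketch, the fixed random_state, and the handling of an empty input are all choices made here for the example.

```python
import pandas as pd


def upsample_sketch(df: pd.DataFrame, target_size: int) -> pd.DataFrame:
    """Illustrative only: repeat whole copies of df, then sample the remainder."""
    original_size = len(df)
    if original_size == 0 or original_size >= target_size:
        # Assumption for this sketch: an empty or already-large input is returned as-is.
        return df

    # How many full copies fit into the target, and how many rows are still missing.
    full_copies, remainder = divmod(target_size, original_size)

    parts = [df] * full_copies
    if remainder:
        # remainder is always smaller than original_size, so sampling
        # without replacement is safe here.
        parts.append(df.sample(n=remainder, random_state=0))

    # Concatenate and renumber the index so it runs 0..target_size-1.
    return pd.concat(parts, ignore_index=True)


small = pd.DataFrame({"x": range(7)})
upsampled = upsample_sketch(small, target_size=100)
print(len(upsampled))  # 100 (14 full copies of 7 rows, plus 2 sampled rows)
```

Because every added row is a duplicate of an existing one, this kind of upsampling increases volume without adding new information.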

Usage

Import this function when you need to ensure a DataFrame has a minimum number of rows before passing it to a training pipeline. It is particularly useful for small datasets that would otherwise yield poor model quality because they contain too few training examples.

Code Reference

Source Location

Signature

@dataclass
class UpsampledDataFrame:
    df: pd.DataFrame
    """The new upsampled DataFrame."""
    original_size: int
    """The number of records that were originally in the DataFrame."""
    upsample_count: int
    """The number of additional records that were added."""


def upsample_df(df: pd.DataFrame, target_size: int) -> UpsampledDataFrame:
    """
    Given a DataFrame, ensure it has a minimum number of records in it.
    If the number of rows is less than target_size then the data will
    be repeated until reaching exactly target_size.

    Args:
        df: A Pandas DataFrame
        target_size: The target number of rows for the DataFrame

    Returns:
        An instance of UpsampledDataFrame
    """

Import

from gretel_synthetics.utils.data import upsample_df, UpsampledDataFrame

I/O Contract

Inputs

Name | Type | Required | Description
df | pd.DataFrame | Yes | The DataFrame to upsample
target_size | int | Yes | The minimum number of rows desired

Outputs

Name | Type | Description
result | UpsampledDataFrame | Dataclass containing the upsampled DataFrame, original size, and upsample count
result.df | pd.DataFrame | The DataFrame with at least target_size rows (or original if already large enough)
result.original_size | int | Number of rows in the input DataFrame
result.upsample_count | int | Number of rows added (0 if no upsampling needed)
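The output contract above implies a few relationships between the three fields that can be checked in a test. The following is a hedged sketch: FakeResult is a hypothetical stand-in that mirrors the fields of UpsampledDataFrame so the example runs without the library installed, and contract_violations is a helper name invented here. The third check assumes the "exact target" behaviour stated in the Description.

```python
from collections.abc import Sized
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResult:
    """Hypothetical stand-in with the same three fields as UpsampledDataFrame."""
    df: Sized  # anything with len(); a real result would hold a pd.DataFrame
    original_size: int
    upsample_count: int


def contract_violations(result, target_size: int) -> List[str]:
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    final_size = len(result.df)

    # Rows in the output are the original rows plus the added ones.
    if final_size != result.original_size + result.upsample_count:
        problems.append("final size != original_size + upsample_count")

    # The output must reach the requested minimum.
    if final_size < target_size:
        problems.append("final size is below target_size")

    # Assumption from the Description: rows are added only up to the exact target.
    expected_added = max(0, target_size - result.original_size)
    if result.upsample_count != expected_added:
        problems.append("upsample_count != max(0, target_size - original_size)")

    return problems


ok = FakeResult(df=[0] * 100, original_size=5, upsample_count=95)
bad = FakeResult(df=[0] * 90, original_size=5, upsample_count=85)
print(contract_violations(ok, target_size=100))   # []
print(len(contract_violations(bad, target_size=100)))  # 2
```

The same helper should work on a real UpsampledDataFrame, since it only reads the three documented attributes and calls len() on df.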

Usage Examples

Basic Upsampling

import pandas as pd
from gretel_synthetics.utils.data import upsample_df

# Small dataset with only 5 rows
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "score": [85, 92, 78, 95, 88],
})

# Upsample to at least 100 rows
result = upsample_df(df, target_size=100)

print(f"Original size: {result.original_size}")    # 5
print(f"Upsample count: {result.upsample_count}")  # 95
print(f"Final size: {len(result.df)}")              # 100

No-Op When Large Enough

import pandas as pd
from gretel_synthetics.utils.data import upsample_df

# If the DataFrame already meets the target, nothing changes
large_df = pd.DataFrame({"x": range(500)})
result = upsample_df(large_df, target_size=100)

print(f"Upsample count: {result.upsample_count}")  # 0
print(f"Same object: {result.df is large_df}")      # True
