Implementation:Gretelai Gretel synthetics Upsample Df
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, Utilities |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
Concrete tool for ensuring a Pandas DataFrame meets a minimum row count by repeating and sampling existing data.
Description
The upsample_df function and its companion UpsampledDataFrame dataclass provide a simple mechanism to inflate small DataFrames to a target size. When a DataFrame has fewer rows than target_size, the function concatenates full copies of the data and then samples additional rows to reach exactly target_size. The result is wrapped in an UpsampledDataFrame dataclass that records the original size and the number of added records. If the DataFrame already has at least target_size rows, it is returned unchanged in the same wrapper, with an upsample count of 0.
This is used internally by the gretel-synthetics library to handle small training datasets, ensuring models receive a sufficient volume of training records.
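The repeat-then-sample behaviour described above can be sketched in plain pandas. This is a minimal illustration, not the library's source: the names UpsampleSketch and upsample_sketch are hypothetical, and the fixed random_state and the early return for an empty DataFrame are assumptions made here so that the example is deterministic and safe to run.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class UpsampleSketch:
    """Hypothetical stand-in mirroring the fields of UpsampledDataFrame."""

    df: pd.DataFrame
    original_size: int
    upsample_count: int


def upsample_sketch(df: pd.DataFrame, target_size: int) -> UpsampleSketch:
    """Illustrative re-creation of the described behaviour (assumed logic)."""
    original_size = len(df)

    # Nothing to do when the frame is already large enough.
    # The empty-frame guard is an assumption of this sketch.
    if original_size == 0 or original_size >= target_size:
        return UpsampleSketch(df, original_size, 0)

    # Number of whole copies that fit, plus the rows still missing.
    full_copies, remainder = divmod(target_size, original_size)

    parts = [df] * full_copies
    if remainder:
        # remainder < original_size, so sampling without replacement works.
        # random_state is fixed only to make the sketch reproducible.
        parts.append(df.sample(n=remainder, random_state=0))

    upsampled = pd.concat(parts, ignore_index=True)
    return UpsampleSketch(upsampled, original_size, target_size - original_size)


small = pd.DataFrame({"x": range(7)})
result = upsample_sketch(small, target_size=100)
print(len(result.df), result.original_size, result.upsample_count)  # 100 7 93
```

With 7 input rows and a target of 100, the sketch stacks 14 full copies (98 rows) and samples 2 more, so every original row appears at least 14 times in the output.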
Usage
Import this function when you need to ensure a DataFrame has a minimum number of rows before passing it to a training pipeline. It is particularly useful for small datasets that would otherwise yield poor model quality because they contain too few training examples.
Code Reference
Source Location
- Repository: Gretelai_Gretel_synthetics
- File: src/gretel_synthetics/utils/data.py
- Lines: 1-67
Signature
@dataclass
class UpsampledDataFrame:
    df: pd.DataFrame
    """The new upsampled DataFrame."""

    original_size: int
    """The number of records that were originally in the DataFrame."""

    upsample_count: int
    """The number of additional records that were added."""


def upsample_df(df: pd.DataFrame, target_size: int) -> UpsampledDataFrame:
    """
    Given a DataFrame, ensure it has a minimum number of records in it.
    If the number of rows is less than target_size then the data will
    be repeated until reaching exactly target_size.

    Args:
        df: A Pandas DataFrame
        target_size: The target number of rows for the DataFrame

    Returns:
        An instance of UpsampledDataFrame
    """
Import
from gretel_synthetics.utils.data import upsample_df, UpsampledDataFrame
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes | The DataFrame to upsample |
| target_size | int | Yes | The minimum number of rows desired |
Outputs
| Name | Type | Description |
|---|---|---|
| result | UpsampledDataFrame | Dataclass containing the upsampled DataFrame, original size, and upsample count |
| result.df | pd.DataFrame | A DataFrame with exactly target_size rows after upsampling, or the original DataFrame if it already had at least target_size rows |
| result.original_size | int | Number of rows in the input DataFrame |
| result.upsample_count | int | Number of rows added (0 if no upsampling needed) |
Usage Examples
Basic Upsampling
import pandas as pd
from gretel_synthetics.utils.data import upsample_df
# Small dataset with only 5 rows
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "score": [85, 92, 78, 95, 88],
})
# Upsample to at least 100 rows
result = upsample_df(df, target_size=100)
print(f"Original size: {result.original_size}") # 5
print(f"Upsample count: {result.upsample_count}") # 95
print(f"Final size: {len(result.df)}") # 100
No-Op When Large Enough
# If the DataFrame already meets the target, nothing changes
large_df = pd.DataFrame({"x": range(500)})
result = upsample_df(large_df, target_size=100)
print(f"Upsample count: {result.upsample_count}") # 0
print(f"Same object: {result.df is large_df}") # True