Implementation: gretel-synthetics DataFrameBatch.create_training_data()
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
A concrete tool from the gretel-synthetics library for splitting a source DataFrame into per-batch training CSV files.
Description
DataFrameBatch.create_training_data() iterates over every Batch object in the batches dictionary and selects the columns belonging to that batch from the source DataFrame. It stores a deep copy of the resulting sub-DataFrame on the batch's training_df attribute, then writes the sub-DataFrame to a CSV file at the batch's input_data_path. The CSV is written without headers and without the DataFrame index, using the configured field delimiter.
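The behaviour described above can be sketched independently of the library. This minimal reproduction, using only pandas, mimics what the method does for each batch; the variable names (source_df, batch_headers, out_dir) are illustrative stand-ins for DataFrameBatch._source_df, the per-batch headers, and the batch directories:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins; real values come from DataFrameBatch._source_df
# and each Batch object's headers / input_data_path.
source_df = pd.DataFrame({
    "a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]
})
batch_headers = {0: ["a", "b"], 1: ["c", "d"]}
field_delimiter = ","

out_dir = tempfile.mkdtemp()
training_dfs = {}
for idx, headers in batch_headers.items():
    # Select this batch's columns and keep a deep copy, as the real
    # method does for batch.training_df.
    sub_df = source_df[headers].copy(deep=True)
    training_dfs[idx] = sub_df
    # Write a headerless, index-free CSV to the batch's input path.
    path = os.path.join(out_dir, f"train_{idx}.csv")
    sub_df.to_csv(path, header=False, index=False, sep=field_delimiter)
```

Each batch ends up with the same number of rows as the source but only its own slice of columns, matching the docstring quoted below.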
The underlying Batch dataclass (lines 96-176) is the per-batch state container that holds the checkpoint directory path, training data path, column headers, model configuration, and all generation-related accumulators (gen_data_stream, gen_data_count, gen_data_invalid, validator). It also provides helper methods such as reset_gen_data(), add_valid_data(), get_validator(), and the synthetic_df property for reading back generated data.
Usage
Call this method exactly once after constructing a DataFrameBatch in write mode and before calling train_all_batches() or train_batch(). It must not be called in read mode.
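For orientation, the batches dictionary that this method iterates over is built during __init__ by chunking the source columns into groups of at most batch_size columns. The sketch below illustrates that partitioning only; the names are ours, not the library's internals:

```python
# Illustrative only: chunking N column names into header groups of at
# most batch_size columns each, as DataFrameBatch does when it builds
# its batches dictionary.
columns = [f"col_{i}" for i in range(23)]
batch_size = 10

header_batches = [
    columns[i:i + batch_size] for i in range(0, len(columns), batch_size)
]
# 23 columns with batch_size=10 yields groups of 10, 10, and 3 columns.
```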
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py
- Lines: 1133-1156 (create_training_data), 96-176 (Batch dataclass)
Signature
def create_training_data(self):
    """Split the original DataFrame into N smaller DataFrames. Each
    smaller DataFrame will have the same number of rows, but a subset
    of the columns from the original DataFrame.

    This method iterates over each ``Batch`` object and assigns
    a smaller training DataFrame to the ``training_df`` attribute
    of the object.

    Finally, a training CSV is written to disk in the specific
    batch directory
    """
Batch dataclass:
@dataclass
class Batch:
    checkpoint_dir: str
    input_data_path: str
    headers: List[str]
    config: LocalConfig
    gen_data_count: int = 0
    training_df: Type[pd.DataFrame] = field(default_factory=lambda: None, init=False)
    gen_data_stream: io.StringIO = field(default_factory=io.StringIO, init=False)
    gen_data_invalid: List[GenText] = field(default_factory=list, init=False)
    validator: Callable = field(default_factory=lambda: None, init=False)
Import
from gretel_synthetics.batch import DataFrameBatch
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| self | DataFrameBatch | Yes | Must be in write mode with a valid _source_df and populated batches dictionary. |
The method takes no explicit arguments. It reads from:
- self._source_df — the original DataFrame (set during __init__)
- self.batches — the dictionary of Batch objects (each with headers and input_data_path)
- self.config["field_delimiter"] — the delimiter string for CSV serialisation
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) batch.training_df | pd.DataFrame | Deep copy of the column-subset DataFrame stored on each Batch. |
| (side effect) CSV file | file on disk | Headerless, index-free CSV written to batch.input_data_path for each batch. |
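Because the CSV files are headerless, reading one back requires supplying the column names explicitly. A small round-trip sketch of that output contract, with illustrative paths and names (in the real library the names come from batch.headers and the path from batch.input_data_path):

```python
import os
import tempfile

import pandas as pd

delim = ","
headers = ["x", "y"]  # stand-in for batch.headers
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

path = os.path.join(tempfile.mkdtemp(), "train_0.csv")
# Mirror the output contract: no header row, no index column.
df.to_csv(path, header=False, index=False, sep=delim)

# Round trip: header=None plus explicit names restores the frame.
restored = pd.read_csv(path, header=None, names=headers, sep=delim)
```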
Usage Examples
Basic Example
import pandas as pd

from gretel_synthetics.batch import DataFrameBatch

# A small example frame; substitute your own source data.
my_dataframe = pd.DataFrame({
    "age": [34, 41], "income": [52000, 61000], "score": [0.7, 0.9]
})

config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}

batcher = DataFrameBatch(df=my_dataframe, batch_size=10, config=config)

# Create per-batch training CSV files
batcher.create_training_data()

# Inspect training data for batch 0
print(batcher.batches[0].training_df.head())
print(f"Training file: {batcher.batches[0].input_data_path}")