Principle: Gretel.ai gretel-synthetics Batch Data Export
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Batch data export is the process of slicing a source DataFrame into per-batch training CSV files so that each column-cluster model receives only the subset of columns it is responsible for.
Description
After column clustering partitions a wide DataFrame into groups, each group must be materialised as an independent training dataset before a model can be trained on it. Batch data export takes the original DataFrame, selects the columns belonging to each batch, and writes them to a dedicated CSV file inside the batch's checkpoint directory.
Key design decisions in this process (see the sketch after the list):
- Column selection — For each batch, only the columns listed in that batch's headers attribute are extracted from the source DataFrame.
- Deep copy — The extracted sub-DataFrame is deep-copied and stored on the Batch.training_df attribute so downstream code can inspect or transform it without mutating the source.
- Headerless CSV — The CSV is written without a header row (header=False) and without the index (index=False), using the field delimiter configured in the batch config. This format matches what the underlying language model tokeniser expects as raw line-oriented training text.
- NaN handling — During DataFrameBatch.__init__, the source DataFrame's NaN values are filled with empty strings (fillna("")), ensuring that no literal "NaN" tokens leak into the training data.
Each Batch dataclass tracks the state needed for every phase of the workflow (a dataclass sketch follows the list):
- checkpoint_dir — path to the batch's directory
- input_data_path — path to the training CSV file
- headers — ordered list of column names for this batch
- config — the per-batch model configuration
- gen_data_stream / gen_data_count — in-memory buffer and counter for generated lines
- gen_data_invalid — list of invalid generated lines
- validator — optional line-level validation callable
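A minimal dataclass sketch of this state, with plausible types assumed for each field; the real Batch in gretel-synthetics may differ in field types and defaults:

```python
import io
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Batch:
    """Per-batch state for training and generation (illustrative sketch)."""

    checkpoint_dir: str                  # path to the batch's directory
    input_data_path: str                 # path to the training CSV file
    headers: List[str]                   # ordered column names for this batch
    config: dict                         # per-batch model config (type assumed)
    gen_data_stream: io.StringIO = field(default_factory=io.StringIO)
    gen_data_count: int = 0              # counter for generated lines
    gen_data_invalid: List[str] = field(default_factory=list)
    validator: Optional[Callable[[str], bool]] = None  # line-level validation
```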
Usage
Use batch data export immediately after constructing a DataFrameBatch in write mode and before training any models. It is a mandatory step: without the per-batch CSV files on disk, the training step will have no input data.
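A usage sketch, assuming the gretel-synthetics batch API (DataFrameBatch constructed in its default write mode, followed by create_training_data); check the installed version's documentation for exact signatures:

```python
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("wide_table.csv")  # hypothetical input file

# Per-batch config template; keys here are illustrative, and the
# per-batch checkpoint directories are derived from checkpoint_dir.
config_template = {"epochs": 100, "checkpoint_dir": "checkpoints"}

batches = DataFrameBatch(df=source_df, config=config_template)

# Mandatory export step: writes each batch's headerless training CSV
# under its checkpoint directory before any model is trained.
batches.create_training_data()
```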
Theoretical Basis
The export step implements a straightforward vertical partitioning of a relational table. Given a table T with columns C_1, ..., C_n and a partitioning P = {P_1, ..., P_k} where each P_i is a subset of {C_1, ..., C_n} and their union covers all columns, the export creates k projections:
for each P_i in P:
    T_i = project(T, P_i)
    write_csv(T_i, batch_dir_i / "train.csv")
The CSV is written as a delimiter-separated values file without headers because the character-level model treats each row as a single training sequence. The delimiter (e.g., comma, pipe) becomes part of the token vocabulary learned during tokeniser training.
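A small pandas sketch of this projection scheme, verifying that the partition covers every column before writing each projection as a headerless, pipe-delimited file; the example table, the two-group partition, and the batch_i directory names are all illustrative:

```python
from pathlib import Path

import pandas as pd

T = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
P = [["a", "b"], ["c", "d"]]  # illustrative partition of the columns

# Sanity check: the partition must cover every column exactly once.
assert sorted(col for part in P for col in part) == sorted(T.columns)

for i, P_i in enumerate(P):
    batch_dir = Path(f"batch_{i}")
    batch_dir.mkdir(exist_ok=True)
    T_i = T[P_i]  # project(T, P_i)
    # Headerless, index-free CSV: each row is one raw training sequence,
    # and the pipe delimiter becomes part of the learned vocabulary.
    T_i.to_csv(batch_dir / "train.csv", header=False, index=False, sep="|")
```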