Principle: Gretel.ai gretel-synthetics Batch Data Export
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Batch data export is the process of slicing a source DataFrame into per-batch training CSV files so that each column-cluster model receives only the subset of columns it is responsible for.
Description
After column clustering partitions a wide DataFrame into groups, each group must be materialised as an independent training dataset before a model can be trained on it. Batch data export takes the original DataFrame, selects the columns belonging to each batch, and writes them to a dedicated CSV file inside the batch's checkpoint directory.
Key design decisions in this process (see the sketch after the list):
- Column selection — For each batch, only the columns listed in that batch's headers attribute are extracted from the source DataFrame.
- Deep copy — The extracted sub-DataFrame is deep-copied and stored on the Batch.training_df attribute so downstream code can inspect or transform it without mutating the source.
- Headerless CSV — The CSV is written without a header row (header=False) and without the index (index=False), using the field delimiter configured in the batch config. This format matches what the underlying language model tokeniser expects as raw line-oriented training text.
- NaN handling — During DataFrameBatch.__init__, the source DataFrame's NaN values are filled with empty strings (fillna("")), ensuring that no literal "NaN" tokens leak into the training data.
Each Batch dataclass tracks the state needed for every phase of the workflow (a dataclass sketch follows the list):
- checkpoint_dir — path to the batch's directory
- input_data_path — path to the training CSV file
- headers — ordered list of column names for this batch
- config — the per-batch model configuration
- gen_data_stream / gen_data_count — in-memory buffer and counter for generated lines
- gen_data_invalid — list of invalid generated lines
- validator — optional line-level validation callable
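A minimal dataclass sketch of this state, with plausible types assumed for each field; the real Batch in gretel-synthetics may differ in field types and defaults:

```python
import io
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Batch:
    """Per-batch state for training and generation (illustrative sketch)."""

    checkpoint_dir: str                  # path to the batch's directory
    input_data_path: str                 # path to the training CSV file
    headers: List[str]                   # ordered column names for this batch
    config: dict                         # per-batch model config (type assumed)
    gen_data_stream: io.StringIO = field(default_factory=io.StringIO)
    gen_data_count: int = 0              # counter for generated lines
    gen_data_invalid: List[str] = field(default_factory=list)
    validator: Optional[Callable[[str], bool]] = None  # line-level validation
```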
Usage
Use batch data export immediately after constructing a DataFrameBatch in write mode and before training any models. It is a mandatory step: without the per-batch CSV files on disk, the training step will have no input data.
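A usage sketch, assuming the gretel-synthetics batch API (DataFrameBatch constructed in its default write mode, followed by create_training_data); check the installed version's documentation for exact signatures:

```python
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("wide_table.csv")  # hypothetical input file

# Per-batch config template; keys here are illustrative, and the
# per-batch checkpoint directories are derived from checkpoint_dir.
config_template = {"epochs": 100, "checkpoint_dir": "checkpoints"}

batches = DataFrameBatch(df=source_df, config=config_template)

# Mandatory export step: writes each batch's headerless training CSV
# under its checkpoint directory before any model is trained.
batches.create_training_data()
```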
Theoretical Basis
The export step implements a straightforward vertical partitioning of a relational table. Given a table T with columns C_1, ..., C_n and a partitioning P = {P_1, ..., P_k} where each P_i is a subset of {C_1, ..., C_n} and their union covers all columns, the export creates k projections:
for each P_i in P:
    T_i = project(T, P_i)
    write_csv(T_i, batch_dir_i / "train.csv")
The CSV is written as a delimiter-separated values file without headers because the character-level model treats each row as a single training sequence. The delimiter (e.g., comma, pipe) becomes part of the token vocabulary learned during tokeniser training.
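A small pandas sketch of this projection scheme, verifying that the partition covers every column before writing each projection as a headerless, pipe-delimited file; the example table, the two-group partition, and the batch_i directory names are all illustrative:

```python
from pathlib import Path

import pandas as pd

T = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
P = [["a", "b"], ["c", "d"]]  # illustrative partition of the columns

# Sanity check: the partition must cover every column exactly once.
assert sorted(col for part in P for col in part) == sorted(T.columns)

for i, P_i in enumerate(P):
    batch_dir = Path(f"batch_{i}")
    batch_dir.mkdir(exist_ok=True)
    T_i = T[P_i]  # project(T, P_i)
    # Headerless, index-free CSV: each row is one raw training sequence,
    # and the pipe delimiter becomes part of the learned vocabulary.
    T_i.to_csv(batch_dir / "train.csv", header=False, index=False, sep="|")
```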