Principle:Gretelai Gretel synthetics Batch Data Export

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Tabular_Data
Last Updated 2026-02-14 19:00 GMT

Overview

Batch data export is the process of slicing a source DataFrame into per-batch training CSV files so that each column-cluster model receives only the subset of columns it is responsible for.

Description

After column clustering partitions a wide DataFrame into groups, each group must be materialised as an independent training dataset before a model can be trained on it. Batch data export takes the original DataFrame, selects the columns belonging to each batch, and writes them to a dedicated CSV file inside the batch's checkpoint directory.

Key design decisions in this process:

  • Column selection — For each batch, only the columns listed in that batch's headers attribute are extracted from the source DataFrame.
  • Deep copy — The extracted sub-DataFrame is deep-copied and stored on the Batch.training_df attribute so downstream code can inspect or transform it without mutating the source.
  • Headerless CSV — The CSV is written without a header row (header=False) and without the index (index=False). The field delimiter configured in the batch config is used. This format matches what the underlying language model tokeniser expects as raw line-oriented training text.
  • NaN handling — During DataFrameBatch.__init__, the source DataFrame's NaN values are filled with empty strings (fillna("")), ensuring that no literal "NaN" tokens leak into the training data.
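The design decisions above can be collected into a minimal sketch, assuming pandas and simple batch objects carrying `headers`, `checkpoint_dir`, and a `config` with a `field_delimiter`. The function name `export_batch_data` and the `train_{idx}.csv` filename are illustrative, not the library's actual API:

```python
from pathlib import Path

import pandas as pd


def export_batch_data(source_df: pd.DataFrame, batches: dict) -> None:
    """Write one headerless training CSV per batch (illustrative sketch)."""
    # NaN handling: fill once up front so no literal "NaN" tokens leak
    # into the training data
    df = source_df.fillna("")
    for idx, batch in batches.items():
        # Column selection + deep copy: only this batch's columns, detached
        # from the source so downstream transforms cannot mutate it
        batch.training_df = df[batch.headers].copy()
        out_path = Path(batch.checkpoint_dir) / f"train_{idx}.csv"
        batch.input_data_path = str(out_path)
        # Headerless CSV: no header row, no index, configured delimiter
        batch.training_df.to_csv(
            out_path, header=False, index=False,
            sep=batch.config.field_delimiter,
        )
```

Note that `fillna("")` is applied to a working copy, so the caller's source DataFrame is left untouched.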

Each Batch dataclass tracks the state needed for every phase of the workflow:

  • checkpoint_dir — path to the batch's directory
  • input_data_path — path to the training CSV file
  • headers — ordered list of column names for this batch
  • config — the per-batch model configuration
  • gen_data_stream / gen_data_count — in-memory buffer and counter for generated lines
  • gen_data_invalid — list of invalid generated lines
  • validator — optional line-level validation callable
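The field list above maps naturally onto a dataclass. The following is a simplified sketch, not the library's actual definition; the field types and defaults are illustrative guesses (for example, the real generated-line buffer may be an in-memory stream object rather than a list):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Batch:
    """Simplified per-batch state record (types are illustrative)."""
    checkpoint_dir: str                       # batch's working directory
    input_data_path: str                      # path to the training CSV file
    headers: List[str]                        # ordered column names for this batch
    config: object                            # per-batch model configuration
    gen_data_stream: List[str] = field(default_factory=list)   # generated-line buffer
    gen_data_count: int = 0                   # counter for generated lines
    gen_data_invalid: List[str] = field(default_factory=list)  # invalid generated lines
    validator: Optional[Callable[[str], bool]] = None          # line-level validation
```

Using `field(default_factory=list)` gives every batch its own independent buffers rather than one shared mutable default.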

Usage

Use batch data export immediately after constructing a DataFrameBatch in write mode and before training any models. It is a mandatory step: without the per-batch CSV files on disk, the training step will have no input data.
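Because export is a hard precondition for training, the ordering can be made explicit with a small guard. The helper below is hypothetical, not part of the library, and the file layout is illustrative:

```python
import tempfile
from pathlib import Path

import pandas as pd


def assert_batches_exported(input_paths) -> None:
    """Hypothetical pre-training guard: every batch CSV must exist on disk."""
    missing = [str(p) for p in input_paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"run batch data export first; missing: {missing}")


# Export one toy batch, then confirm training may proceed
work = Path(tempfile.mkdtemp())
csv_path = work / "train_0.csv"
pd.DataFrame({"a": [1], "b": [2]}).to_csv(csv_path, header=False, index=False)
assert_batches_exported([csv_path])  # no exception: safe to train
```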

Theoretical Basis

The export step implements a straightforward vertical partitioning of a relational table. Given a table T with columns C_1, ..., C_n and a partitioning P = {P_1, ..., P_k}, where the P_i are pairwise-disjoint subsets of {C_1, ..., C_n} whose union covers all columns, the export creates k projections:

for i, P_i in enumerate(partitions):
    T_i = T[P_i]                                   # projection of T onto P_i
    T_i.to_csv(batch_dirs[i] / "train.csv", header=False, index=False)

The CSV is written as a delimiter-separated values file without headers because the character-level model treats each row as a single training sequence. The delimiter (e.g., comma, pipe) becomes part of the token vocabulary learned during tokeniser training.
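To make this concrete, a toy projection written with a pipe delimiter shows how each row becomes a single delimiter-bearing training line (the table, partitioning, and delimiter choice here are all illustrative):

```python
import pandas as pd

# Toy table T and a two-block partitioning P covering all columns
T = pd.DataFrame({"c1": [1, 2], "c2": ["a", "b"], "c3": ["x", "y"]})
P = [["c1", "c2"], ["c3"]]

# T_1 = project(T, P_1), serialised headerless with a pipe delimiter;
# to_csv with no path returns the CSV text as a string
lines = T[P[0]].to_csv(header=False, index=False, sep="|")
print(lines)  # "1|a\n2|b\n" — each line is one training sequence, "|" is literal text
```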
