Implementation: gretel-synthetics DataFrameBatch.create_training_data()
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
A concrete tool from the gretel-synthetics library for splitting a source DataFrame into per-batch training CSV files.
Description
DataFrameBatch.create_training_data() iterates over every Batch object in the batches dictionary and selects the columns belonging to that batch from the source DataFrame. It stores a deep copy of the resulting sub-DataFrame on the batch's training_df attribute, then writes the sub-DataFrame to a CSV file at the batch's input_data_path. The CSV is written without headers and without the DataFrame index, using the configured field delimiter.
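The behaviour described above can be sketched independently of the library. This minimal reproduction, using only pandas, mimics what the method does for each batch; the variable names (source_df, batch_headers, out_dir) are illustrative stand-ins for DataFrameBatch._source_df, the per-batch headers, and the batch directories:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins; real values come from DataFrameBatch._source_df
# and each Batch object's headers / input_data_path.
source_df = pd.DataFrame({
    "a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]
})
batch_headers = {0: ["a", "b"], 1: ["c", "d"]}
field_delimiter = ","

out_dir = tempfile.mkdtemp()
training_dfs = {}
for idx, headers in batch_headers.items():
    # Select this batch's columns and keep a deep copy, as the real
    # method does for batch.training_df.
    sub_df = source_df[headers].copy(deep=True)
    training_dfs[idx] = sub_df
    # Write a headerless, index-free CSV to the batch's input path.
    path = os.path.join(out_dir, f"train_{idx}.csv")
    sub_df.to_csv(path, header=False, index=False, sep=field_delimiter)
```

Each batch ends up with the same number of rows as the source but only its own slice of columns, matching the docstring quoted below.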
The underlying Batch dataclass (lines 96-176) is the per-batch state container that holds the checkpoint directory path, training data path, column headers, model configuration, and all generation-related accumulators (gen_data_stream, gen_data_count, gen_data_invalid, validator). It also provides helper methods such as reset_gen_data(), add_valid_data(), get_validator(), and the synthetic_df property for reading back generated data.
Usage
Call this method exactly once after constructing a DataFrameBatch in write mode and before calling train_all_batches() or train_batch(). It must not be called in read mode.
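For orientation, the batches dictionary that this method iterates over is built during __init__ by chunking the source columns into groups of at most batch_size columns. The sketch below illustrates that partitioning only; the names are ours, not the library's internals:

```python
# Illustrative only: chunking N column names into header groups of at
# most batch_size columns each, as DataFrameBatch does when it builds
# its batches dictionary.
columns = [f"col_{i}" for i in range(23)]
batch_size = 10

header_batches = [
    columns[i:i + batch_size] for i in range(0, len(columns), batch_size)
]
# 23 columns with batch_size=10 yields groups of 10, 10, and 3 columns.
```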
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py
- Lines: 1133-1156 (create_training_data), 96-176 (Batch dataclass)
Signature
def create_training_data(self):
    """Split the original DataFrame into N smaller DataFrames. Each
    smaller DataFrame will have the same number of rows, but a subset
    of the columns from the original DataFrame.

    This method iterates over each ``Batch`` object and assigns
    a smaller training DataFrame to the ``training_df`` attribute
    of the object.

    Finally, a training CSV is written to disk in the specific
    batch directory
    """
Batch dataclass:
@dataclass
class Batch:
    checkpoint_dir: str
    input_data_path: str
    headers: List[str]
    config: LocalConfig
    gen_data_count: int = 0
    training_df: Type[pd.DataFrame] = field(default_factory=lambda: None, init=False)
    gen_data_stream: io.StringIO = field(default_factory=io.StringIO, init=False)
    gen_data_invalid: List[GenText] = field(default_factory=list, init=False)
    validator: Callable = field(default_factory=lambda: None, init=False)
Import
from gretel_synthetics.batch import DataFrameBatch
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| self | DataFrameBatch | Yes | Must be in write mode with a valid _source_df and populated batches dictionary. |
The method takes no explicit arguments. It reads from:
- self._source_df — the original DataFrame (set during __init__)
- self.batches — the dictionary of Batch objects (each with headers and input_data_path)
- self.config["field_delimiter"] — the delimiter string for CSV serialisation
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) batch.training_df | pd.DataFrame | Deep copy of the column-subset DataFrame stored on each Batch. |
| (side effect) CSV file | file on disk | Headerless, index-free CSV written to batch.input_data_path for each batch. |
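Because the CSV files are headerless, reading one back requires supplying the column names explicitly. A small round-trip sketch of that output contract, with illustrative paths and names (in the real library the names come from batch.headers and the path from batch.input_data_path):

```python
import os
import tempfile

import pandas as pd

delim = ","
headers = ["x", "y"]  # stand-in for batch.headers
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

path = os.path.join(tempfile.mkdtemp(), "train_0.csv")
# Mirror the output contract: no header row, no index column.
df.to_csv(path, header=False, index=False, sep=delim)

# Round trip: header=None plus explicit names restores the frame.
restored = pd.read_csv(path, header=None, names=headers, sep=delim)
```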
Usage Examples
Basic Example
import pandas as pd

from gretel_synthetics.batch import DataFrameBatch

# A small example frame; substitute your own source data.
my_dataframe = pd.DataFrame({
    "age": [34, 41], "income": [52000, 61000], "score": [0.7, 0.9]
})

config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}

batcher = DataFrameBatch(df=my_dataframe, batch_size=10, config=config)

# Create per-batch training CSV files
batcher.create_training_data()

# Inspect training data for batch 0
print(batcher.batches[0].training_df.head())
print(f"Training file: {batcher.batches[0].input_data_path}")