
Principle:Gretelai Gretel synthetics Batch Model Training

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Tabular_Data
Last Updated 2026-02-14 19:00 GMT

Overview

Batch model training is the strategy of training one independent generative model per column-cluster batch, so that each model learns only the joint distribution of its assigned subset of columns.

Description

In the batch synthesis workflow, a wide DataFrame has already been partitioned into k column groups (batches), and each batch has its own training CSV on disk. Batch model training iterates over every batch and invokes the underlying train() function, which handles tokeniser training, model construction, and weight optimisation for that batch's data.

Key design properties:

  • Independence — Each batch trains its own model in its own checkpoint directory. Models share no weights and can in principle be trained in parallel (though the current implementation trains them sequentially).
  • Configuration isolation — During DataFrameBatch.__init__, the template configuration is deep-copied for every batch and updated with the batch-specific checkpoint_dir and input_data_path. This ensures that each model writes its checkpoints, tokeniser artifacts, and logs to the correct location without cross-contamination.
  • Custom tokeniser support — An optional BaseTokenizerTrainer subclass can be passed to the DataFrameBatch. When training a batch, the tokeniser is deep-copied and its config is pointed at the current batch's config, allowing per-batch tokeniser customisation.
  • Epoch callbacks — If the user supplies an epoch callback in the config, it is wrapped in a _BatchEpochCallback that injects the batch number into the EpochState object, making it possible to monitor which batch is currently training.
  • Write-mode guard — Both train_batch() and train_all_batches() raise a RuntimeError if called in read mode, because training is only valid when batch directories have been freshly created.
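The configuration-isolation property above can be sketched in a few lines. This is an illustrative stand-in, not the library's actual code: the function name `make_batch_configs` and the dict-based config template are assumptions for the example; only the `checkpoint_dir` and `input_data_path` keys mirror the fields named in the description.

```python
import copy
from pathlib import Path

def make_batch_configs(template: dict, n_batches: int, base_dir: str) -> dict:
    """Deep-copy the template once per batch and point each copy at its own
    checkpoint directory and training CSV, so batches share no mutable state."""
    configs = {}
    for idx in range(n_batches):
        cfg = copy.deepcopy(template)  # isolate each batch's configuration
        batch_dir = Path(base_dir) / f"batch_{idx}"
        cfg["checkpoint_dir"] = str(batch_dir)
        cfg["input_data_path"] = str(batch_dir / f"train_{idx}.csv")
        configs[idx] = cfg
    return configs

configs = make_batch_configs({"epochs": 100, "vocab_size": 20000},
                             n_batches=3, base_dir="ckpts")
print(configs[0]["checkpoint_dir"])  # ckpts/batch_0
```

Because each config is a deep copy, mutating one batch's settings (for example, lowering its epoch count) leaves every other batch untouched.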

Usage

Use batch model training after create_training_data() has written the per-batch CSVs and before any generation step. You may call train_all_batches() to train every batch in sequence, or train_batch(idx) to selectively train or re-train a single batch.
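The required call order and the write-mode guard can be illustrated with a minimal stand-in class. This is not the real DataFrameBatch: the method names mirror those described above, but the class body is a simplified sketch.

```python
class BatchTrainer:
    """Illustrative stand-in for a batch-training workflow object."""

    def __init__(self, mode: str = "write"):
        self.mode = mode
        self.training_data_ready = False

    def create_training_data(self):
        # In the real workflow this writes one training CSV per batch.
        self.training_data_ready = True

    def train_all_batches(self):
        if self.mode == "read":
            # Training is only valid on freshly created batch directories.
            raise RuntimeError("cannot train in read mode")
        assert self.training_data_ready, "call create_training_data() first"
        return "trained"

trainer = BatchTrainer()
trainer.create_training_data()
print(trainer.train_all_batches())  # trained
```

Constructing the object in read mode and calling `train_all_batches()` raises the RuntimeError described under the write-mode guard.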

Theoretical Basis

Batch model training applies the principle of divide and conquer to generative modelling of tabular data. Instead of learning the full joint distribution P(C_1, C_2, ..., C_n) over all n columns, we approximate it as:

P(C_1, ..., C_n) ≈ P_1(B_1) * P_2(B_2) * ... * P_k(B_k)

where each B_i is a batch (a subset of columns) and P_i is the model learned for that batch. This factorisation assumes conditional independence between batches given the row identity, which is an approximation. Column clustering mitigates the approximation error by grouping correlated columns together so that the strongest dependencies are captured within a single model.
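A tiny numeric example makes the approximation error concrete. The joint distribution below is invented for illustration: two binary columns A and B are placed in separate single-column batches, so each batch model learns only its own marginal, and the product of marginals understates the true correlation.

```python
from itertools import product

# Invented joint distribution P(A, B) over two correlated binary columns.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Each single-column batch model learns only its marginal.
p_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
p_b = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)}

# The batch factorisation P_1(A) * P_2(B).
approx = {(a, b): p_a[a] * p_b[b] for a, b in product((0, 1), repeat=2)}

print(approx[(0, 0)])  # 0.25, vs the true 0.4: cross-batch correlation is lost
```

Placing A and B in the same batch would let one model learn the full joint table, which is exactly why column clustering groups correlated columns together.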

Training loop pseudocode:

function train_all_batches(batches):
    log_batch_sizes(batches)                           # report rows/columns per batch
    for idx, batch in batches.items():                 # batches maps index -> batch
        tokenizer_copy = deep_copy(tokenizer) if tokenizer else None
        if tokenizer_copy:
            tokenizer_copy.config = batch.config       # point tokeniser at this batch
        train(batch.config, tokenizer_copy)            # tokenise, build, optimise, checkpoint

Each call to train() internally:

  1. Trains or loads the tokeniser on the batch's CSV data.
  2. Constructs the neural network (e.g., LSTM or Transformer) based on the config.
  3. Runs gradient-descent optimisation over the tokenised training sequences for the configured number of epochs.
  4. Saves model checkpoints to the batch's checkpoint directory.
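The four stages above can be sketched as a skeleton. Every function here is a hypothetical stand-in, not the library's internals; only the stage order and the config keys (`input_data_path`, `epochs`, `checkpoint_dir`) come from the description above.

```python
# Placeholder stand-ins so the sketch runs; each represents real work.
def fit_tokenizer(path):            return {"source": path}
def build_model(config, tokenizer): return {"units": config.get("rnn_units", 256)}
def run_epoch(model, tokenizer):    pass
def save_checkpoint(model, ckpt):   return f"{ckpt}/model.ckpt"

def train(config: dict) -> str:
    """Skeleton of a per-batch training call, mirroring the four stages."""
    tokenizer = fit_tokenizer(config["input_data_path"])      # 1. tokeniser
    model = build_model(config, tokenizer)                    # 2. network from config
    for _ in range(config["epochs"]):                         # 3. optimisation loop
        run_epoch(model, tokenizer)
    return save_checkpoint(model, config["checkpoint_dir"])  # 4. persist weights
```

Because each batch calls this function with its own isolated config, the checkpoints for batch i always land in batch i's directory.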

Related Pages

Implemented By
