Principle:Gretelai Gretel synthetics Batch Synthetic Generation

From Leeroopedia
Knowledge Sources
Domains: Synthetic_Data, Tabular_Data
Last Updated: 2026-02-14 19:00 GMT

Overview

Batch synthetic generation is the process of sampling new rows from each independently trained column-cluster model, applying per-batch and per-line validation, and accumulating the results in in-memory buffers for later reassembly.

Description

Once every batch model has been trained, the generation phase produces synthetic data by invoking the low-level generate_text() function for each batch. Generated lines are streamed through a validation pipeline and, if they pass, appended to the batch's in-memory StringIO buffer. Invalid lines are counted and optionally collected for debugging.

The generation process is governed by several interacting controls:

  • num_lines: The target number of valid lines to generate per batch. If not specified, the batch's gen_lines config parameter is used. When seed_fields is a list, num_lines is automatically set to the length of that list.
  • max_invalid: A hard cap on the number of invalid lines tolerated per batch. When the cap is exceeded, a TooManyInvalidError is raised if raise_on_exceed_invalid is set; otherwise the error is caught and generation for that batch stops early.
  • Validation: Each batch has a validator callable. If a custom validator was set via set_batch_validator(), that callable is used. Otherwise, a built-in validator checks that the number of delimiter-separated values in the generated line matches the number of headers for the batch.
  • Seed fields: An optional dictionary (or list of dictionaries) mapping column names to initial values. Seeds are validated to ensure they align with the first N columns of batch 0. For a list of seeds, a 1:1 ratio is enforced: each seed produces exactly one generated line.
  • Parallelism: The number of concurrent generation workers. A value of 0 means "use as many workers as CPUs"; a value of 1 disables parallelism.
  • Progress tracking: Two tqdm progress bars report valid and invalid line counts in real time.
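
The built-in validation rule described above (one value per header) can be sketched in a few lines. This is an illustration only, not the library's code: the name make_field_count_validator is hypothetical, and the plain split() used here assumes that field values never contain the delimiter.

```python
def make_field_count_validator(headers, delimiter=","):
    """Return a validator that accepts a line only if it has one value per header.

    Hypothetical helper for illustration; assumes values never contain the delimiter.
    """
    expected = len(headers)

    def validator(line):
        # Count delimiter-separated values and compare with the header count.
        return len(line.split(delimiter)) == expected

    return validator


validate = make_field_count_validator(["id", "name", "city"])
complete = validate("1,Ada,London")        # three values for three headers
truncated = validate("1,Ada")              # one value short
overlong = validate("1,Ada,London,extra")  # one value too many
```

A custom validator would have the same shape: a callable that receives the generated line and reports whether it is acceptable.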

Each generated line that passes validation is added to the batch via Batch.add_valid_data(), which writes the text into the gen_data_stream StringIO buffer and increments gen_data_count. The buffer is initialised (via reset_gen_data()) with a header row so that it can later be read back as a proper CSV.
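
The buffer behaviour in the previous paragraph can be mimicked with the standard library. The class below is a hypothetical stand-in (BatchBuffer is not a library name); it only borrows the attribute and method names mentioned above to show why writing the header row first lets the buffer be parsed back as CSV.

```python
import csv
import io


class BatchBuffer:
    """Hypothetical stand-in for the per-batch in-memory buffer."""

    def __init__(self, headers, delimiter=","):
        self.headers = list(headers)
        self.delimiter = delimiter
        self.reset_gen_data()

    def reset_gen_data(self):
        # Start a fresh buffer whose first row is the header.
        self.gen_data_stream = io.StringIO()
        self.gen_data_stream.write(self.delimiter.join(self.headers) + "\n")
        self.gen_data_count = 0

    def add_valid_data(self, text):
        # Append one validated line and increment the counter.
        self.gen_data_stream.write(text + "\n")
        self.gen_data_count += 1

    def rows(self):
        # Parse a copy of the buffer so the write position is left untouched.
        snapshot = io.StringIO(self.gen_data_stream.getvalue())
        return list(csv.DictReader(snapshot, delimiter=self.delimiter))


buf = BatchBuffer(["id", "name"])
buf.add_valid_data("1,Ada")
buf.add_valid_data("2,Grace")
records = buf.rows()
```

Because the header is already in the stream, any CSV reader can consume the buffer directly without being told the column names separately.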

A GenerationSummary object is returned for each batch, reporting valid_lines, invalid_lines, and a boolean is_valid indicating whether the target count was fully met.

Usage

Use batch synthetic generation after all models have been trained and before calling batches_to_df() to reassemble the synthetic DataFrame. You may generate for all batches at once with generate_all_batch_lines() or target individual batches with generate_batch_lines(idx).
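
A minimal sketch of that call order follows. It is duck-typed against any object exposing generate_all_batch_lines() and batches_to_df(); the wrapper name generate_then_reassemble is hypothetical, and the return shape of generate_all_batch_lines() (a mapping from batch index to summary) is an assumption made for this example. The stub class exists only so the sketch runs without trained models.

```python
from types import SimpleNamespace


def generate_then_reassemble(batcher, **gen_kwargs):
    """Generate for every batch, then reassemble only if every target was met."""
    # Assumption: the call returns {batch_index: summary_with_is_valid}.
    summaries = batcher.generate_all_batch_lines(**gen_kwargs)
    short = [idx for idx, summary in summaries.items() if not summary.is_valid]
    if short:
        raise RuntimeError(f"batches below target: {short}")
    return batcher.batches_to_df()


class _StubBatcher:
    """Stand-in so the sketch is runnable; not the library class."""

    def __init__(self, all_valid):
        self.all_valid = all_valid

    def generate_all_batch_lines(self, **kwargs):
        return {
            0: SimpleNamespace(is_valid=True),
            1: SimpleNamespace(is_valid=self.all_valid),
        }

    def batches_to_df(self):
        return "reassembled"


ok_df = generate_then_reassemble(_StubBatcher(all_valid=True))
failure = None
try:
    generate_then_reassemble(_StubBatcher(all_valid=False))
except RuntimeError as exc:
    failure = str(exc)
```

Checking is_valid before reassembly matters because a batch that fell short of its target has fewer rows than the others.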

Theoretical Basis

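Each trained model M_i defines a conditional probability distribution over token sequences, and generation samples from it auto-regressively, one token at a time, until a full line has been produced. The per-batch loop that consumes those samples can be summarised in pseudocode: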

function generate_batch_lines(batch, num_lines, max_invalid):
    reset batch buffer (write header row)
    valid = 0, invalid = 0
    for line in generate_text(batch.config, validator, max_invalid, num_lines):
        if line.valid:
            append line.text to batch.gen_data_stream
            valid += 1
        else:
            record as invalid
            invalid += 1
        if invalid > max_invalid:
            raise TooManyInvalidError or return early
    return GenerationSummary(valid, invalid, valid >= num_lines)

Validation acts as rejection sampling: lines are drawn from the model and accepted only if they pass the validator. The max_invalid threshold prevents infinite loops when a model produces mostly invalid output.
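
The rejection-sampling view can be made concrete with a toy model. In the sketch below the classes TooManyInvalidError and GenerationSummary are local stand-ins that reuse the names from the text, and rejection_sample and toy_draw are hypothetical; a real model would replace the draw callable.

```python
import random
from dataclasses import dataclass


class TooManyInvalidError(RuntimeError):
    """Local stand-in for the error named in the text."""


@dataclass
class GenerationSummary:
    valid_lines: int = 0
    invalid_lines: int = 0
    is_valid: bool = False


def rejection_sample(draw, validator, num_lines, max_invalid,
                     raise_on_exceed_invalid=False):
    """Draw candidates until num_lines pass validation or max_invalid is exceeded."""
    accepted, invalid = [], 0
    while len(accepted) < num_lines:
        line = draw()
        if validator(line):
            accepted.append(line)
            continue
        invalid += 1
        if invalid > max_invalid:
            if raise_on_exceed_invalid:
                raise TooManyInvalidError(f"{invalid} invalid lines")
            break  # give up on this batch and report the shortfall
    summary = GenerationSummary(len(accepted), invalid, len(accepted) >= num_lines)
    return accepted, summary


rng = random.Random(0)


def toy_draw():
    # Toy "model": emits a well-formed two-field line about 70% of the time.
    return "1,a" if rng.random() < 0.7 else "broken"


def two_fields(line):
    return len(line.split(",")) == 2


lines, summary = rejection_sample(toy_draw, two_fields, num_lines=5, max_invalid=100)
_, starved = rejection_sample(lambda: "broken", two_fields, num_lines=5, max_invalid=3)
```

The second call shows the role of the cap: a source that never produces a valid line would loop forever without it, whereas here the loop stops and is_valid comes back false.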

Seeded generation fixes the beginning of each record to a user-specified prefix, so the model samples only the remaining fields conditioned on that prefix. This is useful for conditional generation (e.g., generating records for a given customer ID).
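
One way to picture the seed handling is a helper that checks the seed against the leading headers and then builds the prefix string. The function build_seed_prefix is hypothetical, and the trailing delimiter (so that generation continues with the next field) is an assumption of this sketch rather than a documented detail.

```python
def build_seed_prefix(seed_fields, headers, delimiter=","):
    """Check that the seed covers the first N headers in order, then build the prefix."""
    keys = list(seed_fields)
    if not keys:
        raise ValueError("seed must name at least one column")
    if keys != list(headers[: len(keys)]):
        raise ValueError(
            f"seed columns {keys} must match the first {len(keys)} headers"
        )
    # Assumed convention: end with the delimiter so the next field follows directly.
    return delimiter.join(str(seed_fields[k]) for k in keys) + delimiter


headers = ["customer_id", "region", "amount"]
prefix = build_seed_prefix({"customer_id": 42, "region": "EU"}, headers)

# A list of seeds yields one prefix per seed, so the line target equals its length.
seeds = [{"customer_id": 1}, {"customer_id": 2}]
prefixes = [build_seed_prefix(seed, headers) for seed in seeds]
num_lines = len(prefixes)
```

A seed such as {"region": "EU"} would be rejected here because it skips the first column, which mirrors the alignment rule for batch 0.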

Parallelism distributes independent sampling across multiple worker processes, each feeding into a shared iterator that the caller consumes sequentially.
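
The fan-out and fan-in pattern can be sketched with the standard library. Everything here is illustrative: resolve_workers and parallel_lines are hypothetical names, the even split of the line target is an assumption, and a thread pool from concurrent.futures is used only for brevity, standing in for whatever worker type the library uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_workers(parallelism):
    """Map the parallelism setting to a worker count (0 means one per CPU)."""
    if parallelism == 0:
        return os.cpu_count() or 1
    return max(1, int(parallelism))


def parallel_lines(make_chunk, num_lines, parallelism=0):
    """Yield lines produced by several workers through a single iterator."""
    workers = min(resolve_workers(parallelism), num_lines) or 1
    if workers == 1:
        # A single worker means no pool at all: sample in the calling thread.
        yield from make_chunk(num_lines)
        return
    # Split the target as evenly as possible across the workers.
    base, extra = divmod(num_lines, workers)
    sizes = [base + (1 if i < extra else 0) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(make_chunk, n) for n in sizes if n]
        for future in as_completed(futures):
            yield from future.result()


def toy_chunk(n):
    # Stand-in for model sampling: returns n placeholder lines.
    return ["line"] * n


out = list(parallel_lines(toy_chunk, 10, parallelism=4))
serial = list(parallel_lines(toy_chunk, 3, parallelism=1))
```

The caller sees one stream of lines regardless of the worker count, so validation and buffering code does not need to know whether sampling ran in parallel.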
