Principle: Gretel Synthetics Batch Synthetic Generation
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Batch synthetic generation is the process of sampling new rows from each independently trained column-cluster model, applying per-batch and per-line validation, and accumulating the results in in-memory buffers for later reassembly.
Description
Once every batch model has been trained, the generation phase produces synthetic data by invoking the low-level generate_text() function for each batch. Generated lines are streamed through a validation pipeline and, if they pass, appended to the batch's in-memory StringIO buffer. Invalid lines are counted and optionally collected for debugging.
The generation process is governed by several interacting controls:
- num_lines — The target number of valid lines to generate per batch. If not specified, the value from the batch's gen_lines config parameter is used. When seed_fields is a list, num_lines is automatically set to the length of that list.
- max_invalid — A hard cap on the number of invalid lines tolerated per batch. If exceeded, a TooManyInvalidError is raised; when raise_on_exceed_invalid is disabled, the error is caught silently and generation for that batch stops early.
- Validation — Each batch has a validator callable. If a custom validator was set via set_batch_validator(), that callable is used. Otherwise, a built-in validator checks that the number of delimiter-separated values in the generated line matches the number of headers for the batch.
- Seed fields — An optional dictionary (or list of dictionaries) mapping column names to initial values. Seeds are validated to ensure they align with the first N columns of batch 0. For a list of seeds, a 1:1 ratio is enforced: each seed produces exactly one generated line.
- Parallelism — The number of concurrent generation workers. A value of 0 means "use as many workers as CPUs"; a value of 1 disables parallelism.
- Progress tracking — Two tqdm progress bars report valid and invalid line counts in real time.
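The built-in header-count check described in the validation bullet above can be sketched as follows. The function name and signature here are illustrative, not the library's internal API; only the rule itself (delimited field count must equal the batch's header count) comes from the description.

```python
def default_validator(line: str, headers: list, delimiter: str = ",") -> bool:
    """Accept a generated line only if its number of delimiter-separated
    values matches the number of headers for the batch (illustrative sketch)."""
    return len(line.split(delimiter)) == len(headers)

headers = ["age", "income", "zip"]
default_validator("34,55000,94110", headers)  # accepted: 3 fields, 3 headers
default_validator("34,55000", headers)        # rejected: only 2 fields
```

A custom validator registered via set_batch_validator() would simply replace this callable with any function of the same shape (line in, pass/fail out).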
Each generated line that passes validation is added to the batch via Batch.add_valid_data(), which writes the text into the gen_data_stream StringIO buffer and increments gen_data_count. The buffer is initialised (via reset_gen_data()) with a header row so that it can later be read back as a proper CSV.
A GenerationSummary object is returned for each batch, reporting valid_lines, invalid_lines, and a boolean is_valid indicating whether the target count was fully met.
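The buffer and summary mechanics can be modeled with a toy reconstruction. The class and method names below mirror the description (gen_data_stream, reset_gen_data, add_valid_data, GenerationSummary) but this is a simplified sketch, not the library source:

```python
import io
from dataclasses import dataclass

@dataclass
class GenerationSummary:
    valid_lines: int
    invalid_lines: int
    is_valid: bool  # True when the target line count was fully met

class Batch:
    def __init__(self, headers, delimiter=","):
        self.headers = headers
        self.delimiter = delimiter
        self.reset_gen_data()

    def reset_gen_data(self):
        # Re-initialise the buffer with a header row so it can later
        # be read back as a proper CSV.
        self.gen_data_stream = io.StringIO()
        self.gen_data_stream.write(self.delimiter.join(self.headers) + "\n")
        self.gen_data_count = 0

    def add_valid_data(self, text: str):
        # Append one validated line and bump the counter.
        self.gen_data_stream.write(text + "\n")
        self.gen_data_count += 1

batch = Batch(["age", "income"])
batch.add_valid_data("34,55000")
# batch.gen_data_stream.getvalue() == "age,income\n34,55000\n"
```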
Usage
Use batch synthetic generation after all models have been trained and before calling batches_to_df() to reassemble the synthetic DataFrame. You may generate for all batches at once with generate_all_batch_lines() or target individual batches with generate_batch_lines(idx).
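The end-to-end flow can be illustrated with a toy reconstruction that skips model training entirely. The reassembly step mimics what batches_to_df() does (horizontal concatenation of the per-batch column clusters), but the code is a self-contained sketch using pre-filled buffers, not the library implementation:

```python
import csv
import io

# Two "batches", each holding a CSV buffer for its column cluster.
# The rows stand in for model-generated, already-validated lines.
buffers = {
    0: io.StringIO("age,income\n34,55000\n29,48000\n"),
    1: io.StringIO("zip,state\n94110,CA\n10001,NY\n"),
}

def batches_to_rows(buffers):
    """Reassemble column clusters by splicing rows side by side
    (a sketch of the batches_to_df() idea, minus the DataFrame)."""
    readers = [list(csv.reader(buffers[i])) for i in sorted(buffers)]
    # Zip row-wise across batches and concatenate each row's clusters.
    return [sum(rows, []) for rows in zip(*readers)]

rows = batches_to_rows(buffers)
# rows[0] is the combined header: ['age', 'income', 'zip', 'state']
```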
Theoretical Basis
Each trained model M_i defines a conditional probability distribution over token sequences. Generation samples from this distribution auto-regressively:
    function generate_batch_lines(batch, num_lines, max_invalid):
        reset batch buffer (write header row)
        valid = 0, invalid = 0
        for line in generate_text(batch.config, validator, max_invalid, num_lines):
            if line.valid:
                append line.text to batch.gen_data_stream
                valid += 1
            else:
                record as invalid
                invalid += 1
                if invalid > max_invalid:
                    raise TooManyInvalidError or return early
        return GenerationSummary(valid, invalid, valid >= num_lines)
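The pseudocode above translates into a minimal runnable sketch. Here generate_text is replaced by a plain iterable of pre-labelled lines so the example is self-contained; the real function samples from a trained model config:

```python
from dataclasses import dataclass

@dataclass
class GenText:
    text: str
    valid: bool

class TooManyInvalidError(Exception):
    pass

def generate_batch_lines(lines, num_lines, max_invalid, raise_on_exceed=True):
    """Accumulate valid lines until num_lines is reached; count invalid
    lines and stop (or raise) once max_invalid is exceeded."""
    buffer, valid, invalid = [], 0, 0
    for line in lines:
        if valid >= num_lines:
            break
        if line.valid:
            buffer.append(line.text)
            valid += 1
        else:
            invalid += 1
            if invalid > max_invalid:
                if raise_on_exceed:
                    raise TooManyInvalidError(f"{invalid} invalid lines")
                break
    return buffer, (valid, invalid, valid >= num_lines)

stream = [GenText("1,2", True), GenText("bad", False), GenText("3,4", True)]
buf, summary = generate_batch_lines(stream, num_lines=2, max_invalid=5)
# buf == ["1,2", "3,4"]; summary == (2, 1, True)
```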
Validation acts as rejection sampling: lines are drawn from the model and accepted only if they pass the validator. The max_invalid threshold prevents infinite loops when a model produces mostly invalid output.
Seeded generation constrains the model's first tokens to a user-specified prefix, steering the distribution toward specific record prefixes. This is useful for conditional generation (e.g., generating records for a given customer ID).
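Seeded generation can be sketched as prefixing each record with the seed's values and letting the model complete the rest. The completion callable below is a deterministic stand-in for model sampling; the 1:1 ratio (one generated line per seed in the list) follows the rule stated above:

```python
def seeded_lines(seeds, complete_record):
    """Sketch of seeded generation: each seed dict fixes the leading
    column values; complete_record stands in for model sampling of
    the remaining columns. One output line per seed (1:1 ratio)."""
    out = []
    for seed in seeds:
        prefix = ",".join(str(v) for v in seed.values())
        out.append(prefix + "," + complete_record(seed))
    return out

fake_model = lambda seed: "completed,fields"  # hypothetical completion
lines = seeded_lines([{"customer_id": 101}, {"customer_id": 102}], fake_model)
# lines == ["101,completed,fields", "102,completed,fields"]
```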
Parallelism distributes independent sampling across multiple worker threads, each feeding into a shared iterator that the caller consumes sequentially.
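This worker pattern can be sketched with a thread pool whose independent sampling runs feed a single iterator that the caller drains sequentially. The chunking and function names are illustrative, not the library's worker implementation; passing None for max_workers approximates the "0 means use the available CPUs" convention:

```python
from concurrent.futures import ThreadPoolExecutor

def sample_chunk(n):
    """Stand-in for one worker's independent sampling run."""
    return [f"row-{n}-{i}" for i in range(2)]

def parallel_generate(num_workers, chunks):
    # Workers sample independently; the caller consumes the merged
    # results sequentially, mirroring the shared-iterator pattern.
    with ThreadPoolExecutor(max_workers=num_workers or None) as pool:
        for chunk in pool.map(sample_chunk, range(chunks)):
            yield from chunk

lines = list(parallel_generate(num_workers=2, chunks=3))
# 3 chunks x 2 rows each = 6 lines, in chunk order (map preserves order)
```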