Implementation:Gretelai Gretel synthetics DataFrameBatch Generate All Batch Lines
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Concrete tool, provided by the gretel-synthetics library, for generating synthetic lines from every trained batch model.
Description
DataFrameBatch.generate_all_batch_lines() iterates over every batch index in the batches dictionary and delegates to generate_batch_lines() for each one, passing through the max_invalid, num_lines, seed_fields, and parallelism parameters. It collects the returned GenerationSummary objects into a dictionary keyed by batch index.
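The delegation described above can be illustrated with a toy model. This is a minimal sketch, not the library's code: `ToyBatcher` and `Summary` are hypothetical stand-ins for DataFrameBatch and GenerationSummary, and the per-batch method simply pretends that every requested line was valid.

```python
from typing import Dict, NamedTuple, Optional


class Summary(NamedTuple):
    """Hypothetical stand-in for GenerationSummary."""
    valid_lines: int
    invalid_lines: int
    is_valid: bool


class ToyBatcher:
    """Hypothetical stand-in for DataFrameBatch; not the library class."""

    def __init__(self, batch_count: int, default_lines: int = 10):
        # Mirrors the idea of a `batches` dictionary keyed by batch index.
        self.batches = {idx: f"batch-{idx}" for idx in range(batch_count)}
        self.default_lines = default_lines

    def generate_batch_lines(
        self, batch_idx: int, num_lines: Optional[int] = None
    ) -> Summary:
        # Pretend every requested line was generated and passed validation.
        target = num_lines if num_lines is not None else self.default_lines
        return Summary(valid_lines=target, invalid_lines=0, is_valid=True)

    def generate_all_batch_lines(
        self, num_lines: Optional[int] = None
    ) -> Dict[int, Summary]:
        # Delegate once per batch index and key each result by that index.
        return {
            idx: self.generate_batch_lines(idx, num_lines=num_lines)
            for idx in self.batches
        }


status = ToyBatcher(batch_count=3).generate_all_batch_lines(num_lines=5)
```

The returned dictionary has one entry per batch index, which is what lets a caller inspect or retry individual batches afterwards.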
DataFrameBatch.generate_batch_lines(batch_idx) is the per-batch generation workhorse. It:
- Looks up the Batch object for the given index.
- If batch_idx is 0 and seed_fields is provided, validates seed values against the batch headers.
- Resets the batch's generation buffer via reset_gen_data().
- Retrieves the batch's validator (custom or built-in column-count checker).
- Determines the target line count from num_lines or the batch config's gen_lines.
- Creates two tqdm progress bars for valid and invalid counts.
- Iterates over the generate_text() generator, appending valid lines to the batch buffer and tracking invalid lines.
- Catches TooManyInvalidError; either re-raises it (if raise_on_exceed_invalid is True) or returns the partial summary.
- Returns a GenerationSummary with valid_lines, invalid_lines, and is_valid.
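The control flow in the steps above can be sketched without a trained model. Everything named here is an assumption made for illustration: `run_generation_loop`, `LoopSummary`, `TooManyInvalid`, and `three_columns` are hypothetical, the candidate lines come from a plain list instead of generate_text(), and the exact point at which the invalid limit trips is a guess. In this sketch the error is raised inside the loop, whereas the page describes it being caught from the generator.

```python
import io
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


class TooManyInvalid(RuntimeError):
    """Hypothetical stand-in for the library's TooManyInvalidError."""


@dataclass
class LoopSummary:
    """Hypothetical stand-in for GenerationSummary."""
    valid_lines: int = 0
    invalid_lines: int = 0
    is_valid: bool = False


def run_generation_loop(
    candidates: Iterable[str],
    validator: Callable[[str], bool],
    target: int,
    max_invalid: int,
    raise_on_exceed_invalid: bool = False,
) -> Tuple[LoopSummary, io.StringIO]:
    buffer = io.StringIO()  # fresh buffer, analogous to resetting before a run
    summary = LoopSummary()
    try:
        for line in candidates:
            if validator(line):
                buffer.write(line + "\n")
                summary.valid_lines += 1
                if summary.valid_lines >= target:
                    break
            else:
                summary.invalid_lines += 1
                # Assumption: the limit trips once the count passes max_invalid.
                if summary.invalid_lines > max_invalid:
                    raise TooManyInvalid(f"more than {max_invalid} invalid lines")
    except TooManyInvalid:
        if raise_on_exceed_invalid:
            raise
        return summary, buffer  # partial result; is_valid stays False
    summary.is_valid = summary.valid_lines >= target
    return summary, buffer


def three_columns(line: str) -> bool:
    # Simplified column-count check for a 3-column, comma-delimited batch.
    return len(line.split(",")) == 3


raw = ["a,b,c", "bad", "d,e,f", "x", "g,h,i"]
ok_summary, ok_buffer = run_generation_loop(raw, three_columns, target=2, max_invalid=5)
partial_summary, _ = run_generation_loop(raw, three_columns, target=3, max_invalid=1)
```

The second call exceeds its invalid budget before reaching the target, so it returns a partial summary with `is_valid` left False instead of raising.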
Usage
Call generate_all_batch_lines() after training to generate synthetic data for all batches at once. Call generate_batch_lines(idx) to generate for a single batch, which is useful when experimenting with individual batch settings or rerunning a failed batch.
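One way to rerun only the failed batches is sketched below. `rerun_failed_batches` is a hypothetical helper, not part of gretel-synthetics; it relies only on the `generate_batch_lines(batch_idx, ...)` call and the `is_valid` field described on this page. `_StubBatcher` exists solely so the sketch runs without a trained model.

```python
from types import SimpleNamespace


def rerun_failed_batches(batcher, status, **gen_kwargs):
    """Retry every batch whose summary is not valid; return an updated copy."""
    updated = dict(status)
    for idx, summary in status.items():
        if not summary.is_valid:
            updated[idx] = batcher.generate_batch_lines(idx, **gen_kwargs)
    return updated


class _StubBatcher:
    """Hypothetical stand-in that records which batches were retried."""

    def __init__(self):
        self.calls = []

    def generate_batch_lines(self, batch_idx, **kwargs):
        self.calls.append(batch_idx)
        return SimpleNamespace(
            valid_lines=kwargs.get("num_lines", 0), invalid_lines=0, is_valid=True
        )


stub = _StubBatcher()
first_pass = {
    0: SimpleNamespace(valid_lines=100, invalid_lines=3, is_valid=True),
    1: SimpleNamespace(valid_lines=40, invalid_lines=1000, is_valid=False),
}
second_pass = rerun_failed_batches(stub, first_pass, num_lines=100, max_invalid=2000)
```

Because generate_batch_lines() resets the batch's generation buffer before it runs, a retry replaces that batch's earlier partial output.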
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py
- Lines: 1329-1390 (generate_all_batch_lines), 1216-1302 (generate_batch_lines)
Signature
def generate_all_batch_lines(
    self,
    max_invalid=MAX_INVALID,
    raise_on_failed_batch: bool = False,
    num_lines: int = None,
    seed_fields: Union[dict, List[dict]] = None,
    parallelism: int = 0,
) -> Dict[int, GenerationSummary]:

def generate_batch_lines(
    self,
    batch_idx: int,
    max_invalid=MAX_INVALID,
    raise_on_exceed_invalid: bool = False,
    num_lines: int = None,
    seed_fields: Union[dict, List[dict]] = None,
    parallelism: int = 0,
) -> GenerationSummary:
Import
from gretel_synthetics.batch import DataFrameBatch
I/O Contract
Inputs
generate_all_batch_lines:
| Name | Type | Required | Description |
|---|---|---|---|
| max_invalid | int | No (default 1000) | Max invalid lines tolerated per batch before stopping. |
| raise_on_failed_batch | bool | No (default False) | If True, re-raise TooManyInvalidError on batch failure; otherwise return partial results. |
| num_lines | int | No | Target valid lines per batch. Defaults to config.gen_lines. Overridden by len(seed_fields) when seed_fields is a list. |
| seed_fields | Union[dict, List[dict]] | No | Seed values for conditional generation on batch 0. |
| parallelism | int | No (default 0) | Number of concurrent workers; 0 means number of CPUs. |
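The defaulting rules in the table above can be written out as two small helpers. Both functions are hypothetical and only restate the table: `resolve_target_lines` applies the stated precedence for num_lines, and `resolve_workers` maps 0 to the CPU count. Treating any other parallelism value as a literal worker count is an assumption of this sketch.

```python
import os
from typing import List, Optional, Union


def resolve_target_lines(
    num_lines: Optional[int],
    seed_fields: Optional[Union[dict, List[dict]]],
    config_gen_lines: int,
) -> int:
    # A list of seeds fixes the target at one line per seed.
    if isinstance(seed_fields, list):
        return len(seed_fields)
    # Otherwise an explicit num_lines wins over the config default.
    if num_lines is not None:
        return num_lines
    return config_gen_lines


def resolve_workers(parallelism: int) -> int:
    # 0 means "use as many workers as there are CPUs".
    if parallelism == 0:
        return os.cpu_count() or 1
    # Assumption: other values are taken as the worker count directly.
    return parallelism


seeded_target = resolve_target_lines(500, [{"region": "US"}, {"region": "EU"}], 1000)
explicit_target = resolve_target_lines(500, None, 1000)
default_target = resolve_target_lines(None, {"region": "US"}, 1000)
```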
generate_batch_lines:
| Name | Type | Required | Description |
|---|---|---|---|
| batch_idx | int | Yes | Index of the batch to generate lines for. |
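| max_invalid | int | No (default 1000) | Max invalid lines tolerated before stopping. |
| raise_on_exceed_invalid | bool | No (default False) | If True, re-raise TooManyInvalidError when max_invalid is exceeded; otherwise return the partial summary. |
| num_lines | int | No | Target number of valid lines. Defaults to the batch config's gen_lines. |
| seed_fields | Union[dict, List[dict]] | No | Seed values for conditional generation. Validated against the batch headers when batch_idx is 0. |
| parallelism | int | No (default 0) | Number of concurrent workers; 0 means number of CPUs. |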
Outputs
| Name | Type | Description |
|---|---|---|
| generate_all_batch_lines return | Dict[int, GenerationSummary] | Maps each batch index to its GenerationSummary (valid_lines, invalid_lines, is_valid). |
| generate_batch_lines return | GenerationSummary | Summary for a single batch: valid_lines (int), invalid_lines (int), is_valid (bool). |
| (side effect) batch.gen_data_stream | io.StringIO | Each batch's in-memory buffer is populated with generated CSV rows. |
| (side effect) batch.gen_data_count | int | Updated count of valid generated lines per batch. |
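Since the side-effect buffer is an io.StringIO holding CSV rows, it can be read back with the standard library. The sketch below is illustrative only: `buffer_to_records` is a hypothetical helper, the comma delimiter and the absence of a header row in the buffer are assumptions, and the sample stream is built by hand, not taken from a real batch.

```python
import csv
import io
from typing import Dict, List


def buffer_to_records(
    gen_data_stream: io.StringIO, headers: List[str], delimiter: str = ","
) -> List[Dict[str, str]]:
    # Rewind first: after writing, the stream position sits at the end.
    gen_data_stream.seek(0)
    reader = csv.reader(gen_data_stream, delimiter=delimiter)
    # Pair each row with the batch headers, skipping blank lines.
    return [dict(zip(headers, row)) for row in reader if row]


# Hand-built buffer standing in for batch.gen_data_stream.
sample_stream = io.StringIO()
sample_stream.write("C001,US\n")
sample_stream.write("C002,EU\n")

records = buffer_to_records(sample_stream, ["customer_id", "region"])
```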
Usage Examples
Basic Example: Generate All
from gretel_synthetics.batch import DataFrameBatch
# Assume batcher is already trained
status = batcher.generate_all_batch_lines(num_lines=500)
for idx, summary in status.items():
    print(f"Batch {idx}: {summary.valid_lines} valid, "
          f"{summary.invalid_lines} invalid, "
          f"is_valid={summary.is_valid}")
Seeded Generation
# Generate with seed values for the first batch
seeds = [
    {"customer_id": "C001", "region": "US"},
    {"customer_id": "C002", "region": "EU"},
]
status = batcher.generate_all_batch_lines(seed_fields=seeds)
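Seed values are validated against the first batch's headers before generation. The library's actual check is not reproduced here. As a rough illustration, a hypothetical `check_seeds` helper might accept either a dict or a list of dicts and reject any seed field that is not a known header. The rule it applies is an assumption of this sketch.

```python
from typing import Dict, List, Union


def check_seeds(
    seed_fields: Union[Dict[str, str], List[Dict[str, str]]],
    headers: List[str],
) -> List[Dict[str, str]]:
    # Accept both forms of seed_fields by working on a list internally.
    seeds = [seed_fields] if isinstance(seed_fields, dict) else list(seed_fields)
    for pos, seed in enumerate(seeds):
        # Assumed rule: every seed key must be a column of the first batch.
        unknown = [key for key in seed if key not in headers]
        if unknown:
            raise ValueError(f"seed {pos} has fields not in batch headers: {unknown}")
    return seeds


batch_zero_headers = ["customer_id", "region", "signup_date"]  # illustrative headers
checked = check_seeds({"customer_id": "C001", "region": "US"}, batch_zero_headers)
```

Running a check like this before calling generate_all_batch_lines() surfaces a misspelled seed field early, before any generation work starts.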
Single Batch Generation
summary = batcher.generate_batch_lines(
    batch_idx=0,
    num_lines=100,
    max_invalid=500,
    raise_on_exceed_invalid=True,
)
print(f"Generated {summary.valid_lines} valid lines for batch 0")