Implementation:Gretelai Gretel synthetics DataFrameBatch Generate All Batch Lines

Knowledge Sources
Domains Synthetic_Data, Tabular_Data
Last Updated 2026-02-14 19:00 GMT

Overview

A concrete tool, provided by the gretel-synthetics library, for generating synthetic lines from all trained batch models.

Description

DataFrameBatch.generate_all_batch_lines() iterates over every batch index in the batches dictionary and delegates to generate_batch_lines() for each one. It passes through the max_invalid, num_lines, seed_fields, and parallelism parameters unchanged and forwards raise_on_failed_batch as raise_on_exceed_invalid. It collects the returned GenerationSummary objects into a dictionary keyed by batch index.
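
The delegation described above can be pictured with a minimal sketch. It does not import gretel-synthetics: Summary, generate_all_sketch, and fake_generate_one are hypothetical stand-ins, and only the shape of the loop (one call per batch index, results keyed by that index) is taken from the description.

```python
from collections import namedtuple
from typing import Callable, Dict

# Hypothetical stand-in for GenerationSummary; field names follow this page.
Summary = namedtuple("Summary", ["valid_lines", "invalid_lines", "is_valid"])


def generate_all_sketch(batches: Dict[int, object],
                        generate_one: Callable[..., Summary],
                        **gen_kwargs) -> Dict[int, Summary]:
    """Call generate_one once per batch index and key the results by index."""
    status = {}
    for idx in batches:
        # The same keyword arguments are passed through to every batch.
        status[idx] = generate_one(idx, **gen_kwargs)
    return status


def fake_generate_one(batch_idx: int, num_lines: int = 10, **_unused) -> Summary:
    # Hypothetical per-batch generator that pretends every line was valid.
    return Summary(valid_lines=num_lines, invalid_lines=0, is_valid=True)


status = generate_all_sketch({0: "batch-0", 1: "batch-1"},
                             fake_generate_one, num_lines=5, max_invalid=100)
print(status[1])
```

Because every batch receives the same arguments, per-batch tuning is better done by calling the single-batch method directly.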

DataFrameBatch.generate_batch_lines(batch_idx) is the per-batch generation workhorse. It:

  1. Looks up the Batch object for the given index.
  2. If batch_idx is 0 and seed_fields is provided, validates seed values against the batch headers.
  3. Resets the batch's generation buffer via reset_gen_data().
  4. Retrieves the batch's validator (custom or built-in column-count checker).
  5. Determines the target line count from num_lines or the batch config's gen_lines.
  6. Creates two tqdm progress bars for valid and invalid counts.
  7. Iterates over the generate_text() generator, appending valid lines to the batch buffer and tracking invalid lines.
  8. Catches TooManyInvalidError; either re-raises it (if raise_on_exceed_invalid is True) or returns the partial summary.
  9. Returns a GenerationSummary with valid_lines, invalid_lines, and is_valid.
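
The control flow of steps 3 and 5 through 9 can be sketched without the library. Everything below is a hypothetical stand-in: TooManyInvalidSketch plays the role of TooManyInvalidError, checked_lines plays the role of generate_text(), and the rule used to set is_valid (target reached) is an assumption. Progress bars are omitted, and the sketch returns the buffer alongside the summary only so the result can be inspected.

```python
import io
from dataclasses import dataclass


class TooManyInvalidSketch(RuntimeError):
    """Hypothetical stand-in for TooManyInvalidError."""


@dataclass
class SummarySketch:
    valid_lines: int = 0
    invalid_lines: int = 0
    is_valid: bool = False


def checked_lines(lines, validator, max_invalid):
    """Yield (line, ok) pairs; raise once more than max_invalid lines fail."""
    invalid = 0
    for line in lines:
        ok = validator(line)
        if not ok:
            invalid += 1
            if invalid > max_invalid:
                raise TooManyInvalidSketch(f"more than {max_invalid} invalid lines")
        yield line, ok


def generate_batch_sketch(lines, validator, num_lines, max_invalid,
                          raise_on_exceed_invalid=False):
    buffer = io.StringIO()          # fresh buffer, as after a reset
    summary = SummarySketch()
    try:
        for line, ok in checked_lines(lines, validator, max_invalid):
            if ok:
                buffer.write(line + "\n")   # keep valid lines
                summary.valid_lines += 1
            else:
                summary.invalid_lines += 1  # only count invalid lines
            if summary.valid_lines >= num_lines:
                break                       # target reached
    except TooManyInvalidSketch:
        if raise_on_exceed_invalid:
            raise
        return summary, buffer              # partial result, is_valid stays False
    summary.is_valid = summary.valid_lines >= num_lines
    return summary, buffer


def three_fields(line):
    # Toy column-count check: exactly three comma-separated fields.
    return line.count(",") == 2


lines = ["a,b,c", "bad", "d,e,f", "x", "g,h,i"]
ok_summary, ok_buffer = generate_batch_sketch(lines, three_fields,
                                              num_lines=2, max_invalid=5)
partial, _ = generate_batch_sketch(lines, three_fields,
                                   num_lines=3, max_invalid=1)
```

In the second call the invalid budget is exhausted before the third valid line appears, so the caller receives a partial summary instead of an exception.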

Usage

Call generate_all_batch_lines() after training to generate synthetic data for all batches at once. Call generate_batch_lines(idx) to generate for a single batch, which is useful when experimenting with individual batch settings or rerunning a failed batch.
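
One way to rerun only the failed batches is sketched below. The helper rerun_failed_batches and the StubBatcher class are hypothetical; the sketch relies only on the generate_batch_lines(idx, ...) call and the is_valid field described on this page, and the stub exists so the example runs without a trained model.

```python
from types import SimpleNamespace


def rerun_failed_batches(batcher, status, **gen_kwargs):
    """Regenerate every batch whose summary is not valid; update status in place."""
    for idx, summary in list(status.items()):
        if not summary.is_valid:
            status[idx] = batcher.generate_batch_lines(idx, **gen_kwargs)
    return status


class StubBatcher:
    """Hypothetical stand-in that records calls instead of generating text."""

    def __init__(self):
        self.calls = []

    def generate_batch_lines(self, batch_idx, **kwargs):
        self.calls.append((batch_idx, kwargs))
        return SimpleNamespace(valid_lines=kwargs.get("num_lines", 0),
                               invalid_lines=0, is_valid=True)


# Pretend batch 1 failed during a previous generate_all_batch_lines() run.
status = {
    0: SimpleNamespace(valid_lines=100, invalid_lines=3, is_valid=True),
    1: SimpleNamespace(valid_lines=40, invalid_lines=1000, is_valid=False),
}
stub = StubBatcher()
status = rerun_failed_batches(stub, status, num_lines=100, max_invalid=5000)
```

With a real, trained DataFrameBatch the same helper could be passed the batcher directly, since it only uses the method and field named above.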

Code Reference

Source Location

  • Repository: gretel-synthetics
  • File: src/gretel_synthetics/batch.py
  • Lines: 1329-1390 (generate_all_batch_lines), 1216-1302 (generate_batch_lines)

Signature

def generate_all_batch_lines(
    self,
    max_invalid=MAX_INVALID,
    raise_on_failed_batch: bool = False,
    num_lines: int = None,
    seed_fields: Union[dict, List[dict]] = None,
    parallelism: int = 0,
) -> Dict[int, GenerationSummary]:

def generate_batch_lines(
    self,
    batch_idx: int,
    max_invalid=MAX_INVALID,
    raise_on_exceed_invalid: bool = False,
    num_lines: int = None,
    seed_fields: Union[dict, List[dict]] = None,
    parallelism: int = 0,
) -> GenerationSummary:

Import

from gretel_synthetics.batch import DataFrameBatch

I/O Contract

Inputs

generate_all_batch_lines:

Name | Type | Required | Description
---- | ---- | -------- | -----------
max_invalid | int | No (default 1000) | Max invalid lines tolerated per batch before stopping.
raise_on_failed_batch | bool | No (default False) | If True, re-raise TooManyInvalidError on batch failure; otherwise return partial results.
num_lines | int | No | Target valid lines per batch. Defaults to config.gen_lines. Overridden by len(seed_fields) when seed_fields is a list.
seed_fields | Union[dict, List[dict]] | No | Seed values for conditional generation on batch 0.
parallelism | int | No (default 0) | Number of concurrent workers; 0 means number of CPUs.

generate_batch_lines:

Name | Type | Required | Description
---- | ---- | -------- | -----------
batch_idx | int | Yes | Index of the batch to generate lines for.
max_invalid | int | No (default 1000) | Max invalid lines tolerated.
raise_on_exceed_invalid | bool | No (default False) | Whether to re-raise on exceeding max_invalid.
num_lines | int | No | Target number of valid lines.
seed_fields | Union[dict, List[dict]] | No | Seed values for conditional generation.
parallelism | int | No (default 0) | Number of concurrent workers.

Outputs

Name | Type | Description
---- | ---- | -----------
generate_all_batch_lines return | Dict[int, GenerationSummary] | Maps each batch index to its GenerationSummary (valid_lines, invalid_lines, is_valid).
generate_batch_lines return | GenerationSummary | Summary for a single batch: valid_lines (int), invalid_lines (int), is_valid (bool).
(side effect) batch.gen_data_stream | io.StringIO | Each batch's in-memory buffer is populated with generated CSV rows.
(side effect) batch.gen_data_count | int | Updated count of valid generated lines per batch.
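
Since the side-effect buffer is an io.StringIO of delimited rows, it can be read back with the standard csv module. The sketch below uses a hand-made buffer rather than a real batch; rows_from_buffer is a hypothetical helper, and it assumes the buffer holds data rows without a header line and uses a comma delimiter.

```python
import csv
import io
from typing import Dict, List


def rows_from_buffer(buffer: io.StringIO, headers: List[str],
                     delimiter: str = ",") -> List[Dict[str, str]]:
    """Rewind the buffer and pair each row's values with the given headers."""
    buffer.seek(0)  # the write position is at the end after generation
    reader = csv.reader(buffer, delimiter=delimiter)
    return [dict(zip(headers, row)) for row in reader if row]


# Hand-made stand-in for a populated generation buffer.
demo_buffer = io.StringIO()
demo_buffer.write("C001,US\n")
demo_buffer.write("C002,EU\n")

rows = rows_from_buffer(demo_buffer, headers=["customer_id", "region"])
```

The seek(0) call matters: reading a StringIO that has just been written to, without rewinding, yields no rows.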

Usage Examples

Basic Example: Generate All

from gretel_synthetics.batch import DataFrameBatch

# Assume batcher is already trained
status = batcher.generate_all_batch_lines(num_lines=500)

for idx, summary in status.items():
    print(f"Batch {idx}: {summary.valid_lines} valid, "
          f"{summary.invalid_lines} invalid, "
          f"is_valid={summary.is_valid}")

Seeded Generation

# Generate with seed values for the first batch
seeds = [
    {"customer_id": "C001", "region": "US"},
    {"customer_id": "C002", "region": "EU"},
]

status = batcher.generate_all_batch_lines(seed_fields=seeds)
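
How the target line count is resolved when seeds are supplied can be sketched as a small pure function. target_line_count is a hypothetical helper that restates the rule from the Inputs table (a seed list overrides num_lines, which otherwise falls back to the config's gen_lines); it is not the library's code.

```python
from typing import List, Optional, Union


def target_line_count(num_lines: Optional[int],
                      seed_fields: Union[dict, List[dict], None],
                      config_gen_lines: int) -> int:
    """Resolve how many valid lines a batch should aim for."""
    if isinstance(seed_fields, list):
        # One generated line per seed dictionary.
        return len(seed_fields)
    if num_lines is not None:
        return num_lines
    return config_gen_lines


demo_seeds = [
    {"customer_id": "C001", "region": "US"},
    {"customer_id": "C002", "region": "EU"},
]

from_seed_list = target_line_count(500, demo_seeds, config_gen_lines=5000)
from_single_seed = target_line_count(500, {"region": "US"}, config_gen_lines=5000)
from_config = target_line_count(None, None, config_gen_lines=5000)
```

Under this rule, the two-entry seed list above would produce a target of two lines regardless of any num_lines argument.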

Single Batch Generation

summary = batcher.generate_batch_lines(
    batch_idx=0,
    num_lines=100,
    max_invalid=500,
    raise_on_exceed_invalid=True,
)
print(f"Generated {summary.valid_lines} valid lines for batch 0")
