
Implementation:Gretelai Gretel synthetics DataFrameBatch Batches To Df

From Leeroopedia
Knowledge Sources
Domains: Synthetic_Data, Tabular_Data
Last Updated: 2026-02-14 19:00 GMT

Overview

A concrete tool from the gretel-synthetics library for reassembling per-batch synthetic data into a single DataFrame.

Description

DataFrameBatch.batches_to_df() concatenates the synthetic DataFrames from all batches into a single output DataFrame with the original column order. It iterates over the batch objects, reads each batch's synthetic_df property (which parses the in-memory gen_data_stream StringIO as a CSV), and horizontally concatenates them using pd.concat(..., axis=1). The resulting DataFrame is then reindexed to match original_headers (or master_header_list if original headers are unavailable).
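The reassembly described above can be sketched with plain pandas. This is a minimal illustration, not the library's internals: the buffer contents and column names below are invented stand-ins for the per-batch gen_data_stream buffers.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for each batch's gen_data_stream buffer: every
# batch holds a CSV covering only the columns assigned to that batch.
batch_streams = [
    StringIO("name,age\nalice,31\nbob,27\n"),
    StringIO("city,price\nparis,10.5\noslo,8.0\n"),
]
original_headers = ["age", "name", "price", "city"]  # original column order

# Parse each buffer as CSV (mirroring the synthetic_df property), then
# concatenate the frames horizontally and reindex to the original order.
batch_frames = [pd.read_csv(stream) for stream in batch_streams]
combined = pd.concat(batch_frames, axis=1)
combined = combined.reindex(columns=original_headers)

print(combined.columns.tolist())  # ['age', 'name', 'price', 'city']
```

The reindex step is what restores the caller's original column order after the batches were split column-wise for training.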

The higher-level alternative is RecordFactory.generate_all() (lines 833-940), accessible via DataFrameBatch.create_record_factory(). The RecordFactory creates generators for all batches simultaneously and constructs complete records by pulling one valid line from each batch per row. Records are validated at both the per-batch and whole-record level. Results are buffered either as a list of dictionaries (default) or in a _BufferedDataFrame (when output="df"), and returned as a GenerationResult object containing the records and any exception that may have occurred.
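The "one valid line from each batch per row" assembly can be illustrated with plain Python iterators. This is a toy sketch under assumed data; none of the names below are the library's internals.

```python
# Two toy "batches", each yielding parsed lines for its own columns.
batch_lines = [
    iter([{"name": "alice"}, {"name": "bob"}]),
    iter([{"price": "10.5"}, {"price": "-1"}]),
]

def validator(record: dict) -> bool:
    # Whole-record check spanning columns from different batches.
    return float(record["price"]) >= 0

records = []
while True:
    try:
        # Build one complete record by pulling a line from every batch.
        record = {}
        for lines in batch_lines:
            record.update(next(lines))
    except StopIteration:
        break
    if validator(record):
        records.append(record)

print(records)  # [{'name': 'alice', 'price': '10.5'}]
```

The second assembled row fails the whole-record validator and is dropped, which is the kind of cross-batch check batches_to_df() alone cannot perform.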

Key classes involved:

  • RecordFactory (line 560) — stateful factory with iterator protocol, reset(), generate_all(), and summary property.
  • _BufferedDataFrame — writes records through a CSV writer (file or StringIO), reads back as DataFrame via pd.read_csv for dtype inference.
  • _BufferedDicts — simple list-of-dicts accumulator.
  • GenerationResult — container holding records (DataFrame or list) and optional exception.
  • GenerationProgress — dataclass communicated to callbacks with valid/invalid counts and completion percentage.
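The _BufferedDataFrame idea — write records through a CSV writer, then read the buffer back so pandas infers dtypes — can be sketched as follows. This is a simplified illustration, not the library's class.

```python
import csv
from io import StringIO

import pandas as pd

# Write dict records through a CSV writer into an in-memory buffer.
buffer = StringIO()
fieldnames = ["name", "age", "price"]
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for record in [{"name": "alice", "age": "31", "price": "10.5"},
               {"name": "bob", "age": "27", "price": "8.0"}]:
    writer.writerow(record)

# Reading the buffer back lets pandas infer dtypes from the CSV text,
# instead of keeping every value as a string.
buffer.seek(0)
df = pd.read_csv(buffer)
print(df.dtypes.to_dict())
```

The round-trip through CSV is what turns the all-string record values into numeric columns where possible.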

Usage

Use batches_to_df() as the final step after generate_all_batch_lines() for the simple low-level workflow. Use RecordFactory for record-level validation, seeded generation from a list of seeds, or streaming record-by-record iteration.

Code Reference

Source Location

  • Repository: gretel-synthetics
  • File: src/gretel_synthetics/batch.py
  • Lines: 1406-1420 (batches_to_df), 560-949 (RecordFactory), 833-940 (generate_all)

Signature

def batches_to_df(self) -> pd.DataFrame:
    """Convert all batches to a single synthetic data DataFrame.

    Returns:
        A single DataFrame that is the concatenation of all the
        batch DataFrames.
    """

# RecordFactory.generate_all
def generate_all(
    self,
    output: Optional[str] = None,
    callback: Optional[callable] = None,
    callback_interval: int = 30,
    callback_threading: bool = False,
) -> GenerationResult:

Import

from gretel_synthetics.batch import DataFrameBatch

I/O Contract

Inputs

batches_to_df:

  • self (DataFrameBatch, required) — must have populated gen_data_stream buffers in each batch (i.e., generation has been run).

RecordFactory.generate_all:

  • output (Optional[str], default None) — if "df", return records as a DataFrame with inferred dtypes; if None, return a list of dicts.
  • callback (Optional[callable]) — callable receiving GenerationProgress instances for progress monitoring.
  • callback_interval (int, default 30) — minimum seconds between callback invocations.
  • callback_threading (bool, default False) — if True, use a watchdog thread to fire callbacks even during invalid-line stretches.
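A progress callback can be a plain function that accepts each GenerationProgress snapshot. The sketch below defines a stand-in dataclass so it runs standalone; the field names are assumptions based on the description above, not confirmed API.

```python
from dataclasses import dataclass

# Stand-in for gretel_synthetics.batch.GenerationProgress so this sketch
# is self-contained; the real dataclass's field names may differ.
@dataclass
class GenerationProgress:
    current_valid_count: int = 0
    current_invalid_count: int = 0
    completion_percent: float = 0.0

def progress_callback(progress: GenerationProgress) -> str:
    # Format a one-line status message from the progress snapshot.
    return (f"{progress.completion_percent:.0f}% complete: "
            f"{progress.current_valid_count} valid, "
            f"{progress.current_invalid_count} invalid")

print(progress_callback(GenerationProgress(450, 12, 90.0)))
```

With the real factory, such a function would be passed as the callback argument to generate_all(), with callback_interval controlling how often it fires.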

Outputs

batches_to_df:

  • return (pd.DataFrame) — a single DataFrame with all columns in original order and rows from synthetic generation.

RecordFactory.generate_all:

  • return (GenerationResult) — contains records (pd.DataFrame or List[dict]) and an optional exception (Exception or None).

Usage Examples

Basic Example: Low-Level Reassembly

from gretel_synthetics.batch import DataFrameBatch

# Assume batcher is trained and generation is complete
batcher.generate_all_batch_lines(num_lines=1000)
synthetic_df = batcher.batches_to_df()

print(synthetic_df.shape)
print(synthetic_df.columns.tolist())

RecordFactory Example

# Create a record factory with whole-record validation
def my_validator(record: dict) -> bool:
    return float(record.get("price", 0)) >= 0

factory = batcher.create_record_factory(
    num_lines=500,
    validator=my_validator,
)

# Generate all records as a DataFrame
result = factory.generate_all(output="df")
synthetic_df = result.records
print(f"Generated {len(synthetic_df)} records")
if result.exception:
    print(f"Warning: generation ended early due to {result.exception}")

Streaming Iteration

factory = batcher.create_record_factory(num_lines=100)

for record in factory:
    # Each record is a dict with all column values as strings
    process(record)

print(factory.summary)

