Implementation: gretel-synthetics DataFrameBatch.batches_to_df
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
A concrete tool, provided by the gretel-synthetics library, for reassembling per-batch synthetic data into a single DataFrame.
Description
DataFrameBatch.batches_to_df() concatenates the synthetic DataFrames from all batches into a single output DataFrame with the original column order. It iterates over the batch objects, reads each batch's synthetic_df property (which parses the in-memory gen_data_stream StringIO as a CSV), and horizontally concatenates them using pd.concat(..., axis=1). The resulting DataFrame is then reindexed to match original_headers (or master_header_list if original headers are unavailable).
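The reassembly step described above can be approximated with plain pandas. This is a minimal sketch, not the library's actual code; the column names and buffer contents are invented for illustration:

```python
import io

import pandas as pd

# Hypothetical per-batch CSV buffers, standing in for each batch's
# in-memory gen_data_stream (a StringIO holding synthetic rows as CSV).
batch_streams = [
    io.StringIO("name,age\nalice,30\nbob,25\n"),
    io.StringIO("city,score\nparis,0.9\noslo,0.7\n"),
]

# Parse each buffer the way synthetic_df does (CSV -> DataFrame) ...
batch_dfs = [pd.read_csv(s) for s in batch_streams]

# ... then concatenate horizontally and restore the original column order.
original_headers = ["name", "age", "city", "score"]
combined = pd.concat(batch_dfs, axis=1).reindex(columns=original_headers)
print(combined.columns.tolist())  # ['name', 'age', 'city', 'score']
```

The reindex at the end is what guarantees the output columns match the order of the source DataFrame, regardless of how the columns were split across batches.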
The higher-level alternative is RecordFactory.generate_all() (lines 833-940), accessible via DataFrameBatch.create_record_factory(). The RecordFactory creates generators for all batches simultaneously and constructs complete records by pulling one valid line from each batch per row. Records are validated at both the per-batch and whole-record level. Results are buffered either as a list of dictionaries (default) or in a _BufferedDataFrame (when output="df"), and returned as a GenerationResult object containing the records and any exception that may have occurred.
Key classes involved:
- RecordFactory (line 560) — stateful factory with iterator protocol, reset(), generate_all(), and summary property.
- _BufferedDataFrame — writes records through a CSV writer (file or StringIO), reads back as DataFrame via pd.read_csv for dtype inference.
- _BufferedDicts — simple list-of-dicts accumulator.
- GenerationResult — container holding records (DataFrame or list) and optional exception.
- GenerationProgress — dataclass communicated to callbacks with valid/invalid counts and completion percentage.
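The _BufferedDataFrame round-trip can be sketched as follows (an approximation with invented column names, not the library's code): records are written through a CSV writer into a StringIO, then read back with pd.read_csv so pandas infers numeric dtypes from the string values.

```python
import csv
import io

import pandas as pd

# Generated records arrive as dicts of strings, as they do from the
# per-batch generators.
records = [
    {"price": "10.5", "qty": "3"},
    {"price": "2.25", "qty": "7"},
]

# Write the records through a CSV writer into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["price", "qty"])
writer.writeheader()
writer.writerows(records)

# Reading the CSV back lets pandas infer dtypes (here float64 / int64)
# instead of keeping every value as a Python string.
buffer.seek(0)
df = pd.read_csv(buffer)
print(df.dtypes.to_dict())
```

The CSV round-trip is the design choice worth noting: it trades a serialization pass for free dtype inference, rather than building the DataFrame directly from string-valued dicts.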
Usage
Use batches_to_df() as the final step after generate_all_batch_lines() for the simple low-level workflow. Use RecordFactory for record-level validation, seeded generation from a list of seeds, or streaming record-by-record iteration.
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py
- Lines: 1406-1420 (batches_to_df), 560-949 (RecordFactory), 833-940 (generate_all)
Signature
def batches_to_df(self) -> pd.DataFrame:
    """Convert all batches to a single synthetic data DataFrame.

    Returns:
        A single DataFrame that is the concatenation of all the
        batch DataFrames.
    """

# RecordFactory.generate_all
def generate_all(
    self,
    output: Optional[str] = None,
    callback: Optional[callable] = None,
    callback_interval: int = 30,
    callback_threading: bool = False,
) -> GenerationResult:
Import
from gretel_synthetics.batch import DataFrameBatch
I/O Contract
Inputs
batches_to_df:
| Name | Type | Required | Description |
|---|---|---|---|
| self | DataFrameBatch | Yes | Must have populated gen_data_stream buffers in each batch (i.e., generation has been run). |
RecordFactory.generate_all:
| Name | Type | Required | Description |
|---|---|---|---|
| output | Optional[str] | No (default None) | If "df", return records as a DataFrame with inferred dtypes. If None, return list of dicts. |
| callback | Optional[callable] | No | Callable receiving GenerationProgress instances for progress monitoring. |
| callback_interval | int | No (default 30) | Minimum seconds between callback invocations. |
| callback_threading | bool | No (default False) | If True, use a watchdog thread to fire callbacks even during invalid-line stretches. |
Outputs
batches_to_df:
| Name | Type | Description |
|---|---|---|
| return | pd.DataFrame | Single DataFrame with all columns in original order, rows from synthetic generation. |
RecordFactory.generate_all:
| Name | Type | Description |
|---|---|---|
| return | GenerationResult | Contains records (pd.DataFrame or List[dict]) and optional exception (Exception or None). |
Usage Examples
Basic Example: Low-Level Reassembly
from gretel_synthetics.batch import DataFrameBatch
# Assume batcher is trained and generation is complete
batcher.generate_all_batch_lines(num_lines=1000)
synthetic_df = batcher.batches_to_df()
print(synthetic_df.shape)
print(synthetic_df.columns.tolist())
RecordFactory Example
# Create a record factory with whole-record validation
def my_validator(record: dict) -> bool:
    return float(record.get("price", 0)) >= 0

factory = batcher.create_record_factory(
    num_lines=500,
    validator=my_validator,
)
# Generate all records as a DataFrame
result = factory.generate_all(output="df")
synthetic_df = result.records
print(f"Generated {len(synthetic_df)} records")
if result.exception:
    print(f"Warning: generation ended early due to {result.exception}")
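Progress Callback Example
The callback and callback_interval parameters can be exercised as below. This is a sketch: ProgressSnapshot is a hypothetical stand-in for GenerationProgress (whose exact field names live in gretel_synthetics.batch and may differ), so the callback is invoked here with a simulated value rather than by the library.

```python
from dataclasses import dataclass


# Hypothetical stand-in for GenerationProgress; check the real dataclass
# in gretel_synthetics.batch for the actual field names.
@dataclass
class ProgressSnapshot:
    valid_count: int
    invalid_count: int
    completion_percent: float


def on_progress(progress: ProgressSnapshot) -> None:
    # During generate_all, a callback like this fires at most once per
    # callback_interval seconds.
    print(
        f"{progress.completion_percent:.0f}% done "
        f"({progress.valid_count} valid, {progress.invalid_count} invalid)"
    )


# Simulated invocation; with the real library you would instead pass:
# factory.generate_all(output="df", callback=on_progress, callback_interval=10)
on_progress(ProgressSnapshot(valid_count=450, invalid_count=12, completion_percent=90.0))
```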
Streaming Iteration
factory = batcher.create_record_factory(num_lines=100)
for record in factory:
    # Each record is a dict with all column values as strings
    process(record)
print(factory.summary)