Principle: Gretel Synthetics DataFrame Reassembly
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
DataFrame reassembly is the final step in the batch synthesis pipeline where per-batch synthetic data buffers are concatenated column-wise into a single DataFrame that mirrors the schema of the original training data.
Description
After generation, each batch holds its synthetic rows in an in-memory StringIO buffer (gen_data_stream). Reassembly reads these buffers back as individual DataFrames and concatenates them horizontally (along axis=1). The final DataFrame is then reindexed to match the original header order from the source data, ensuring the output schema is identical to the input.
Two reassembly paths exist in the codebase:
1. Low-level reassembly via batches_to_df()
The simplest approach iterates over the Batch objects, accesses each one's synthetic_df property (which reads gen_data_stream as a CSV using pd.read_csv), and concatenates the resulting DataFrames. The concatenated DataFrame is then column-reordered using either original_headers (preserved from the training DataFrame) or master_header_list (reconstructed from batch headers in read mode).
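The low-level path can be sketched with plain pandas. This is not the library's `batches_to_df()` itself; the buffer contents and `original_headers` below are illustrative stand-ins for per-batch `gen_data_stream` buffers and the preserved header order.

```python
import io
import pandas as pd

# Hypothetical stand-ins for per-batch gen_data_stream buffers: each batch
# holds a CSV covering its own column subset.
batch_streams = [
    io.StringIO("age,income\n34,72000\n51,88000\n"),
    io.StringIO("city,state\nPortland,OR\nAustin,TX\n"),
]

# Column order preserved from the training DataFrame (original_headers).
original_headers = ["city", "age", "state", "income"]

def batches_to_df_sketch(streams, headers):
    """Minimal sketch of low-level reassembly: read each buffer back as a
    DataFrame, concatenate column-wise, then restore the original order."""
    frames = [pd.read_csv(s) for s in streams]
    combined = pd.concat(frames, axis=1)
    return combined[headers]

df = batches_to_df_sketch(batch_streams, original_headers)
print(list(df.columns))  # ['city', 'age', 'state', 'income']
```

The final column selection is what guarantees the output schema matches the input, regardless of the order in which batches were generated.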
2. High-level reassembly via RecordFactory.generate_all()
The RecordFactory provides a higher-level, record-oriented generation pipeline. Instead of generating per-batch buffers separately, it creates generators for all batches simultaneously and constructs full records by pulling one line from each batch's generator in sequence. Records are validated both at the per-batch level and optionally at the whole-record level. The completed records are buffered (either as a list of dicts or into a _BufferedDataFrame that reads like a CSV), and returned as a GenerationResult.
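The record-oriented idea can be illustrated with a small sketch. The batch descriptors and line generators here are hypothetical stand-ins for the per-batch `generate_text()` generators; the point is that one line is pulled from every batch before a record is considered complete.

```python
# Hypothetical batch descriptors: each batch owns a column subset and a
# generator yielding one CSV line per call (stand-in for generate_text()).
batches = {
    ("age", "income"): iter(["34,72000", "51,88000"]),
    ("city", "state"): iter(["Portland,OR", "Austin,TX"]),
}

def record_stream(batches, num_records):
    """Sketch of the record-oriented path: one line from every batch's
    generator is combined into a single whole record."""
    for _ in range(num_records):
        record = {}
        for headers, gen in batches.items():
            values = next(gen).split(",")
            record.update(zip(headers, values))
        yield record

records = list(record_stream(batches, 2))
print(records[0])
```

A whole-record validator would run on each assembled `record` dict before it is buffered, which is where the factory's optional cross-batch validation hooks in.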
Key design considerations:
- Column ordering — The original column order is stored during __init__ (as original_headers) and persisted to disk as original_headers.json. This ensures that even in read mode, the reassembled DataFrame has the correct column sequence.
- Type inference — When RecordFactory.generate_all(output="df") is used, the _BufferedDataFrame class writes records through a CSV writer and reads them back via pd.read_csv, allowing pandas to infer column dtypes as if reading from a file.
- Error resilience — RecordFactory.generate_all() catches RuntimeError and StopIteration during iteration and returns whatever records have been buffered so far, wrapped in a GenerationResult that can also carry the exception.
- Progress callbacks — The factory supports an optional callback that receives periodic GenerationProgress updates. A watchdog thread mode (callback_threading=True) ensures callbacks fire even during long stretches of invalid generation.
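The type-inference trick from the `_BufferedDataFrame` bullet can be demonstrated directly; the class name and record values below are illustrative, but the mechanism is plain `csv` plus `pd.read_csv`:

```python
import csv
import io
import pandas as pd

# Sketch of the _BufferedDataFrame idea (values are illustrative): records are
# written through a CSV writer, so pandas infers dtypes on read-back exactly
# as it would when loading a file from disk.
records = [
    {"age": "34", "income": "72000.5", "city": "Portland"},
    {"age": "51", "income": "88000.0", "city": "Austin"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["age", "income", "city"])
writer.writeheader()
writer.writerows(records)

buffer.seek(0)
df = pd.read_csv(buffer)
print(df.dtypes)  # age -> int64, income -> float64, city -> object
```

Even though every record value starts life as a string, the round trip through CSV text gives the output DataFrame proper numeric dtypes.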
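The watchdog-thread pattern behind `callback_threading=True` can be sketched generically. This is not the library's implementation; the class below is a hypothetical illustration of a daemon thread firing the callback at a fixed interval even when the generation loop makes no progress.

```python
import threading
import time

class ProgressWatchdog:
    """Illustrative watchdog: a daemon thread invokes the progress callback
    on a timer, so the caller still sees updates during long stretches of
    invalid generation."""

    def __init__(self, callback, interval=0.05):
        self.callback = callback
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.valid_count = 0  # would be updated by the generation loop

    def _run(self):
        # Event.wait() returns False on timeout, so this loops until stopped.
        while not self._stop.wait(self.interval):
            self.callback(self.valid_count)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

updates = []
with ProgressWatchdog(updates.append) as wd:
    time.sleep(0.2)  # simulate a stretch producing no valid records
# The callback still fired despite zero progress.
```

Without the watchdog, a callback driven only from the generation loop would go silent whenever every candidate line failed validation.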
Usage
Use batches_to_df() for the simple low-level path after generate_all_batch_lines(). Use RecordFactory (via create_record_factory()) when you need whole-record validation, seeded generation, or streaming iteration over synthetic records.
Theoretical Basis
Reassembly reverses the vertical partitioning performed during batch data export. Given k batch DataFrames S_1, ..., S_k (each with the same number of rows but different column subsets), the final synthetic DataFrame is:
S = horizontal_concat(S_1, S_2, ..., S_k)
S = S[original_column_order]
This is equivalent to a natural join on row index, assuming all batches generated the same number of rows. If batch row counts differ, pandas concat with axis=1 will introduce NaN values for missing rows.
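The NaN-padding behavior on mismatched row counts is easy to observe with a minimal example (the frames below are illustrative):

```python
import pandas as pd

# Two batch frames with unequal row counts: pd.concat(axis=1) aligns on the
# row index and pads the shorter frame with NaN.
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"y": [10.0, 20.0]})

combined = pd.concat([a, b], axis=1)
print(combined["y"].isna().sum())  # 1 missing value in the padded row
```

This is why the reassembly step assumes every batch generated the same number of valid rows; the record-oriented path avoids the problem entirely by building rows across all batches at once.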
RecordFactory pseudocode:
function generate_all(batches, num_lines, output):
    generators = [(batch, generate_text(batch.config)) for batch in batches]
    buffer = new BufferedDataFrame() if output == "df" else new BufferedDicts()
    valid_count = 0
    while valid_count < num_lines:
        record = {}
        for (batch, gen) in generators:
            line = next valid line from gen
            record.update(zip(batch.headers, line.values))
        if whole_record_validator(record):
            buffer.add(record)
            valid_count += 1
    return buffer.get_records()
The record-oriented approach guarantees that each row in the output DataFrame is internally consistent across all batches, because all batch contributions to a single row are generated and validated together.