Workflow:Gretelai Gretel synthetics DataFrame Batch Synthesis

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Tabular_Data, Deep_Learning
Last Updated 2026-02-14 19:00 GMT

Overview

End-to-end process for generating synthetic tabular data from a pandas DataFrame by splitting columns into correlated batches, training independent LSTM models per batch, and reassembling the results.

Description

This workflow wraps the LSTM text generation pipeline to handle tabular DataFrames with potentially many columns. The core idea is to split the DataFrame columns into smaller groups (batches) based on correlation clustering, train a separate LSTM model on each batch (treating each row as a delimited text line), and then generate synthetic rows from each batch model independently. Finally, the per-batch synthetic data is concatenated back into a full synthetic DataFrame matching the original schema. This approach enables synthetic data generation for wide tables that would be impractical to model as single text sequences.

Key outputs:

  • A set of trained per-batch LSTM model checkpoints
  • A synthetic DataFrame matching the original column structure

Usage

Run this workflow when you have a pandas DataFrame (tabular data) and need synthetic rows that preserve column distributions and the correlations between columns that share a batch. Because each batch is modeled independently, relationships between columns placed in different batches are not learned, which is why correlated columns should be grouped together. This is the recommended approach for CSV-style structured data with the LSTM engine. It handles high column counts by automatically partitioning columns into manageable batches. For simple line-based text generation, use the LSTM Text Generation workflow instead.

Execution Steps

Step 1: DataFrame Preparation and Column Clustering

Load the source DataFrame and determine how to partition its columns into batches. By default, columns are split into groups of a configurable batch_size (default 15). Alternatively, a correlation-based clustering algorithm can group columns that are statistically related, ensuring correlated fields stay together within the same batch. The original column order is preserved for later reassembly.
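
The default size-based split can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the function name split_headers is hypothetical, and the assumption that a manual batch_headers grouping takes precedence over batch_size is mine.

```python
from typing import List, Optional


def split_headers(
    headers: List[str],
    batch_size: int = 15,
    batch_headers: Optional[List[List[str]]] = None,
) -> List[List[str]]:
    """Return column groups. Hypothetical helper, not the library's code."""
    if batch_headers is not None:
        # Assumption: a manual grouping overrides size-based splitting.
        return [list(group) for group in batch_headers]
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Consecutive slices keep the original column order inside each batch.
    return [headers[i:i + batch_size] for i in range(0, len(headers), batch_size)]


columns = [f"col_{i}" for i in range(32)]
batches = split_headers(columns)
print([len(b) for b in batches])  # [15, 15, 2]
```

With 32 columns and the default size, the last batch holds only the two leftover columns, so batches are not guaranteed to be equally wide.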

Key considerations:

  • The batch_size parameter controls the maximum number of columns per batch
  • Custom batch_headers can be provided to manually control column grouping
  • The header_clusters utility computes Pearson and Cramér's V correlations to identify related columns
  • Each batch gets its own subdirectory under the main checkpoint directory
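
As a rough illustration of the categorical measure named above, the following sketch computes the textbook (uncorrected) Cramér's V between two columns using only the standard library. Whether header_clusters uses this exact variant or a bias-corrected one is not stated here, so treat the function as an assumption-laden example rather than the utility's code.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cramers_v(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Uncorrected Cramér's V for two categorical sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("both columns must have the same number of rows")
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))  # contingency table as a sparse counter
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return 0.0  # a constant column carries no association
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


linked = cramers_v(["a", "b", "a", "b"], ["u", "v", "u", "v"])
unrelated = cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"])
print(linked, unrelated)  # 1.0 0.0
```

Columns whose pairwise score is high are candidates for the same batch; the clustering step that turns scores into groups is not shown.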

Step 2: Per-Batch Data Export

For each column batch, extract the corresponding columns from the DataFrame and write them to a delimited CSV file. Each batch receives its own TensorFlowConfig with the field_delimiter set to the column separator. The configuration, header list, and training data file path are stored in the batch's checkpoint directory.

Key considerations:

  • The field delimiter is configurable (default comma)
  • Each batch stores its own headers.json mapping column names
  • The gen_lines parameter in the config controls how many synthetic rows to generate
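
A minimal sketch of the export step, using the standard csv and json modules instead of pandas so that it runs anywhere. The function name export_batch and the file name train.csv are hypothetical, and writing the training file without a header row (because the names live in headers.json) is an assumption.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def export_batch(
    rows: List[Dict[str, str]],
    headers: List[str],
    batch_dir: Path,
    delimiter: str = ",",
) -> Path:
    """Write one column batch as delimited text plus a headers.json file."""
    batch_dir.mkdir(parents=True, exist_ok=True)
    (batch_dir / "headers.json").write_text(json.dumps(headers))
    train_path = batch_dir / "train.csv"  # hypothetical file name
    with train_path.open("w", newline="") as fh:
        writer = csv.writer(fh, delimiter=delimiter, lineterminator="\n")
        for row in rows:
            # Only this batch's columns are written, in header order.
            writer.writerow([row[h] for h in headers])
    return train_path


rows = [
    {"age": "34", "city": "Lyon", "plan": "pro"},
    {"age": "51", "city": "Oslo", "plan": "free"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = export_batch(rows, ["age", "city"], Path(tmp) / "batch_0")
    exported_lines = path.read_text().splitlines()
    saved_headers = json.loads((path.parent / "headers.json").read_text())

print(exported_lines)  # ['34,Lyon', '51,Oslo']
```

Each exported line is what the batch's LSTM sees as one training record, so the delimiter must match the field_delimiter in that batch's config.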

Step 3: Per-Batch Model Training

Train an independent LSTM model for each column batch using the train_all_batches method. Each batch follows the full LSTM training pipeline: tokenizer training, model building, and model fitting. Training proceeds sequentially across batches, with each batch using its own checkpoint directory.

Key considerations:

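  • Training parameters (epochs, batch_size, early_stopping) are shared across all column batches; here batch_size is the training batch size from the model config, not the column batch_size from Step 1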
  • An epoch_callback can be provided to monitor training progress across batches, with the batch number injected into the EpochState
  • Each batch model learns the joint distribution of its assigned columns
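
The callback pattern described above can be sketched as follows. Everything here is a stand-in: the EpochState class is a simplified hypothetical version of the library's type, and train_one_batch fakes the training loop so that only the control flow (sequential batches, batch number attached to each epoch report) is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EpochState:
    """Hypothetical, simplified stand-in for the library's EpochState."""
    epoch: int
    loss: float
    batch: Optional[int] = None


def train_one_batch(
    batch_idx: int, epochs: int, callback: Callable[[EpochState], None]
) -> None:
    """Placeholder for tokenizer training, model building, and fitting."""
    for epoch in range(epochs):
        fake_loss = 1.0 / (epoch + 1)  # made-up value, no real training happens
        # The batch number is injected so one callback can serve all batches.
        callback(EpochState(epoch=epoch, loss=fake_loss, batch=batch_idx))


def train_all(
    num_batches: int, epochs: int, callback: Callable[[EpochState], None]
) -> None:
    # Batches are trained one after another, never concurrently.
    for batch_idx in range(num_batches):
        train_one_batch(batch_idx, epochs, callback)


seen: List[EpochState] = []
train_all(num_batches=2, epochs=3, callback=seen.append)
print([(s.batch, s.epoch) for s in seen])
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Because the same settings apply to every batch, a progress monitor needs the batch field to tell which model an epoch report belongs to.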

Step 4: Per-Batch Synthetic Generation

Generate synthetic text records from each trained batch model using generate_all_batch_lines. Each batch model independently produces synthetic rows matching its column schema. Generated lines are validated against column count expectations, and invalid lines are tracked separately.

Key considerations:

  • max_invalid controls how many invalid lines are tolerated before raising an error
  • Parallel generation across CPU workers is supported within each batch
  • A seed_fields dictionary can provide starting values to guide generation
  • Results are accumulated in an in-memory StringIO buffer per batch
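
The validation and accumulation logic can be approximated like this. It is a sketch under stated assumptions: the names collect_valid_lines and TooManyInvalidError are hypothetical, the column check is a naive split that ignores quoted delimiters, and raising once the invalid count exceeds max_invalid is my reading of "tolerated", not a confirmed detail of generate_all_batch_lines.

```python
import io
from typing import Iterable, Tuple


class TooManyInvalidError(RuntimeError):
    """Hypothetical error raised when the invalid-line budget is exhausted."""


def collect_valid_lines(
    lines: Iterable[str],
    num_columns: int,
    delimiter: str = ",",
    max_invalid: int = 1000,
) -> Tuple[io.StringIO, int]:
    """Keep lines with the expected column count; count the rest as invalid."""
    buffer = io.StringIO()
    invalid = 0
    for line in lines:
        # Naive check: quoted fields containing the delimiter are not handled.
        if len(line.split(delimiter)) == num_columns:
            buffer.write(line + "\n")
        else:
            invalid += 1
            if invalid > max_invalid:
                raise TooManyInvalidError(f"more than {max_invalid} invalid lines")
    return buffer, invalid


sample = ["1,a", "2", "3,c", "4,d,e"]
buffer, invalid_count = collect_valid_lines(sample, num_columns=2, max_invalid=2)
print(repr(buffer.getvalue()), invalid_count)  # '1,a\n3,c\n' 2
```

The buffer holds only the lines that can later be parsed into the batch's columns, which is why different batches can end up with different row counts.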

Step 5: DataFrame Reassembly

Combine the per-batch synthetic DataFrames back into a single DataFrame matching the original column order. The batches_to_df method concatenates all batch outputs column-wise and reorders columns to match the original header sequence.

Key considerations:

  • If some batches generate fewer valid rows than others, the per-batch outputs have unequal row counts, so the reassembled DataFrame may not be fully populated for every row
  • The original_headers.json file preserves the exact column ordering from the source DataFrame
  • A RecordFactory alternative provides row-by-row generation across all batches simultaneously
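
Reassembly can be illustrated without pandas by treating each batch as a mapping from column name to generated values. The function reassemble is hypothetical, and truncating to the shortest batch is a choice made for this sketch to handle unequal row counts; it is not necessarily what batches_to_df does.

```python
from typing import Dict, List


def reassemble(
    batches: List[Dict[str, List[str]]], original_headers: List[str]
) -> List[Dict[str, str]]:
    """Merge per-batch columns into rows that follow the original column order."""
    merged: Dict[str, List[str]] = {}
    for batch in batches:
        merged.update(batch)  # column names are assumed unique across batches
    if not merged:
        return []
    # Sketch-only policy: keep just the rows that every batch can fill.
    n_rows = min(len(values) for values in merged.values())
    return [{h: merged[h][i] for h in original_headers} for i in range(n_rows)]


batch_outputs = [
    {"b": ["1", "2", "3"], "a": ["x", "y", "z"]},
    {"c": ["p", "q"]},  # this batch produced one fewer valid row
]
records = reassemble(batch_outputs, original_headers=["a", "b", "c"])
print(records)
# [{'a': 'x', 'b': '1', 'c': 'p'}, {'a': 'y', 'b': '2', 'c': 'q'}]
```

Note that rows are paired purely by position, so values from different batches in the same output row were generated independently of each other.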
