Workflow:Gretelai Gretel synthetics LSTM Text Generation

Knowledge Sources	gretel-synthetics, gretel-synthetics Docs, TensorFlow Privacy
Domains	Synthetic_Data, NLP, Deep_Learning
Last Updated	2026-02-14 19:00 GMT

Overview

End-to-end process for training a character-level or subword LSTM neural network on text data and generating synthetic text records using TensorFlow.

Description

This workflow implements the core text generation pipeline in gretel-synthetics. It trains a recurrent neural network (LSTM) on line-delimited text data to learn sequential patterns, then uses the trained model to generate novel synthetic text that preserves the statistical properties of the original data. The pipeline supports both character-level tokenization and SentencePiece subword tokenization, optional differential privacy via DP-SGD, configurable early stopping, and parallel generation across multiple CPU workers.

Key outputs:

A trained LSTM model checkpoint
Synthetic text records matching the learned distribution

Usage

Execute this workflow when you have line-delimited text data (one record per line, optionally with delimited fields) and need to generate synthetic text records that preserve the structure and patterns of the original data. This is suitable for single-column or simple multi-field data where each line is an independent record. For tabular DataFrames with many columns, use the DataFrame Batch Synthesis workflow instead.

Execution Steps

Step 1: Configuration

Create a TensorFlowConfig object specifying the path to the input text data, the checkpoint directory for model storage, and training hyperparameters. Key parameters include the number of epochs, batch size, sequence length, embedding dimension, RNN units, dropout rate, and whether to enable differential privacy. The configuration also controls generation settings such as temperature and maximum characters per line.

Key considerations:

Set vocab_size to 0 for character tokenization, or a positive value (default 20000) for SentencePiece subword tokenization
Keep validation_split enabled (default True) so that held-out data can be used to detect overfitting
For differential privacy, set dp=True and tune dp_noise_multiplier, dp_l2_norm_clip, and dp_microbatches
For field-delimited data, set field_delimiter to the separator character
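
The considerations above could translate into a configuration along these lines. This is a minimal sketch rather than a verified invocation: vocab_size, validation_split, dp, dp_noise_multiplier, dp_l2_norm_clip, dp_microbatches, and field_delimiter are named in the text above, while input_data_path, checkpoint_dir, epochs, the file paths, and every value shown are assumptions to check against the installed gretel-synthetics version. The import is deferred so the keyword dictionary can be inspected without TensorFlow installed.

```python
# Keyword arguments for a character-level, differentially private run.
# All values are illustrative assumptions, not recommended settings.
config_kwargs = {
    "input_data_path": "data/records.txt",  # assumed parameter name and path
    "checkpoint_dir": "checkpoints/lstm",   # assumed parameter name and path
    "epochs": 30,                           # assumed parameter name
    "vocab_size": 0,                        # 0 selects character tokenization
    "validation_split": True,
    "field_delimiter": ",",
    "dp": True,
    "dp_noise_multiplier": 0.1,
    "dp_l2_norm_clip": 3.0,
    "dp_microbatches": 1,
}


def build_config(kwargs):
    """Create the config object; requires gretel-synthetics to be installed."""
    from gretel_synthetics.config import TensorFlowConfig  # deferred import
    return TensorFlowConfig(**kwargs)
```

Calling build_config(config_kwargs) would then produce the object used by the later steps, provided the assumed parameter names match the installed release.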

Step 2: Tokenizer Training

Train a tokenizer on the input text to build a vocabulary that maps text to integer IDs. If vocab_size is 0, a character-level tokenizer is created that maps each unique character to an ID. Otherwise, a SentencePiece tokenizer learns subword units using byte-pair encoding. The tokenizer also annotates the training data by converting each line into a sequence of token IDs stored in the training data file.

Key considerations:

SentencePiece is recommended for non-DP mode as it provides faster training and better accuracy
Character tokenization is recommended for DP mode to avoid memorizing sensitive tokens
The tokenizer handles special delimiter tokens if field_delimiter is configured
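
The character-level case can be illustrated in a few lines of plain Python. This is a sketch of the idea only, not the library's tokenizer: the function names build_char_vocab, encode, and decode are hypothetical, and the sorted ordering of IDs is an arbitrary choice made for reproducibility.

```python
def build_char_vocab(lines):
    """Map every unique character in the corpus to an integer ID."""
    chars = sorted({ch for line in lines for ch in line})
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def encode(text, char_to_id):
    """Convert text to a list of token IDs, one per character."""
    return [char_to_id[ch] for ch in text]


def decode(ids, id_to_char):
    """Convert a list of token IDs back to text."""
    return "".join(id_to_char[i] for i in ids)


# Two comma-delimited records; the newline is part of the vocabulary.
lines = ["ab,1\n", "ba,2\n"]
char_to_id, id_to_char = build_char_vocab(lines)
ids = encode("ab,1\n", char_to_id)
```

Because the newline character receives its own ID, a model trained on such sequences can learn where records end, which is what lets generation stop at a line boundary.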

Step 3: Model Building

Construct the LSTM neural network architecture. The model consists of an embedding layer, one or more LSTM layers with configurable units and dropout, and a dense output layer over the vocabulary. If differential privacy is enabled, a DP-SGD variant of the model is built that clips per-example gradients and adds calibrated noise.

Key considerations:

The standard model uses the Adam optimizer; the DP model uses a DP-SGD optimizer
The model architecture is: Embedding → LSTM → Dropout → Dense
GPU availability is checked and a warning issued if only CPU is available
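
The clip-and-noise step that distinguishes DP-SGD from ordinary gradient descent can be sketched without any deep learning framework. The example below is an illustration under simplifying assumptions, not the TensorFlow Privacy implementation: each example is treated as its own microbatch, gradients are plain Python lists, and the function names are hypothetical. The noise standard deviation is taken as the noise multiplier times the clipping norm, the usual DP-SGD convention.

```python
import math
import random


def clip_to_l2_norm(grad, l2_norm_clip):
    """Scale a gradient vector down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_l2_norm(g, l2_norm_clip) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(clipped) for n in noisy]


# Two per-example gradients with L2 norms 5.0 and 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]
# With a zero noise multiplier only the clipping is visible.
no_noise = dp_average_gradient(grads, 1.0, 0.0, random.Random(0))
# With a positive multiplier the averaged gradient is perturbed.
with_noise = dp_average_gradient(grads, 1.0, 0.5, random.Random(0))
```

Clipping bounds how much any single record can move the weights, and the noise masks what remains; dp_l2_norm_clip and dp_noise_multiplier from Step 1 play the roles of the two numeric arguments here.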

Step 4: Model Training

Train the LSTM model on the tokenized text data for the configured number of epochs. The training loop processes sequences of token IDs, predicting the next token at each position. Early stopping monitors validation loss (or training loss if no validation split is used) and halts training when the loss fails to improve for a configurable number of epochs. Model checkpoints are saved, with the best model tracked by default.
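
Next-token prediction means each training window is paired with a copy of itself shifted by one position. The sketch below shows one common way to build such pairs; the function name make_training_pairs is hypothetical, and the non-overlapping chunking is an assumption about how windows are cut, not a statement of how the library does it.

```python
def make_training_pairs(token_ids, seq_length):
    """Split token IDs into (input, target) windows offset by one position."""
    pairs = []
    # Each chunk holds seq_length + 1 tokens so that input and target
    # both contain exactly seq_length tokens; a short tail is dropped.
    for start in range(0, len(token_ids) - seq_length, seq_length + 1):
        chunk = token_ids[start:start + seq_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_training_pairs([4, 5, 1, 2, 0, 5, 4, 1, 3, 0], seq_length=4)
```

At every position the target holds the token that follows the corresponding input token, so the loss at each step measures how well the model predicted the next token.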

Key considerations:

Early stopping patience defaults to 5 epochs with a minimum delta of 0.001
An optional epoch_callback receives an EpochState with loss, accuracy, and DP epsilon/delta values
The optional max_training_time_seconds setting imposes a wall-clock limit on training
Model parameters are serialized to JSON in the checkpoint directory before training begins
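
The patience and minimum-delta rule can be made concrete with a small simulation. This is a sketch of the usual early-stopping logic using the defaults stated above (patience 5, minimum delta 0.001); the function name epochs_until_stop and the loss values are hypothetical, and the library's own callback may differ in detail.

```python
def epochs_until_stop(losses, patience=5, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Loss improves for three epochs, then changes by less than min_delta.
losses = [1.0, 0.8, 0.7, 0.6995, 0.6999, 0.7, 0.71, 0.7]
stop_epoch = epochs_until_stop(losses)
```

In this trace the last sufficient improvement happens at epoch 3, so five stale epochs later, at epoch 8, training would halt. An improvement smaller than min_delta, such as 0.7 to 0.6995, counts as stale.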

Step 5: Synthetic Text Generation

Load the trained model and tokenizer from the checkpoint directory and generate synthetic text records. The generator predicts tokens one at a time, sampling from the output distribution scaled by a temperature parameter. Generation continues until a newline token is produced or the maximum character limit is reached. An optional line_validator callback filters generated records, and generation stops after the requested number of valid lines is produced.
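
Temperature scaling divides the model's output logits by the temperature before they are turned into probabilities. The sketch below illustrates that step in plain Python; the function names and the example logits are hypothetical, and the real generator operates on TensorFlow tensors rather than lists.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token ID from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
cold = temperature_probs(logits, 0.5)  # sharper: favors the top token
hot = temperature_probs(logits, 2.0)   # flatter: spreads probability out
token = sample_token(logits, 1.0, random.Random(0))
```

A temperature below 1 concentrates probability on the most likely token, while a temperature above 1 flattens the distribution, which is why lower settings yield more predictable text.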

Key considerations:

Temperature controls randomness: lower values produce more predictable text, while higher values produce more diverse text
Parallel generation distributes work across multiple CPU workers using loky process pools
A start_string can seed generation to influence the output prefix
The generator yields GenText objects with valid/invalid status and explanation fields
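
The interaction between the validator and the requested line count can be sketched as a small generator. Everything here is a stand-in: Record is a hypothetical substitute for GenText, generate_valid and two_fields are illustrative names, the candidate lines are supplied directly instead of being sampled from a model, and treating a raised exception as rejection is an assumed convention.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Record:
    """Hypothetical stand-in for the GenText objects described above."""
    text: str
    valid: bool
    explain: Optional[str] = None


def generate_valid(candidates: Iterable[str], num_lines: int,
                   line_validator: Callable[[str], None]) -> Iterator[Record]:
    """Yield every candidate with its status; stop after num_lines valid ones."""
    valid_count = 0
    for text in candidates:
        if valid_count >= num_lines:
            return
        try:
            line_validator(text)  # assumed convention: raise to reject a line
        except Exception as err:
            yield Record(text, False, str(err))
        else:
            valid_count += 1
            yield Record(text, True)


def two_fields(line: str) -> None:
    """Example validator: accept only lines with exactly two fields."""
    if len(line.split(",")) != 2:
        raise ValueError("expected 2 fields")


records = list(generate_valid(["a,1", "broken", "b,2", "c,3"], 2, two_fields))
```

Invalid lines are still yielded, with the reason attached, but only valid ones count toward the requested total, so the fourth candidate is never consumed.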
