Workflow:Gretelai Gretel synthetics LSTM Text Generation
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, NLP, Deep_Learning |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
End-to-end process for training a character-level or subword LSTM neural network on text data and generating synthetic text records using TensorFlow.
Description
This workflow implements the core text generation pipeline in gretel-synthetics. It trains a recurrent neural network (LSTM) on line-delimited text data to learn sequential patterns, then uses the trained model to generate novel synthetic text that preserves the statistical properties of the original data. The pipeline supports both character-level tokenization and SentencePiece subword tokenization, optional differential privacy via DP-SGD, configurable early stopping, and parallel generation across multiple CPU workers.
Key outputs:
- A trained LSTM model checkpoint
- Synthetic text records matching the learned distribution
Usage
Execute this workflow when you have line-delimited text data (one record per line, optionally with delimited fields) and need to generate synthetic text records that preserve the structure and patterns of the original data. This is suitable for single-column or simple multi-field data where each line is an independent record. For tabular DataFrames with many columns, use the DataFrame Batch Synthesis workflow instead.
Execution Steps
Step 1: Configuration
Create a TensorFlowConfig object specifying the path to the input text data, the checkpoint directory for model storage, and training hyperparameters. Key parameters include the number of epochs, batch size, sequence length, embedding dimension, RNN units, dropout rate, and whether to enable differential privacy. The configuration also controls generation settings such as temperature and maximum characters per line.
Key considerations:
- Set vocab_size to 0 for character tokenization, or a positive value (default 20000) for SentencePiece subword tokenization
- Keep validation_split enabled (default True) so that early stopping can monitor validation loss and curb overfitting
- For differential privacy, set dp=True and tune dp_noise_multiplier, dp_l2_norm_clip, and dp_microbatches
- For field-delimited data, set field_delimiter to the separator character
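The considerations above can be sketched with a small stand-in dataclass. This is an illustration only, not the library's TensorFlowConfig: the class `SketchConfig`, the helper `tokenizer_kind`, and the field names `input_data_path` and `checkpoint_dir` are assumptions made for this example, and the DP values are placeholders rather than library defaults. Only vocab_size, validation_split, dp, dp_noise_multiplier, dp_l2_norm_clip, dp_microbatches, and field_delimiter are named in this workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the settings described in Step 1."""

    input_data_path: str                    # assumed name: line-delimited input text
    checkpoint_dir: str                     # assumed name: where model files are stored
    vocab_size: int = 20000                 # 0 selects character tokenization
    validation_split: bool = True           # lets early stopping watch validation loss
    field_delimiter: Optional[str] = None   # separator for multi-field records
    dp: bool = False                        # enable differential privacy
    dp_noise_multiplier: float = 1.0        # placeholder value, tune per dataset
    dp_l2_norm_clip: float = 1.0            # placeholder value, tune per dataset
    dp_microbatches: int = 1                # placeholder value, tune per dataset

    def __post_init__(self):
        if self.vocab_size < 0:
            raise ValueError("vocab_size must be 0 or a positive integer")


def tokenizer_kind(config: SketchConfig) -> str:
    """Apply the vocab_size rule: 0 means characters, otherwise subwords."""
    return "character" if config.vocab_size == 0 else "sentencepiece"


# A DP run on comma-delimited records, using character tokenization.
dp_config = SketchConfig(
    "data.txt", "checkpoints", vocab_size=0, dp=True, field_delimiter=","
)
print(tokenizer_kind(dp_config))  # character
```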
Step 2: Tokenizer Training
Train a tokenizer on the input text to build a vocabulary that maps text to integer IDs. If vocab_size is 0, a character-level tokenizer is created that maps each unique character to an ID. Otherwise, a SentencePiece tokenizer learns subword units from the raw text (SentencePiece supports byte-pair encoding and unigram language models). The tokenizer also annotates the training data by converting each line into a sequence of token IDs stored in the training data file.
Key considerations:
- SentencePiece is recommended for non-DP mode because it provides faster training and better accuracy than character tokenization
- Character tokenization is recommended for DP mode to avoid memorizing sensitive tokens
- The tokenizer handles special delimiter tokens if field_delimiter is configured
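As a minimal sketch of the character-level case, the following builds a vocabulary that maps each unique character to an integer ID and round-trips a record through it. The function names `build_char_vocab`, `encode`, and `decode` are hypothetical and do not reflect the library's tokenizer classes. The sample records are made up for illustration.

```python
def build_char_vocab(lines):
    """Map each unique character in the training lines to an integer ID."""
    chars = sorted({ch for line in lines for ch in line})
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def encode(text, char_to_id):
    """Convert text to a list of token IDs (raises KeyError on unseen characters)."""
    return [char_to_id[ch] for ch in text]


def decode(ids, id_to_char):
    """Convert a list of token IDs back to text."""
    return "".join(id_to_char[i] for i in ids)


# Two made-up records, one per line, with a comma as the field delimiter.
training_lines = ["alice,42\n", "bob,37\n"]
char_to_id, id_to_char = build_char_vocab(training_lines)

ids = encode("bob,42\n", char_to_id)
assert decode(ids, id_to_char) == "bob,42\n"
```

The vocabulary here is as small as the set of distinct characters, which is one reason a character tokenizer cannot store a whole sensitive token as a single vocabulary entry.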
Step 3: Model Building
Construct the LSTM neural network architecture. The model consists of an embedding layer, one or more LSTM layers with configurable units and dropout, and a dense output layer over the vocabulary. If differential privacy is enabled, a DP-SGD variant of the model is built that clips per-example gradients and adds calibrated noise.
Key considerations:
- The standard model uses the Adam optimizer; the DP model uses a DP-SGD optimizer
- The model architecture is: Embedding → LSTM → Dropout → Dense
- GPU availability is checked and a warning issued if only CPU is available
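The clip-and-noise step behind DP-SGD can be sketched in plain Python. This illustrates the general technique (clip each per-example gradient to an L2 bound, sum, add Gaussian noise scaled by the bound, then average). It is not the optimizer used by the library, and the function names are hypothetical. The arguments `l2_norm_clip` and `noise_multiplier` are meant to play the roles of dp_l2_norm_clip and dp_noise_multiplier.

```python
import math
import random


def clip_gradient(grad, l2_norm_clip):
    """Scale a gradient vector down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, l2_norm_clip)):
            summed[i] += g
    # Noise standard deviation is proportional to the clipping bound.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5

# With the noise switched off, only the clipping effect is visible:
# the first gradient shrinks to norm 1.0, the second is left unchanged.
clipped_only = dp_average_gradient(grads, l2_norm_clip=1.0, noise_multiplier=0.0, rng=rng)

# With noise enabled, each coordinate is perturbed before averaging.
noisy = dp_average_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng)
```

A smaller clipping bound limits how much any single record can move the weights, and a larger noise multiplier strengthens the privacy guarantee at the cost of accuracy.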
Step 4: Model Training
Train the LSTM model on the tokenized text data for the configured number of epochs. The training loop processes sequences of token IDs, predicting the next token at each position. Early stopping monitors validation loss (or training loss when no validation split is configured) and halts training once the loss has failed to improve for a configurable number of epochs. Model checkpoints are saved, with the best model tracked by default.
Key considerations:
- Early stopping patience defaults to 5 epochs with a minimum delta of 0.001
- An optional epoch_callback receives an EpochState with loss, accuracy, and DP epsilon/delta values
- The optional max_training_time_seconds setting imposes a wall-clock limit on training
- Model parameters are serialized to JSON in the checkpoint directory before training begins
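The patience and minimum-delta rule can be sketched as follows. This is the usual early-stopping logic written out by hand, not the library's training loop. The function name `early_stopping_epoch` is hypothetical, and the loss values are made up for illustration.

```python
def early_stopping_epoch(losses, patience=5, min_delta=0.001):
    """Return the 0-based epoch at which training would halt, or None.

    An epoch counts as an improvement only if the monitored loss drops
    below the best value seen so far by more than min_delta.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# The loss plateaus just below 0.8: changes smaller than min_delta do not
# reset the counter, so training halts after five stagnant epochs.
plateau = [1.0, 0.8, 0.7995, 0.7993, 0.7992, 0.7991, 0.7990, 0.5]
print(early_stopping_epoch(plateau))  # 6
```

Note that the final value of 0.5 is never reached, which is the trade-off of a short patience: a late improvement can be missed.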
Step 5: Synthetic Text Generation
Load the trained model and tokenizer from the checkpoint directory and generate synthetic text records. The generator predicts tokens one at a time, sampling from the output distribution scaled by a temperature parameter. Generation continues until a newline token is produced or the maximum character limit is reached. An optional line_validator callback filters generated records, and generation stops after the requested number of valid lines is produced.
Key considerations:
- Temperature controls randomness: lower values produce more predictable text, higher values more diverse
- Parallel generation distributes work across multiple CPU workers using loky process pools
- A start_string can seed generation to influence the output prefix
- The generator yields GenText objects with valid/invalid status and explanation fields
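The sampling loop described in this step can be sketched with a toy stand-in for the trained model. Everything here is hypothetical: `toy_next_logits` replaces the LSTM with a hand-written scoring rule, and `generate_line` and `generate_valid_lines` are illustrative names, not library functions. The sketch shows temperature scaling, stopping at a newline or character limit, an optional start string, and filtering with a line validator.

```python
import math
import random

VOCAB = ["0", "1", "2", "\n"]


def toy_next_logits(text):
    """Stand-in for the model: favors '1' and ends lines as they grow."""
    newline_logit = -1e9 if not text else float(len(text)) - 3.0
    return [0.5, 1.5, 0.0, newline_logit]


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate_line(next_logits, vocab, rng, temperature=1.0, max_chars=20, start_string=""):
    """Sample one token at a time until a newline or the character limit."""
    text = start_string
    while len(text) < max_chars:
        probs = softmax_with_temperature(next_logits(text), temperature)
        token = rng.choices(vocab, weights=probs, k=1)[0]
        if token == "\n":
            break
        text += token
    return text


def generate_valid_lines(num_lines, line_validator, rng, max_attempts=1000, **kwargs):
    """Keep sampling until enough lines pass the validator."""
    valid = []
    attempts = 0
    while len(valid) < num_lines and attempts < max_attempts:
        attempts += 1
        line = generate_line(toy_next_logits, VOCAB, rng, **kwargs)
        if line_validator(line):
            valid.append(line)
    return valid


rng = random.Random(7)
lines = generate_valid_lines(
    3, lambda line: line.startswith("1"), rng, temperature=0.8, max_chars=10
)
```

The attempt cap is a design choice for the sketch: a validator that rejects most samples would otherwise keep the loop running indefinitely.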