Workflow: SDV Sequential Data Synthesis
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Sequential_Data, Timeseries, Deep_Learning |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
End-to-end process for generating synthetic sequential (timeseries) data using the PARSynthesizer, preserving temporal patterns, sequence structures, and context column relationships.
Description
This workflow covers the generation of synthetic sequential data where rows are ordered in time and grouped by entity. The PARSynthesizer (Probabilistic AutoRegressive) uses a deep learning autoregressive model from the DeepEcho library to learn temporal dependencies within sequences. It separates columns into context columns (attributes that remain constant within a sequence, such as entity ID or category) and non-context columns (values that change over time). Context columns are modeled by a separate GaussianCopulaSynthesizer, and the sequential columns are generated conditioned on these context values.
Usage
Execute this workflow when you have timeseries or sequential data where rows are grouped by a sequence key (e.g., patient ID, device ID) and ordered by time, and you need to generate synthetic sequences that preserve temporal patterns, trends, and entity-level context. Common applications include IoT sensor data, patient health records over time, financial transaction sequences, and stock price histories.
Execution Steps
Step 1: Load sequential data
Obtain the real sequential dataset as a pandas DataFrame. The SDV demo downloader supports the sequential modality and can fetch example timeseries datasets. The data must contain a sequence key column that identifies which entity each row belongs to.
Key considerations:
- The DataFrame must be sorted by sequence key and time
- Each sequence (group of rows sharing the same key) represents one entity's timeline
- The SDV demo provides example sequential datasets with pre-built metadata
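The expected shape of the input can be sketched with a toy dataset. The column names (device_id, timestamp, temperature, location) are illustrative, not part of SDV; alternatively, sdv.datasets.demo.download_demo with modality='sequential' fetches example datasets with pre-built metadata.

```python
import pandas as pd

# Minimal sequential dataset: rows are grouped by a sequence key
# ('device_id') and ordered by a time column ('timestamp').
# 'location' stays constant within each sequence, making it a
# candidate context column.
data = pd.DataFrame({
    'device_id': ['d1'] * 4 + ['d2'] * 4,
    'timestamp': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'] * 2),
    'temperature': [20.1, 20.5, 21.0, 20.8, 18.9, 19.2, 19.7, 19.5],
    'location': ['north'] * 4 + ['south'] * 4,
})

print(data.groupby('device_id').size())  # two sequences of length 4
```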
Step 2: Define metadata with sequence key
Create a Metadata object that describes the table schema including the sequence key column. The sequence key identifies which rows belong to the same sequence (entity). Additionally, identify context columns that remain constant within each sequence.
Key considerations:
- The sequence key column must be marked in the metadata as an ID column
- Context columns are values that do not change within a single sequence
- Non-context columns are the time-varying values the model will learn to generate autoregressively
- Validate that every sequence has consistent context column values
Step 3: Initialize PARSynthesizer
Instantiate PARSynthesizer with the metadata, the sequence key, and optionally the list of context columns. Configure training parameters such as epochs, segment size, sample size, and CUDA usage.
Key considerations:
- context_columns specifies which columns are constant per sequence
- epochs controls training duration (default 128)
- segment_size can split long sequences into shorter training segments
- sample_size controls how many candidate samples are generated before selecting the best one
- cuda enables GPU acceleration if available
- Requires the DeepEcho library to be installed
Step 4: Fit on sequential data
Call fit with the DataFrame. Internally, the PARSynthesizer separates context and non-context columns, fits a GaussianCopulaSynthesizer on the context data, and trains the DeepEcho PARModel on the sequential data. Numerical columns are differenced and formatted before training.
Key considerations:
- Context columns are extracted per unique sequence key and modeled independently
- Non-context columns are assembled into sequences and fed to the autoregressive model
- The model learns conditional distributions for each time step given previous steps
- Training loss values can be retrieved after fitting for monitoring
Step 5: Sample synthetic sequences
Generate new synthetic sequences by calling sample with the desired number of sequences. The sampler first generates context values from the context synthesizer, then produces sequential data conditioned on these contexts.
Key considerations:
- num_sequences controls how many entity timelines to generate
- sequence_length can fix the length of generated sequences
- Each synthetic sequence gets a new unique sequence key
- Context column values in each sequence are internally consistent
- Conditional sampling is supported to fix specific context values
Step 6: Evaluate sequential data quality
Assess the quality of synthetic sequential data using the single-table evaluation functions applied to the flattened output. Compare temporal distributions, context value distributions, and sequence length characteristics.
Key considerations:
- Standard quality reports can evaluate column-level distributions
- Sequence-specific quality (temporal autocorrelation, trend preservation) may require custom analysis
- Compare the distribution of sequence lengths between real and synthetic data