Workflow:Sdv dev SDV Single table synthesis

Knowledge Sources	SDV, SDV Documentation, The Synthetic Data Vault
Domains	Synthetic_Data, Single_Table, Machine_Learning
Last Updated	2026-02-14 19:00 GMT

Overview

End-to-end process for generating synthetic data from a single table using the SDV library, from loading demo data and defining metadata through model fitting, sampling, and quality evaluation.

Description

This workflow covers the standard procedure for creating synthetic copies of single-table datasets. The SDV provides four single-table synthesizers: GaussianCopulaSynthesizer (classical statistical copula model), CTGANSynthesizer (GAN-based deep learning), TVAESynthesizer (variational autoencoder), and CopulaGANSynthesizer (hybrid copula + GAN). The process begins by loading or defining table metadata that describes column types, primary keys, and anonymization settings. A synthesizer is then initialized, fitted on real data, and used to generate synthetic rows that preserve the statistical properties of the real data; columns marked for anonymization receive newly generated fake values rather than real ones. Quality evaluation then measures how closely the synthetic data reproduces the real data distributions, and the trained synthesizer can be saved for later reuse.

Usage

Execute this workflow when you have a single tabular dataset (CSV, DataFrame) and need to generate synthetic rows that preserve column distributions, inter-column correlations, and key uniqueness. Typical triggers include privacy-preserving data sharing, augmenting small datasets for testing, or creating realistic demo data without exposing sensitive records.

Execution Steps

Step 1: Load or prepare data

Obtain the real dataset as a pandas DataFrame. The SDV provides a demo dataset downloader that fetches example datasets from S3 (such as fake_hotel_guests) along with pre-built metadata. Alternatively, load your own CSV or DataFrame.

Key considerations:

The demo downloader returns both the data and its metadata object
Supported modalities for demos are single_table, multi_table, and sequential
For custom data, ensure the DataFrame is clean and has consistent column types
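
As a rough sketch of this step, the snippet below wraps the demo downloader and a plain CSV loader. It also adds a hypothetical helper, find_mixed_type_columns, which flags columns that hold more than one Python type. That helper is only one possible reading of "consistent column types" and is not an SDV feature.

```python
import pandas as pd


def load_demo_table(dataset_name="fake_hotel_guests"):
    """Download an SDV demo dataset; returns (DataFrame, metadata)."""
    # Imported here so the pandas-only helpers below work without SDV.
    from sdv.datasets.demo import download_demo

    return download_demo(modality="single_table", dataset_name=dataset_name)


def load_csv_table(path):
    """Load your own table from a CSV file into a DataFrame."""
    return pd.read_csv(path)


def find_mixed_type_columns(df):
    """Return names of columns whose non-null values have more than one type.

    Hypothetical helper (not part of SDV).
    """
    mixed = []
    for name in df.columns:
        kinds = {type(value).__name__ for value in df[name].dropna()}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed
```

Running find_mixed_type_columns on a freshly loaded CSV before detecting metadata is a cheap way to spot columns that mix, say, numbers and free text.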

Step 2: Define or detect metadata

Create a Metadata object that describes the table schema: column semantic data types (sdtypes), primary keys, and any anonymization requirements. Metadata can be auto-detected from a DataFrame or manually defined. The unified Metadata class wraps both single-table and multi-table schemas.

Key considerations:

Auto-detection infers sdtypes (numerical, categorical, datetime, boolean, id) from data
Primary keys are auto-detected as unique ID columns
PII columns (emails, addresses, names) should be marked with appropriate sdtypes for anonymization
Always validate metadata before fitting to catch schema errors early
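
The two routes described above (manual definition and auto-detection) might look like the following sketch. The table name guests and every column name are hypothetical, and the dictionary layout is an assumption based on the JSON format the SDV documentation describes for Metadata; check it against your installed version.

```python
def build_guest_metadata_dict():
    """Describe a hypothetical single table named 'guests' as a plain dict."""
    return {
        "METADATA_SPEC_VERSION": "V1",  # assumed spec version string
        "tables": {
            "guests": {
                "primary_key": "guest_id",
                "columns": {
                    "guest_id": {"sdtype": "id"},
                    # PII column: anonymized rather than learned verbatim.
                    "guest_email": {"sdtype": "email", "pii": True},
                    "room_type": {"sdtype": "categorical"},
                    "room_rate": {"sdtype": "numerical"},
                    "checkin_date": {
                        "sdtype": "datetime",
                        "datetime_format": "%Y-%m-%d",
                    },
                    "has_rewards": {"sdtype": "boolean"},
                },
            }
        },
        "relationships": [],
    }


def load_manual_metadata():
    """Build a Metadata object from the dict above and validate it."""
    from sdv.metadata import Metadata  # requires SDV to be installed

    metadata = Metadata.load_from_dict(build_guest_metadata_dict())
    metadata.validate()  # raises if the schema is inconsistent
    return metadata


def detect_metadata(real_data, table_name="guests"):
    """Auto-detect metadata from a DataFrame, then validate it."""
    from sdv.metadata import Metadata  # requires SDV to be installed

    metadata = Metadata.detect_from_dataframe(data=real_data, table_name=table_name)
    metadata.validate()
    return metadata
```

Auto-detected metadata is worth reviewing by hand, since a detector cannot know which columns hold sensitive values.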

Step 3: Initialize synthesizer

Choose and instantiate a synthesizer class with the metadata object. Each synthesizer type has different trade-offs: GaussianCopula is fast and interpretable, CTGAN handles complex distributions, TVAE offers stable training, and CopulaGAN combines copula-based modeling with GAN training.

Key considerations:

All four single-table synthesizers accept enforce_min_max_values and enforce_rounding parameters
The locales parameter controls the locale of anonymized values (defaults to ['en_US'])
GaussianCopula allows specifying distribution families per column
CTGAN and TVAE accept epochs, batch_size, and cuda parameters for training control
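
A minimal sketch of synthesizer selection follows. The make_synthesizer factory and its short kind names are hypothetical conveniences; the class names and keyword arguments are the ones listed above, and the specific values (100 epochs, the norm distribution) are arbitrary.

```python
SYNTHESIZER_CLASSES = {
    "gaussian_copula": "GaussianCopulaSynthesizer",
    "ctgan": "CTGANSynthesizer",
    "tvae": "TVAESynthesizer",
    "copula_gan": "CopulaGANSynthesizer",
}


def make_synthesizer(metadata, kind="gaussian_copula", **options):
    """Instantiate one of the four single-table synthesizers by short name."""
    if kind not in SYNTHESIZER_CLASSES:
        raise ValueError(f"Unknown synthesizer kind: {kind!r}")
    import sdv.single_table as single_table  # requires SDV to be installed

    synthesizer_class = getattr(single_table, SYNTHESIZER_CLASSES[kind])
    return synthesizer_class(metadata, **options)


def example_configs(metadata):
    """Two illustrative configurations; the values are arbitrary."""
    fast = make_synthesizer(
        metadata,
        "gaussian_copula",
        enforce_min_max_values=True,
        enforce_rounding=True,
        locales=["en_US"],
        default_distribution="norm",
    )
    deep = make_synthesizer(metadata, "ctgan", epochs=100, batch_size=500, cuda=False)
    return fast, deep
```

Starting with the GaussianCopula configuration and moving to a neural synthesizer only if quality scores fall short keeps iteration time low.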

Step 4: Fit synthesizer on real data

Call the fit method with the real DataFrame. Internally, this triggers data preprocessing (type conversion, constraint handling, anonymization via HyperTransformer from the RDT library), followed by model training on the transformed data.

Key considerations:

Fitting preprocesses data through a DataProcessor pipeline
Numerical bounds and rounding are learned during fitting so they can be enforced at sampling time; categorical columns are encoded
The model learns joint distributions of all columns
Fitting logs events (table name, row count, duration) to the SDV structured logger
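
The two phases described above can be run separately. SDV synthesizers expose preprocess and fit_processed_data methods which, as far as the SDV documentation describes, fit runs back to back. The wrapper name fit_in_two_phases is hypothetical.

```python
def fit_in_two_phases(synthesizer, real_data):
    """Run preprocessing and model training as two explicit steps.

    Returns the processed (transformed) data so it can be inspected.
    """
    # Phase 1: type conversion, constraint handling, anonymization setup.
    processed_data = synthesizer.preprocess(real_data)
    # Phase 2: train the underlying model on the transformed table.
    synthesizer.fit_processed_data(processed_data)
    return processed_data
```

For everyday use, synthesizer.fit(real_data) is the simpler call; splitting the phases is mainly useful for inspecting what the model actually sees.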

Step 5: Sample synthetic data

Generate new rows by calling the sample method with the desired number of rows. The synthesizer generates data from the learned model and applies reverse transformations to restore original data types and formats.

Key considerations:

The num_rows parameter controls output size
Sampled primary keys are guaranteed unique
Anonymized columns produce new realistic fake values
Numerical columns respect the min/max bounds and rounding of the original data when enforce_min_max_values and enforce_rounding are enabled
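
A short sketch of sampling, plus a sanity check on two of the properties listed above, is given here. The check_sample helper is hypothetical and uses only pandas; it is not part of SDV.

```python
import pandas as pd


def sample_rows(synthesizer, num_rows):
    """Generate num_rows synthetic rows from a fitted synthesizer."""
    return synthesizer.sample(num_rows=num_rows)


def check_sample(real_data, synthetic_data, primary_key, numerical_columns):
    """Return a list of human-readable problems found in the sample.

    Hypothetical helper: checks key uniqueness and that numerical values
    stay within the min/max range observed in the real data.
    """
    problems = []
    if synthetic_data[primary_key].duplicated().any():
        problems.append(f"duplicate values in primary key {primary_key!r}")
    for column in numerical_columns:
        low, high = real_data[column].min(), real_data[column].max()
        values = synthetic_data[column].dropna()
        if (~values.between(low, high)).any():
            problems.append(f"{column!r} has values outside [{low}, {high}]")
    return problems
```

An empty list from check_sample means neither problem was found; it says nothing about overall statistical quality, which the evaluation step covers.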

Step 6: Evaluate synthetic data quality

Compare synthetic data against real data using the SDV evaluation module. Generate a quality report that scores column shape similarity and column pair trend preservation. Optionally run a diagnostic report and create visualizations.

Key considerations:

Quality reports give an overall score from 0 to 100%, broken down into Column Shapes and Column Pair Trends
Diagnostic reports check for data validity (correct ranges, categories)
Column plots and column pair plots provide visual comparisons
Evaluation delegates to the sdmetrics library
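
The sketch below calls the evaluation functions and, to make the scoring less opaque, reimplements one per-column metric by hand. SDMetrics documents TVComplement (one minus the total variation distance between category frequencies) as the Column Shapes metric for categorical columns; tv_complement here is an illustrative reimplementation, not the library's code, and evaluate is a hypothetical wrapper.

```python
import pandas as pd


def tv_complement(real_column, synthetic_column):
    """Score in [0, 1]: 1 minus half the summed frequency differences."""
    real_freq = pd.Series(real_column).value_counts(normalize=True)
    synth_freq = pd.Series(synthetic_column).value_counts(normalize=True)
    categories = real_freq.index.union(synth_freq.index)
    real_freq = real_freq.reindex(categories, fill_value=0.0)
    synth_freq = synth_freq.reindex(categories, fill_value=0.0)
    return 1.0 - 0.5 * float((real_freq - synth_freq).abs().sum())


def evaluate(real_data, synthetic_data, metadata):
    """Run the diagnostic and quality reports; return score and details."""
    # Requires SDV to be installed.
    from sdv.evaluation.single_table import evaluate_quality, run_diagnostic

    diagnostic = run_diagnostic(real_data, synthetic_data, metadata)
    quality = evaluate_quality(real_data, synthetic_data, metadata)
    overall = quality.get_score()  # fraction between 0 and 1
    shapes = quality.get_details(property_name="Column Shapes")
    return overall, shapes, diagnostic
```

Identical category frequencies give a tv_complement of 1.0, and completely disjoint categories give 0.0, which is how a per-column score maps onto the 0-100% scale.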

Step 7: Save and load synthesizer

Persist the trained synthesizer to disk for later reuse. The SDV uses cloudpickle for serialization and includes version compatibility checks on load.

Key considerations:

Save produces a pickle file containing the full model state
Load validates SDV version compatibility and warns about mismatches
Saved synthesizers can generate new samples without refitting
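
Persisting and restoring a synthesizer can be sketched as a round trip. The wrapper save_and_reload and the default file name are hypothetical; save and load are the synthesizer methods this step refers to, with load called on the synthesizer's class.

```python
def save_and_reload(synthesizer, filepath="my_synthesizer.pkl"):
    """Save a fitted synthesizer to disk and load it back.

    The reloaded object can sample without refitting.
    """
    synthesizer.save(filepath=filepath)
    # load is a class-level method, so call it on the synthesizer's class.
    return type(synthesizer).load(filepath=filepath)
```

Because the file is a pickle, load it only from sources you trust, and expect a warning if the installed SDV version differs from the one used to save it.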
