Workflow:Sdv dev SDV Single table synthesis
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Single_Table, Machine_Learning |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
End-to-end process for generating synthetic data from a single table using the SDV library, from loading demo data and defining metadata through model fitting, sampling, and quality evaluation.
Description
This workflow covers the standard procedure for creating synthetic copies of single-table datasets. The SDV provides four synthesizer options: GaussianCopulaSynthesizer (classical statistical copula model), CTGANSynthesizer (GAN-based deep learning), TVAESynthesizer (variational autoencoder), and CopulaGANSynthesizer (hybrid copula + GAN). The process begins by loading or defining table metadata that describes column types, primary keys, and anonymization settings. A synthesizer is then initialized with that metadata, fitted on the real data, and used to generate synthetic rows that preserve the statistical properties of the real data while replacing values in columns marked as sensitive with newly generated ones. The workflow concludes with a quality evaluation that checks how closely the synthetic data reproduces the real data distributions.
Usage
Execute this workflow when you have a single tabular dataset (CSV, DataFrame) and need to generate synthetic rows that preserve column distributions, inter-column correlations, and key uniqueness. Typical triggers include privacy-preserving data sharing, augmenting small datasets for testing, or creating realistic demo data without exposing sensitive records.
Execution Steps
Step 1: Load or prepare data
Obtain the real dataset as a pandas DataFrame. The SDV provides a demo dataset downloader that fetches example datasets from S3 (such as fake_hotel_guests) along with pre-built metadata. Alternatively, load your own CSV or DataFrame.
Key considerations:
- The demo downloader returns both the data and its metadata object
- Supported modalities for demos are single_table, multi_table, and sequential
- For custom data, ensure the DataFrame is clean and has consistent column types
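The demo path can be sketched as follows. This is a minimal illustration, assuming the sdv package is installed and network access is available; load_demo and DEMO_MODALITIES are hypothetical names introduced here, while download_demo is the SDV function that returns the data together with its metadata.

```python
# Demo modalities listed in the considerations above.
DEMO_MODALITIES = ("single_table", "multi_table", "sequential")


def load_demo(dataset_name="fake_hotel_guests", modality="single_table"):
    """Download an SDV demo dataset and return (data, metadata).

    load_demo is a hypothetical wrapper; download_demo is the SDV call.
    The download needs network access.
    """
    if modality not in DEMO_MODALITIES:
        raise ValueError(
            f"modality must be one of {DEMO_MODALITIES}, got {modality!r}"
        )

    # Imported inside the function so the check above runs even without SDV.
    from sdv.datasets.demo import download_demo

    return download_demo(modality=modality, dataset_name=dataset_name)
```

Calling real_data, metadata = load_demo() would fetch the hotel guests example. For custom data, pandas.read_csv on a local file yields the DataFrame, and the metadata is then defined in the next step.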
Step 2: Define or detect metadata
Create a Metadata object that describes the table schema: column semantic data types (sdtypes), primary keys, and any anonymization requirements. Metadata can be auto-detected from a DataFrame or manually defined. The unified Metadata class wraps both single-table and multi-table schemas.
Key considerations:
- Auto-detection infers sdtypes (numerical, categorical, datetime, boolean, id) from data
- Auto-detection picks a primary key from columns that look like unique IDs, so check that it chose the intended column
- PII columns (emails, addresses, names) should be marked with appropriate sdtypes for anonymization
- Always validate metadata before fitting to catch schema errors early
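One way to combine detection, PII marking, and validation is sketched below. It assumes a recent SDV release that ships the unified Metadata class; build_metadata, the pii_sdtypes argument, and the default table name are hypothetical and exist only for illustration.

```python
def build_metadata(data, table_name="guests", pii_sdtypes=None):
    """Detect metadata from a DataFrame, apply PII sdtypes, and validate it.

    pii_sdtypes maps column names to sdtypes such as "email" or "address".
    The helper and the default table name are illustrative, not part of SDV.
    """
    pii_sdtypes = pii_sdtypes or {}
    missing = [name for name in pii_sdtypes if name not in data.columns]
    if missing:
        raise KeyError(f"columns not found in data: {missing}")

    # Imported lazily so the column check above does not need SDV.
    from sdv.metadata import Metadata

    metadata = Metadata.detect_from_dataframe(data=data, table_name=table_name)
    for column_name, sdtype in pii_sdtypes.items():
        metadata.update_column(
            column_name=column_name, sdtype=sdtype, table_name=table_name
        )

    metadata.validate()  # raises if the schema is inconsistent
    return metadata
```

The column check runs before detection so that a misspelled PII column name fails immediately instead of surfacing later as a metadata error.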
Step 3: Initialize synthesizer
Choose and instantiate a synthesizer class with the metadata object. Each synthesizer type has different trade-offs: GaussianCopula is fast and interpretable, CTGAN handles complex distributions, TVAE offers stable training, and CopulaGAN combines the copula and GAN approaches.
Key considerations:
- All synthesizers accept enforce_min_max_values and enforce_rounding parameters
- The locales parameter controls anonymized data locale (defaults to en_US)
- GaussianCopula allows specifying distribution families per column
- CTGAN and TVAE accept epochs, batch_size, and cuda parameters for training control
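Choosing among the four classes can be sketched with a small factory. The short labels, SYNTHESIZER_CLASSES, and make_synthesizer are hypothetical; the class names and the enforce_min_max_values and enforce_rounding parameters are the ones named above. Extra keyword arguments are passed through unchanged, so they must be valid for the chosen class.

```python
# Maps a short, hypothetical label to the SDV class name described above.
SYNTHESIZER_CLASSES = {
    "gaussian_copula": "GaussianCopulaSynthesizer",
    "ctgan": "CTGANSynthesizer",
    "tvae": "TVAESynthesizer",
    "copula_gan": "CopulaGANSynthesizer",
}


def make_synthesizer(metadata, kind="gaussian_copula", **options):
    """Instantiate one of the single-table synthesizers with shared defaults."""
    if kind not in SYNTHESIZER_CLASSES:
        raise ValueError(
            f"kind must be one of {sorted(SYNTHESIZER_CLASSES)}, got {kind!r}"
        )

    # Imported lazily so the label check above does not need SDV.
    import sdv.single_table as single_table

    # Caller-supplied options override the shared defaults.
    settings = {
        "enforce_min_max_values": True,
        "enforce_rounding": True,
        **options,
    }
    synthesizer_class = getattr(single_table, SYNTHESIZER_CLASSES[kind])
    return synthesizer_class(metadata, **settings)
```

For example, make_synthesizer(metadata, "ctgan", epochs=100) would pass the epochs setting through to CTGANSynthesizer, while the default call builds a GaussianCopulaSynthesizer.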
Step 4: Fit synthesizer on real data
Call the fit method with the real DataFrame. Internally, this triggers data preprocessing (type conversion, constraint handling, anonymization via HyperTransformer from the RDT library), followed by model training on the transformed data.
Key considerations:
- Fitting preprocesses data through a DataProcessor pipeline
- For numerical columns, the min/max bounds and rounding precision are learned so they can be enforced at sampling time; categorical columns are encoded
- The model learns joint distributions of all columns
- Fitting logs events (table name, row count, duration) to the SDV structured logger
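The preprocess-then-train sequence can be made explicit. The sketch below assumes the synthesizer exposes preprocess and fit_processed_data methods, as the SDV single-table synthesizers do; fit_in_two_stages is a hypothetical helper, and the timing is measured locally and is not the value SDV writes to its own logger.

```python
import time


def fit_in_two_stages(synthesizer, real_data):
    """Preprocess, then train, and return (processed_data, seconds_elapsed)."""
    start = time.perf_counter()
    # Stage 1: convert the real data into the model-ready representation.
    processed_data = synthesizer.preprocess(real_data)
    # Stage 2: train the underlying model on the transformed data.
    synthesizer.fit_processed_data(processed_data)
    return processed_data, time.perf_counter() - start
```

In most cases a single synthesizer.fit(real_data) call is enough. Splitting the stages is mainly useful for inspecting the transformed data before training.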
Step 5: Sample synthetic data
Generate new rows by calling the sample method with the desired number of rows. The synthesizer generates data from the learned model and applies reverse transformations to restore original data types and formats.
Key considerations:
- The num_rows parameter controls output size
- Sampled primary keys are guaranteed unique
- Anonymized columns produce new realistic fake values
- Numerical columns respect min/max bounds and rounding from the original data
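Sampling with a defensive uniqueness check might look like the sketch below. It assumes a fitted synthesizer whose sample method takes num_rows; sample_rows and its primary_key argument are hypothetical. The uniqueness check is only a sanity test on the returned table and does not replace the guarantee described above.

```python
def sample_rows(synthesizer, num_rows, primary_key=None):
    """Sample num_rows synthetic rows and optionally verify key uniqueness."""
    if num_rows <= 0:
        raise ValueError("num_rows must be a positive integer")

    synthetic_data = synthesizer.sample(num_rows=num_rows)

    if primary_key is not None:
        # Works for a DataFrame column or any other sequence of key values.
        keys = list(synthetic_data[primary_key])
        if len(set(keys)) != len(keys):
            raise RuntimeError(f"duplicate values in primary key {primary_key!r}")

    return synthetic_data
```

A call such as sample_rows(synthesizer, 500, primary_key="guest_email") would return 500 rows, where the column name is whatever the metadata declares as the primary key.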
Step 6: Evaluate synthetic data quality
Compare synthetic data against real data using the SDV evaluation module. Generate a quality report that scores column shape similarity and column pair trend preservation. Optionally run a diagnostic report and create visualizations.
Key considerations:
- Quality reports score on a 0-100% scale across Column Shapes and Column Pair Trends
- Diagnostic reports check for data validity (correct ranges, categories)
- Column plots and column pair plots provide visual comparisons
- Evaluation delegates to the sdmetrics library
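Running both reports and collecting their scores can be sketched as follows. It assumes evaluate_quality and run_diagnostic from sdv.evaluation.single_table and that each report's get_score method returns a fraction between 0 and 1; evaluate_synthetic and format_score are hypothetical helpers.

```python
def format_score(score):
    """Render a report score (a fraction between 0 and 1) as a percentage."""
    return f"{score:.2%}"


def evaluate_synthetic(real_data, synthetic_data, metadata):
    """Run the diagnostic and quality reports and summarize their scores."""
    # Imported lazily so format_score can be used without SDV installed.
    from sdv.evaluation.single_table import evaluate_quality, run_diagnostic

    diagnostic_report = run_diagnostic(real_data, synthetic_data, metadata)
    quality_report = evaluate_quality(real_data, synthetic_data, metadata)

    return {
        "diagnostic": format_score(diagnostic_report.get_score()),
        "quality": format_score(quality_report.get_score()),
        # Per-property breakdown (Column Shapes, Column Pair Trends).
        "properties": quality_report.get_properties(),
    }
```

Running the diagnostic first is a deliberate ordering choice in this sketch: a low validity score usually points to a metadata or preprocessing problem that should be fixed before the quality scores are interpreted.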
Step 7: Save and load synthesizer
Persist the trained synthesizer to disk for later reuse. The SDV uses cloudpickle for serialization and includes version compatibility checks on load.
Key considerations:
- Save produces a pickle file containing the full model state
- Load validates SDV version compatibility and warns about mismatches
- Saved synthesizers can generate new samples without refitting
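A save-and-reload round trip can be sketched as below. It assumes the synthesizer has a save method and that its class provides a load classmethod, both taking a filepath; save_and_reload and the default file name are hypothetical.

```python
def save_and_reload(synthesizer, filepath="synthesizer.pkl"):
    """Persist a fitted synthesizer and load it back from disk."""
    synthesizer.save(filepath=filepath)
    # load is called on the class, so the restored object has the same type.
    return type(synthesizer).load(filepath=filepath)
```

The restored object can then be sampled directly, for instance restored.sample(num_rows=100), without fitting again. Because the file is a pickle, it should only be loaded from a trusted source.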