Workflow:Sdv dev SDV Multi table synthesis

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Multi_Table, Relational_Data, Machine_Learning
Last Updated 2026-02-14 19:00 GMT

Overview

End-to-end process for generating synthetic relational data across multiple connected tables using the HMASynthesizer, preserving referential integrity and inter-table statistical relationships.

Description

This workflow covers the generation of synthetic data for multi-table relational datasets. The HMASynthesizer (Hierarchical Modeling Algorithm) models parent-child table relationships by augmenting parent tables with statistical summaries of their children, then fitting a single-table synthesizer (GaussianCopula by default) per table. During sampling, it generates parent rows first and reconstructs child rows while preserving foreign key relationships and cardinality distributions. For large or complex schemas, the workflow includes optional proof-of-concept utilities to simplify the schema and subsample data before fitting.

Usage

Execute this workflow when you have a relational database with multiple tables connected by primary and foreign keys and need to generate a synthetic copy that preserves both per-table distributions and cross-table referential integrity. Common triggers include testing database migrations, creating realistic development environments, or sharing multi-table datasets while protecting privacy.

Execution Steps

Step 1: Load multi-table data

Obtain the real data as a dictionary mapping table names to pandas DataFrames. SDV's demo downloader (download_demo) fetches example relational datasets together with pre-built metadata when called with the multi_table modality.

Key considerations:

  • Data is represented as a dict of DataFrames keyed by table name
  • Each table must have consistent column types and valid key columns
  • Use the drop_unknown_references utility to remove child rows with orphan foreign keys before fitting
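
To illustrate what cleaning orphan foreign keys means, the sketch below uses plain lists of dicts in place of DataFrames. The table names (hotels, guests), the column names, and the helper drop_orphans are hypothetical; the code only mirrors the idea behind drop_unknown_references and is not SDV's implementation.

```python
def drop_orphans(parent_rows, child_rows, parent_key, foreign_key):
    """Keep only child rows whose foreign key matches an existing parent key."""
    valid_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] in valid_keys]


# Hypothetical data: a dict of tables keyed by table name.
data = {
    "hotels": [{"hotel_id": 1}, {"hotel_id": 2}],
    "guests": [
        {"guest_id": 10, "hotel_id": 1},
        {"guest_id": 11, "hotel_id": 2},
        {"guest_id": 12, "hotel_id": 99},  # orphan: no hotel has id 99
    ],
}

data["guests"] = drop_orphans(
    data["hotels"], data["guests"], parent_key="hotel_id", foreign_key="hotel_id"
)
remaining_guest_ids = [guest["guest_id"] for guest in data["guests"]]
print(remaining_guest_ids)
```

Dropping the orphan rows up front keeps every remaining child row attached to a real parent, which is the precondition the later fitting step relies on.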

Step 2: Define or detect multi-table metadata

Create a Metadata object that describes all tables, their columns, primary keys, and the relationships (foreign keys) between them. Metadata can be auto-detected from the data dictionary or loaded from a JSON file. The unified Metadata class handles both single-table and multi-table schemas.

Key considerations:

  • Auto-detection with detect_from_dataframes infers table schemas and relationships
  • Relationships are defined as parent-child pairs with foreign key mappings
  • Primary keys must be unique within each table; foreign keys must reference valid parent keys
  • Validate metadata and data together to catch referential integrity issues early
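
The key rules above can be sketched as a small check in plain Python. The metadata dict below is loosely modeled on the general shape of a multi-table metadata JSON (tables with columns and a primary key, plus a list of parent-child relationships); the table names, the sample rows, and the helper find_key_problems are assumptions made for illustration, not part of SDV.

```python
from collections import Counter

# Hypothetical schema: two tables and one parent-child relationship.
metadata = {
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {"hotel_id": {"sdtype": "id"},
                        "city": {"sdtype": "categorical"}},
        },
        "guests": {
            "primary_key": "guest_id",
            "columns": {"guest_id": {"sdtype": "id"},
                        "hotel_id": {"sdtype": "id"}},
        },
    },
    "relationships": [
        {"parent_table_name": "hotels", "parent_primary_key": "hotel_id",
         "child_table_name": "guests", "child_foreign_key": "hotel_id"},
    ],
}


def find_key_problems(data, metadata):
    """Report duplicate primary keys and foreign keys with no matching parent."""
    problems = []
    for name, table in metadata["tables"].items():
        counts = Counter(row[table["primary_key"]] for row in data[name])
        duplicates = sorted(key for key, n in counts.items() if n > 1)
        if duplicates:
            problems.append(f"{name}: duplicate primary keys {duplicates}")
    for rel in metadata["relationships"]:
        parent_rows = data[rel["parent_table_name"]]
        child_rows = data[rel["child_table_name"]]
        parent_keys = {row[rel["parent_primary_key"]] for row in parent_rows}
        child_keys = {row[rel["child_foreign_key"]] for row in child_rows}
        unknown = sorted(child_keys - parent_keys)
        if unknown:
            problems.append(
                f"{rel['child_table_name']}: unknown foreign keys {unknown}"
            )
    return problems


# Deliberately broken sample data: one duplicated key, one orphan reference.
data = {
    "hotels": [{"hotel_id": 1, "city": "Boston"}, {"hotel_id": 1, "city": "Austin"}],
    "guests": [{"guest_id": 10, "hotel_id": 1}, {"guest_id": 11, "hotel_id": 7}],
}
problems = find_key_problems(data, metadata)
print(problems)
```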

Step 3: Simplify schema (optional)

For complex schemas with many tables or deeply nested hierarchies, use the proof-of-concept utilities to reduce schema complexity. The simplify_schema function removes distant tables and excess columns to stay within the HMA column limit.

Key considerations:

  • HMA has a maximum column limit (1000) for the augmented parent table representation
  • simplify_schema prunes grandchild table columns and distant relationships
  • get_random_subset subsamples data while preserving referential integrity
  • These are optional optimization steps for large schemas
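
Subsampling while preserving referential integrity can be pictured as follows. This is a simplified sketch under stated assumptions: it samples parent rows and keeps only the children that reference them, using lists of dicts instead of DataFrames. The helper referential_subset and all table and column names are hypothetical, and get_random_subset may use a different strategy internally.

```python
import random


def referential_subset(parents, children, parent_key, foreign_key,
                       num_parents, seed=0):
    """Sample parent rows, then keep only the children that reference them."""
    rng = random.Random(seed)
    kept_parents = rng.sample(parents, num_parents)
    kept_keys = {parent[parent_key] for parent in kept_parents}
    kept_children = [c for c in children if c[foreign_key] in kept_keys]
    return kept_parents, kept_children


# Hypothetical data: 10 parents with exactly 3 children each.
parents = [{"hotel_id": i} for i in range(10)]
children = [{"guest_id": i, "hotel_id": i % 10} for i in range(30)]

sub_parents, sub_children = referential_subset(
    parents, children, "hotel_id", "hotel_id", num_parents=3
)
sub_keys = {parent["hotel_id"] for parent in sub_parents}
print(len(sub_parents), len(sub_children))
```

Because children are filtered by the sampled parent keys, no foreign key in the subset can point at a row that was dropped.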

Step 4: Initialize HMASynthesizer

Instantiate the HMASynthesizer with the multi-table metadata. Optionally configure per-table synthesizer parameters and set the verbose flag to control progress display. Internally, HMA creates one GaussianCopulaSynthesizer per table.

Key considerations:

  • By default, each per-table synthesizer uses the beta distribution as its default_distribution
  • Custom per-table parameters can be set via set_table_parameters
  • The verbose flag controls whether progress bars are shown during fit and sample
  • The synthesizer validates that the metadata describes a valid relational schema
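
A minimal sketch of the initialization step, assuming SDV is installed and that metadata is an already validated multi-table Metadata object. The wrapper build_hma_synthesizer and the table name in the usage comment are hypothetical; only HMASynthesizer, its verbose flag, and set_table_parameters come from the text above.

```python
def build_hma_synthesizer(metadata, table_parameters=None, verbose=True):
    """Create an HMASynthesizer and apply optional per-table parameters.

    table_parameters maps a table name to a dict of synthesizer settings,
    for example {"guests": {"default_distribution": "truncnorm"}}.
    """
    # Imported lazily so this helper can be defined even without SDV installed.
    from sdv.multi_table import HMASynthesizer

    synthesizer = HMASynthesizer(metadata, verbose=verbose)
    for table_name, params in (table_parameters or {}).items():
        synthesizer.set_table_parameters(
            table_name=table_name, table_parameters=params
        )
    return synthesizer


# Hypothetical usage ("guests" is a placeholder table name):
# synthesizer = build_hma_synthesizer(
#     metadata, {"guests": {"default_distribution": "truncnorm"}}
# )
```

Per-table parameters are applied before fitting, so any override should be set at this point rather than after the synthesizer has been fitted.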

Step 5: Fit on multi-table data

Call fit with the data dictionary. Internally, HMA augments parent tables with statistical summaries of child table columns (distribution parameters, row counts, correlations), then fits individual GaussianCopulaSynthesizers per table on the augmented data.

Key considerations:

  • Table augmentation creates extension columns capturing child-row distributions per parent
  • Fitting proceeds table by table with progress reporting
  • The DataProcessor pipeline handles preprocessing, constraint application, and type conversion per table
  • Constraints can be applied at the multi-table level through the CAG system
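
The table augmentation idea can be sketched in plain Python: each parent row gains extension columns that summarize its child rows. This is an illustration only. The helper augment_parents, the column names, and the choice of count, mean, and standard deviation as summaries are assumptions; the extension columns HMA actually creates hold fitted model parameters rather than these simple statistics.

```python
from statistics import mean, pstdev


def augment_parents(parents, children, parent_key, foreign_key, value_column):
    """Add per-parent child count, mean, and std as extension columns."""
    grouped = {}
    for child in children:
        grouped.setdefault(child[foreign_key], []).append(child[value_column])

    augmented = []
    for parent in parents:
        values = grouped.get(parent[parent_key], [])
        extended = dict(parent)  # copy so the original row is untouched
        extended["num_children"] = len(values)
        extended[f"child_{value_column}_mean"] = mean(values) if values else None
        extended[f"child_{value_column}_std"] = pstdev(values) if values else None
        augmented.append(extended)
    return augmented


# Hypothetical data: hotel 1 has two guests, hotel 2 has none.
hotels = [{"hotel_id": 1}, {"hotel_id": 2}]
guests = [
    {"guest_id": 10, "hotel_id": 1, "amount": 100.0},
    {"guest_id": 11, "hotel_id": 1, "amount": 200.0},
]
augmented = augment_parents(hotels, guests, "hotel_id", "hotel_id", "amount")
print(augmented)
```

A single-table model fitted on the augmented parent table then learns how parent attributes relate to the shape of each parent's children, which is what allows child rows to be reconstructed at sampling time.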

Step 6: Sample synthetic relational data

Generate synthetic data by calling sample with a scale parameter. The hierarchical sampler generates parent table rows first, then reconstructs child rows while preserving cardinality and foreign key relationships.

Key considerations:

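  • The scale parameter multiplies all table sizes proportionally (scale=2.0 targets roughly twice the real row counts)
  • The number of rows (num_rows) sampled for each root table is derived from its real size multiplied by scale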
  • Foreign keys in child tables reference valid primary keys in parent tables
  • Cardinality distribution (number of children per parent) is preserved from real data
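
The parent-first sampling order can be sketched as below. This is a simplified stand-in under stated assumptions: it resamples child counts from the observed cardinalities, whereas HMA draws them from its fitted model. The helper sample_hierarchy and the column names are hypothetical.

```python
import random


def sample_hierarchy(real_child_counts, scale=1.0, seed=0):
    """Generate parents first, then children that reference those parents.

    real_child_counts holds the number of children observed for each real parent.
    """
    rng = random.Random(seed)
    num_parents = int(len(real_child_counts) * scale)
    parents, children = [], []
    for parent_id in range(num_parents):
        parents.append({"parent_id": parent_id})
        # Draw this parent's child count from the observed cardinalities.
        for _ in range(rng.choice(real_child_counts)):
            children.append({"child_id": len(children), "parent_id": parent_id})
    return parents, children


real_child_counts = [0, 1, 1, 2, 3]  # five real parents
parents, children = sample_hierarchy(real_child_counts, scale=2.0)
parent_ids = {parent["parent_id"] for parent in parents}
print(len(parents), len(children))
```

Since every child row is created for a parent that already exists, its foreign key is valid by construction, and the number of root rows follows the real table size multiplied by scale.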

Step 7: Evaluate multi-table quality

Assess synthetic data quality using multi-table evaluation functions. Quality reports score per-table column distributions and cross-table relationship fidelity. Cardinality plots visualize parent-child row count distributions.

Key considerations:

  • Multi-table quality reports aggregate per-table scores
  • Cardinality plots compare real vs synthetic child-per-parent distributions
  • Column plots and column pair plots work per-table within the multi-table context
  • Diagnostic reports check data validity across all tables
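
What a cardinality comparison measures can be sketched in plain Python. The helpers cardinality_distribution and total_variation are hypothetical, and total variation distance is a stand-in chosen for simplicity; it is not necessarily the metric used by SDV's quality report.

```python
from collections import Counter


def cardinality_distribution(parent_keys, child_foreign_keys):
    """Return {children_per_parent: fraction_of_parents}."""
    per_parent = Counter(child_foreign_keys)
    counts = Counter(per_parent[key] for key in parent_keys)
    total = len(parent_keys)
    return {k: n / total for k, n in sorted(counts.items())}


def total_variation(dist_a, dist_b):
    """Half the summed absolute difference between two discrete distributions."""
    support = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in support)


# Hypothetical keys: real parents have 2, 1, 1, and 0 children.
real = cardinality_distribution([1, 2, 3, 4], [1, 1, 2, 3])
# Synthetic parents all have exactly 1 child.
synthetic = cardinality_distribution([1, 2, 3, 4], [1, 2, 3, 4])

similarity = 1.0 - total_variation(real, synthetic)
print(real, synthetic, similarity)
```

A similarity near 1.0 would indicate that synthetic parents have about the same spread of child counts as real ones; here the synthetic data misses parents with zero or two children, so the score drops.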
