Implementation:Sdv dev SDV Simplify Schema

Knowledge Sources	SDV SDV Documentation
Domains	Data_Engineering, Synthetic_Data
Last Updated	2026-02-14 00:00 GMT

Overview

Concrete tools for simplifying multi-table schemas and subsampling data for proof-of-concept HMA synthesis, provided by the SDV library.

Description

The simplify_schema function reduces complex multi-table schemas by pruning distant tables and excess columns. The get_random_subset function subsamples rows while preserving referential integrity. Both are used for proof-of-concept workflows with HMASynthesizer.
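
The idea of pruning "distant" tables can be illustrated with a small, library-free sketch. It assumes that distance means the number of parent-to-child hops from a chosen root table and that anything beyond two hops is dropped; the cutoff, the choice of root, the helper name `tables_within_depth`, and the `bookings`, `payments`, and `regions` tables are all hypothetical and are not taken from SDV's source.

```python
from collections import deque


def tables_within_depth(relationships, root, max_depth=2):
    """Return the root plus every descendant within `max_depth` parent->child hops.

    `relationships` is a list of (parent_table, child_table) pairs.
    This is an illustrative rule only, not SDV's actual pruning logic.
    """
    # Build a parent -> children adjacency map.
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)

    kept = {root}
    queue = deque([(root, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend past the cutoff
        for child in children.get(table, []):
            if child not in kept:
                kept.add(child)
                queue.append((child, depth + 1))
    return kept


# Hypothetical schema: regions -> hotels -> guests -> bookings -> payments
relationships = [
    ("regions", "hotels"),
    ("hotels", "guests"),
    ("guests", "bookings"),
    ("bookings", "payments"),
]
kept = tables_within_depth(relationships, root="hotels")
print(sorted(kept))
```

In this sketch `payments` falls outside the cutoff and `regions` is not a descendant of the root, so both would be pruned. The real function also trims columns, which the sketch does not attempt.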

Usage

Import these functions when complex multi-table data makes HMASynthesizer slow to run or degrades the quality of its output.

Code Reference

Source Location

Repository: SDV
File: sdv/utils/poc.py
Lines: L29-141

Signature

def simplify_schema(data, metadata, verbose=True):
    """Simplify the schema of the data and metadata.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        verbose (bool): Print simplification info. Defaults to True.

    Returns:
        tuple: (simplified_data: dict, simplified_metadata: MultiTableMetadata)
    """

def get_random_subset(data, metadata, main_table_name, num_rows, verbose=True):
    """Subsample multi-table data based on a table and number of rows.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        main_table_name (str): Name of the main table.
        num_rows (int): Number of rows to keep in the main table.
        verbose (bool): Print subsampling info. Defaults to True.

    Returns:
        dict: Dictionary with subsampled DataFrames.
    """

Import

from sdv.utils.poc import simplify_schema, get_random_subset

I/O Contract

Inputs (simplify_schema)

Name	Type	Required	Description
data	dict[str, pd.DataFrame]	Yes	Multi-table data
metadata	Metadata	Yes	Multi-table metadata with relationships
verbose	bool	No	Print info (default: True)

Outputs (simplify_schema)

Name	Type	Description
return value	tuple(dict, Metadata)	Simplified data and metadata

Inputs (get_random_subset)

Name	Type	Required	Description
data	dict[str, pd.DataFrame]	Yes	Multi-table data
metadata	Metadata	Yes	Multi-table metadata
main_table_name	str	Yes	Main table for subsetting
num_rows	int	Yes	Target row count for main table
verbose	bool	No	Print info (default: True)
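
How main_table_name and num_rows interact can be sketched for the simplest case of one parent table and one child table. This is a minimal illustration using plain lists of dicts, assuming the main table is the parent; the helper name `subsample_parent_child`, the `seed` argument, and the sample rows are hypothetical, and SDV's own procedure for multi-level schemas is more involved.

```python
import random


def subsample_parent_child(parents, children, pk, fk, num_rows, seed=0):
    """Keep `num_rows` random parent rows and only the children that reference them."""
    rng = random.Random(seed)
    # Sample the main (parent) table down to the requested size.
    kept_parents = rng.sample(parents, min(num_rows, len(parents)))
    kept_keys = {row[pk] for row in kept_parents}
    # Drop child rows whose foreign key points at a removed parent.
    kept_children = [row for row in children if row[fk] in kept_keys]
    return kept_parents, kept_children


# Hypothetical data: 5 hotels, 20 guests spread evenly across them.
hotels = [{"hotel_id": f"H{i}"} for i in range(5)]
guests = [
    {"guest_email": f"g{i}@example.com", "hotel_id": f"H{i % 5}"}
    for i in range(20)
]

sub_hotels, sub_guests = subsample_parent_child(
    hotels, guests, pk="hotel_id", fk="hotel_id", num_rows=2
)
print(len(sub_hotels), len(sub_guests))
```

Filtering children by the surviving parent keys is what keeps every remaining foreign key resolvable after the main table shrinks.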

Outputs (get_random_subset)

Name	Type	Description
return value	dict[str, pd.DataFrame]	Subsampled data preserving referential integrity
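
One way to confirm that a subset preserves referential integrity is to look for foreign-key values with no matching primary key. The helper below is a hypothetical check written for this page, not part of SDV, and the key values are made up.

```python
def orphaned_keys(child_keys, parent_keys):
    """Return the foreign-key values that have no matching primary key."""
    parents = set(parent_keys)
    # None marks a missing foreign key and is ignored rather than reported.
    return {key for key in child_keys if key is not None and key not in parents}


parent_ids = ["H1", "H2"]
child_fks = ["H1", "H2", "H2", None, "H9"]

missing = orphaned_keys(child_fks, parent_ids)
print(missing)  # an empty set would mean every reference resolves
```

Because the helper accepts any iterable, it could be called on DataFrame columns, for example `orphaned_keys(subset['guests']['hotel_id'].dropna(), subset['hotels']['hotel_id'])`, assuming the subset contains `guests` and `hotels` tables linked by a `hotel_id` column.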

Usage Examples

from sdv.datasets.demo import download_demo
from sdv.utils.poc import simplify_schema, get_random_subset

data, metadata = download_demo(modality='multi_table', dataset_name='fake_hotels')

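# Simplify the schema; returns new data and metadata and leaves the inputs for reuse
simple_data, simple_metadata = simplify_schema(data, metadata)

# Subsample so the 'guests' table keeps 100 rows; related tables are reduced to match
subset = get_random_subset(data, metadata, main_table_name='guests', num_rows=100)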
