Implementation:Sdv dev SDV Simplify Schema

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Synthetic_Data
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tool for simplifying multi-table schemas and subsampling data for proof-of-concept HMA synthesis, provided by the SDV library.

Description

The simplify_schema function reduces complex multi-table schemas by pruning distant tables and excess columns. The get_random_subset function subsamples rows while preserving referential integrity. Both are used for proof-of-concept workflows with HMASynthesizer.
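The idea of "pruning distant tables" can be pictured as a breadth-first walk over the parent-to-child relationships that keeps only tables within a fixed number of hops of a chosen root. The sketch below is an illustration under stated assumptions, not SDV's actual implementation: the helper `tables_within_depth`, the `max_depth` parameter, and the `bookings` and `payments` tables are all hypothetical.

```python
from collections import deque


def tables_within_depth(relationships, root, max_depth):
    """Return tables reachable from `root` in at most `max_depth` parent-to-child hops.

    Hypothetical helper for illustration only; not part of SDV.
    """
    # Build an adjacency list: parent table -> list of child tables.
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)

    kept = {root}
    queue = deque([(root, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in children.get(table, []):
            if child not in kept:
                kept.add(child)
                queue.append((child, depth + 1))
    return kept


# Hypothetical schema expressed as (parent_table, child_table) pairs.
relationships = [
    ("hotels", "guests"),
    ("guests", "bookings"),
    ("bookings", "payments"),
]

kept = tables_within_depth(relationships, root="hotels", max_depth=2)

# Keep only relationships whose two endpoints both survived the pruning.
pruned_relationships = [(p, c) for p, c in relationships if p in kept and c in kept]
```

With a depth limit of 2, the `payments` table sits three hops from `hotels` and is dropped together with the relationship that pointed to it.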

Usage

Import these functions when working with complex multi-table data that causes HMASynthesizer to be slow or produce poor results.

Code Reference

Source Location

  • Repository: SDV
  • File: sdv/utils/poc.py
  • Lines: L29-141

Signature

def simplify_schema(data, metadata, verbose=True):
    """Simplify the schema of the data and metadata.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        verbose (bool): Print simplification info. Defaults to True.

    Returns:
        tuple: (simplified_data: dict, simplified_metadata: MultiTableMetadata)
    """

def get_random_subset(data, metadata, main_table_name, num_rows, verbose=True):
    """Subsample multi-table data based on a table and number of rows.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        main_table_name (str): Name of the main table.
        num_rows (int): Number of rows to keep in the main table.
        verbose (bool): Print subsampling info. Defaults to True.

    Returns:
        dict: Dictionary with subsampled DataFrames.
    """

Import

from sdv.utils.poc import simplify_schema, get_random_subset

I/O Contract

Inputs (simplify_schema)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data | dict[str, pd.DataFrame] | Yes | Multi-table data |
| metadata | Metadata | Yes | Multi-table metadata with relationships |
| verbose | bool | No | Print info (default: True) |

Outputs (simplify_schema)

| Name | Type | Description |
|------|------|-------------|
| return value | tuple(dict, Metadata) | Simplified data and metadata |

Inputs (get_random_subset)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data | dict[str, pd.DataFrame] | Yes | Multi-table data |
| metadata | Metadata | Yes | Multi-table metadata |
| main_table_name | str | Yes | Main table for subsetting |
| num_rows | int | Yes | Target row count for main table |
| verbose | bool | No | Print info (default: True) |

Outputs (get_random_subset)

| Name | Type | Description |
|------|------|-------------|
| return value | dict[str, pd.DataFrame] | Subsampled data preserving referential integrity |
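"Preserving referential integrity" means that every foreign key left in a child table still points at a row that exists in its parent table. The following is a minimal, dependency-free sketch of that property for one parent and one child table. It is not how get_random_subset is implemented: plain lists of dicts stand in for DataFrames, and the helpers `subsample_with_integrity` and `has_referential_integrity`, the `seed` parameter, and the column names are hypothetical.

```python
import random


def subsample_with_integrity(parent_rows, child_rows, primary_key, foreign_key,
                             num_rows, seed=0):
    """Sample parent rows, then keep only child rows that reference them.

    Hypothetical helper for illustration only; not part of SDV.
    """
    rng = random.Random(seed)
    n = min(num_rows, len(parent_rows))
    sampled_parents = rng.sample(parent_rows, n)

    # Child rows survive only if their foreign key matches a sampled parent.
    kept_keys = {row[primary_key] for row in sampled_parents}
    sampled_children = [row for row in child_rows if row[foreign_key] in kept_keys]
    return sampled_parents, sampled_children


def has_referential_integrity(parent_rows, child_rows, primary_key, foreign_key):
    """Return True if every child foreign key exists among the parent primary keys."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return all(row[foreign_key] in parent_keys for row in child_rows)


# Hypothetical data: 10 hotels, 50 guests, 5 guests per hotel.
hotels = [{"hotel_id": i} for i in range(10)]
guests = [{"guest_id": g, "hotel_id": g % 10} for g in range(50)]

sub_hotels, sub_guests = subsample_with_integrity(
    hotels, guests, primary_key="hotel_id", foreign_key="hotel_id", num_rows=3
)
integrity_ok = has_referential_integrity(sub_hotels, sub_guests, "hotel_id", "hotel_id")
```

A check like `has_referential_integrity` can also serve as a sanity test on any subset before handing it to a synthesizer, since orphaned child rows are a common cause of fitting errors in multi-table workflows.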

Usage Examples

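from sdv.datasets.demo import download_demo
from sdv.utils.poc import simplify_schema, get_random_subset

# Load a multi-table demo dataset (a dict of DataFrames plus its metadata)
data, metadata = download_demo(modality='multi_table', dataset_name='fake_hotels')

# Simplify the schema; returns both the reduced data and the matching metadata
simple_data, simple_metadata = simplify_schema(data, metadata)

# Subsample the original data so that the 'hotels' table keeps 100 rows;
# the metadata is unchanged, so only the data dict is returned
subset = get_random_subset(data, metadata, main_table_name='hotels', num_rows=100)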

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
