Implementation:Sdv dev SDV Simplify Schema
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Synthetic_Data |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for simplifying multi-table schemas and subsampling data for proof-of-concept HMA synthesis, provided by the SDV library.
Description
The simplify_schema function reduces complex multi-table schemas by pruning distant tables and excess columns. The get_random_subset function subsamples rows while preserving referential integrity. Both are used for proof-of-concept workflows with HMASynthesizer.
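To make "pruning distant tables" concrete, the sketch below keeps only the tables within a fixed number of parent-to-child hops of a chosen root table. It is an illustration of the idea, not SDV's implementation: the function name `prune_distant_tables`, the `max_depth` parameter, and the four-table schema are all hypothetical, and the rule SDV actually applies may differ.

```python
from collections import deque


def prune_distant_tables(relationships, root, max_depth=2):
    """Return {table: distance} for tables within `max_depth` hops of `root`.

    `relationships` is a list of (parent_table, child_table) pairs.
    Hypothetical helper for illustration; not part of SDV.
    """
    # Build a parent -> children adjacency map.
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)

    kept = {root: 0}  # table name -> distance from the root
    queue = deque([root])
    while queue:
        table = queue.popleft()
        if kept[table] == max_depth:
            continue  # do not expand past the depth limit
        for child in children.get(table, []):
            if child not in kept:
                kept[child] = kept[table] + 1
                queue.append(child)
    return kept


# Hypothetical schema: hotels -> guests -> bookings -> payments
relationships = [
    ("hotels", "guests"),
    ("guests", "bookings"),
    ("bookings", "payments"),
]
kept = prune_distant_tables(relationships, root="hotels", max_depth=2)
print(sorted(kept))
```

With `max_depth=2`, the `payments` table sits three hops from the root and is dropped. A breadth-first traversal is used here because it visits tables in order of distance, so the depth limit is easy to enforce.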
Usage
Import these functions when working with complex multi-table data that causes HMASynthesizer to be slow or produce poor results.
Code Reference
Source Location
- Repository: SDV
- File: sdv/utils/poc.py
- Lines: L29-141
Signature
def simplify_schema(data, metadata, verbose=True):
    """Simplify the schema of the data and metadata.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        verbose (bool): Print simplification info. Defaults to True.

    Returns:
        tuple: (simplified_data: dict, simplified_metadata: MultiTableMetadata)
    """

def get_random_subset(data, metadata, main_table_name, num_rows, verbose=True):
    """Subsample multi-table data based on a table and number of rows.

    Args:
        data (dict): Dictionary mapping table names to DataFrames.
        metadata (MultiTableMetadata): Metadata of the datasets.
        main_table_name (str): Name of the main table.
        num_rows (int): Number of rows to keep in the main table.
        verbose (bool): Print subsampling info. Defaults to True.

    Returns:
        dict: Dictionary with subsampled DataFrames.
    """
Import
from sdv.utils.poc import simplify_schema, get_random_subset
I/O Contract
Inputs (simplify_schema)
| Name | Type | Required | Description |
|---|---|---|---|
| data | dict[str, pd.DataFrame] | Yes | Multi-table data |
| metadata | Metadata | Yes | Multi-table metadata with relationships |
| verbose | bool | No | Print info (default: True) |
Outputs (simplify_schema)
| Name | Type | Description |
|---|---|---|
| return value | tuple(dict, Metadata) | Simplified data and metadata |
Inputs (get_random_subset)
| Name | Type | Required | Description |
|---|---|---|---|
| data | dict[str, pd.DataFrame] | Yes | Multi-table data |
| metadata | Metadata | Yes | Multi-table metadata |
| main_table_name | str | Yes | Main table for subsetting |
| num_rows | int | Yes | Target row count for main table |
| verbose | bool | No | Print info (default: True) |
Outputs (get_random_subset)
| Name | Type | Description |
|---|---|---|
| return value | dict[str, pd.DataFrame] | Subsampled data preserving referential integrity |
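What "preserving referential integrity" means for a subset can be shown with a minimal sketch for one parent table and one child table: sample the parent rows, then keep only the child rows whose foreign key points at a surviving parent. This uses plain Python lists of dicts rather than DataFrames, and the helper `subsample_parent_child` and its parameters are hypothetical. It is not how `get_random_subset` is implemented, and the real function handles full multi-table schemas.

```python
import random


def subsample_parent_child(parents, children, parent_key, foreign_key, num_rows, seed=0):
    """Keep `num_rows` random parent rows and the child rows that reference them.

    Hypothetical helper for illustration; not part of SDV.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = min(num_rows, len(parents))  # never ask for more rows than exist
    kept_parents = rng.sample(parents, n)
    kept_ids = {row[parent_key] for row in kept_parents}
    # Drop every child row whose parent was not sampled, so no foreign key dangles.
    kept_children = [row for row in children if row[foreign_key] in kept_ids]
    return kept_parents, kept_children


# Hypothetical data: 10 hotels, 50 guests, 5 guests per hotel.
hotels = [{"hotel_id": i} for i in range(10)]
guests = [{"guest_id": g, "hotel_id": g % 10} for g in range(50)]

sub_hotels, sub_guests = subsample_parent_child(
    hotels, guests, parent_key="hotel_id", foreign_key="hotel_id", num_rows=3
)
kept_hotel_ids = {row["hotel_id"] for row in sub_hotels}
print(len(sub_hotels), len(sub_guests))
```

Every guest in `sub_guests` refers to a hotel in `sub_hotels`, which is the property the returned dictionary is described as preserving.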
Usage Examples
from sdv.datasets.demo import download_demo
from sdv.utils.poc import simplify_schema, get_random_subset

# Load a multi-table demo dataset as (data, metadata)
data, metadata = download_demo(modality='multi_table', dataset_name='fake_hotels')

# Simplify the schema; returns the reduced data and the matching metadata
simple_data, simple_metadata = simplify_schema(data, metadata)

# Subsample the original data, targeting 100 rows in the main table ('hotels')
subset = get_random_subset(data, metadata, main_table_name='hotels', num_rows=100)
Related Pages
Implements Principle
Requires Environment
Uses Heuristic