Principle:Sdv dev SDV Schema Simplification
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Synthetic_Data |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A data reduction technique that simplifies complex multi-table schemas by removing distant tables and excess columns to enable faster prototyping with hierarchical synthesizers.
Description
Schema simplification addresses the difficulty of fitting HMA (Hierarchical Modeling Algorithm) synthesizers on large, complex relational databases. During table augmentation, HMA adds extension columns to parent tables, and a schema with many tables and columns can make that number excessive, which slows fitting and can degrade the quality of the synthetic data. The simplification process removes tables beyond the grandchild level, strips modelable columns from grandchild tables, reduces columns in child tables, and eliminates relationships not connected to the main root table.
A companion operation, random subsetting, reduces the number of rows while preserving referential integrity, so every foreign key in the reduced data still points to a row that was kept.
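A minimal sketch of referential-integrity-preserving subsetting, written in plain Python rather than against the SDV API. The function name `random_subset_sketch`, the row-dict data layout, and the relationship tuple format are all hypothetical choices for illustration. It assumes the main table is the root and that relationships are listed parent before child; it is not a description of how SDV implements the operation.

```python
import random


def random_subset_sketch(data, relationships, main_table, num_rows, seed=0):
    """Sample rows from the main table and keep only matching descendants.

    data: {table_name: [row dicts]}
    relationships: [(parent, parent_key, child, foreign_key)], listed top-down
    (both structures are assumptions made for this sketch).
    """
    rng = random.Random(seed)
    subset = dict(data)
    rows = data[main_table]
    subset[main_table] = rng.sample(rows, min(num_rows, len(rows)))

    # Walk the relationships top-down: a child row survives only if the
    # parent row it references survived the previous step.
    for parent, parent_key, child, foreign_key in relationships:
        kept_keys = {row[parent_key] for row in subset[parent]}
        subset[child] = [r for r in subset[child] if r[foreign_key] in kept_keys]
    return subset


# Toy data: 10 hotels, 5 guests per hotel, 4 bookings per guest.
data = {
    "hotels": [{"hotel_id": i} for i in range(10)],
    "guests": [{"guest_id": i, "hotel_id": i % 10} for i in range(50)],
    "bookings": [{"booking_id": i, "guest_id": i % 50} for i in range(200)],
}
relationships = [
    ("hotels", "hotel_id", "guests", "hotel_id"),
    ("guests", "guest_id", "bookings", "guest_id"),
]
subset = random_subset_sketch(data, relationships, "hotels", num_rows=3)
```

Sampling only the main table and filtering downward keeps the row ratio between parents and children intact, which is why the child tables shrink proportionally instead of being sampled independently.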
Usage
Use schema simplification as an optional preprocessing step before HMASynthesizer when the multi-table dataset has a complex schema with many tables or columns. It is particularly useful for proof-of-concept workflows where fast iteration is more important than complete fidelity.
Theoretical Basis
The simplification algorithm operates hierarchically:
- Identify root table: Find the table with no parent (or the largest root if multiple exist)
- Prune distant tables: Keep only children and grandchildren of the root
- Reduce grandchild columns: Remove all modelable columns from grandchild tables (keep only keys)
- Reduce child columns: Keep a subset of modelable columns in child tables
- Update metadata: Remove pruned relationships and columns from metadata
- Estimate column count: This check gates the steps above. Simplify only if the estimated number of extension columns exceeds the threshold (1000); otherwise leave the schema unchanged
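The steps above can be sketched on a toy schema in plain Python. This is an illustration under stated assumptions, not SDV's implementation: the function name `simplify_schema_sketch`, the dict-based schema layout, and the `max_child_columns` parameter are hypothetical. "Largest root" is interpreted here as the root with the most descendants, the root table is assumed to keep all its columns, and the extension-column estimate is passed in by the caller because the source does not give the formula.

```python
from collections import defaultdict


def simplify_schema_sketch(tables, relationships, estimated_columns,
                           threshold=1000, max_child_columns=2):
    """tables: {name: {"keys": [...], "columns": [...]}} (modelable columns);
    relationships: [(parent, child)]. Both layouts are assumptions."""
    # Gate: leave the schema untouched when the estimate is within budget.
    if estimated_columns <= threshold:
        return tables, relationships

    children_of = defaultdict(list)
    has_parent = set()
    for parent, child in relationships:
        children_of[parent].append(child)
        has_parent.add(child)

    def descendants(name):
        seen, stack = set(), [name]
        while stack:
            for child in children_of[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    # Root: a table with no parent; ties broken by descendant count (assumption).
    roots = [name for name in tables if name not in has_parent]
    root = max(roots, key=lambda name: len(descendants(name)))

    children = set(children_of[root])
    grandchildren = {g for c in children for g in children_of[c]} - children - {root}

    simplified = {root: dict(tables[root])}
    for name in children:  # keep keys plus a subset of modelable columns
        simplified[name] = {"keys": list(tables[name]["keys"]),
                            "columns": tables[name]["columns"][:max_child_columns]}
    for name in grandchildren:  # keep keys only
        simplified[name] = {"keys": list(tables[name]["keys"]), "columns": []}

    kept_relationships = [
        (p, c) for p, c in relationships
        if (p == root and c in children) or (p in children and c in grandchildren)
    ]
    return simplified, kept_relationships


tables = {
    "hotels": {"keys": ["hotel_id"], "columns": ["city", "rating"]},
    "guests": {"keys": ["guest_id", "hotel_id"], "columns": ["name", "age", "email"]},
    "bookings": {"keys": ["booking_id", "guest_id"], "columns": ["price", "nights"]},
    "payments": {"keys": ["payment_id", "booking_id"], "columns": ["amount"]},
    "regions": {"keys": ["region_id"], "columns": ["label"]},  # unconnected root
}
relationships = [("hotels", "guests"), ("guests", "bookings"), ("bookings", "payments")]
simplified, kept = simplify_schema_sketch(tables, relationships, estimated_columns=1500)
```

In this example `payments` is dropped as a great-grandchild, `regions` is dropped because it is not connected to the chosen root, `bookings` keeps only its keys, and `guests` keeps its first two modelable columns.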