Heuristic:Sdv dev SDV HMA Schema Simplification
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Synthetic_Data, Multi_Table |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Performance optimization: use `simplify_schema` to reduce multi-table complexity when HMASynthesizer estimates more than 1000 columns for the flattened representation.
Description
The HMASynthesizer (Hierarchical Modeling Algorithm) models child table distributions by extending parent rows with statistical parameters derived from child data. Each child column generates multiple parameter columns (e.g., the `beta` distribution generates 4 parameter columns per child column). For schemas with many tables and columns, the flattened representation can therefore grow to an enormous number of columns. SDV provides `sdv.utils.poc.simplify_schema` to automatically reduce schema complexity by removing distant tables and non-essential columns.
Usage
Apply this heuristic when using HMASynthesizer on multi-table schemas. If SDV prints a PerformanceAlert during fitting indicating that the estimated column count exceeds 1000, simplify the schema before training. This is also useful for proof-of-concept work where full schema fidelity is not required.
The Insight (Rule of Thumb)
- Action: Call `sdv.utils.poc.simplify_schema(data=data, metadata=metadata)` before creating the HMASynthesizer, then build and fit the synthesizer on the simplified data and metadata it returns.
- Threshold: The alert triggers when the estimated total column count exceeds `MAX_NUMBER_OF_COLUMNS = 1000`. For very large schemas, the column count shown in the alert is capped at 1,000,000+ (`PERFORMANCE_ALERT_DISPLAY_CAP`).
- Trade-off: Schema simplification removes grandchild table columns and distant relationships, reducing fidelity for those tables. Root and direct child tables retain full modeling.
- Alternative: For enterprise-scale schemas, DataCebo recommends contacting them for enterprise solutions.
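The action above can be sketched as follows. This is a minimal sketch, not SDV's documented recipe: the wrapper name `fit_hma_on_simplified_schema` is hypothetical, and it assumes that `simplify_schema` accepts `data` and `metadata` keyword arguments and returns the simplified data and metadata as a pair. Check the signature in your installed SDV version before relying on it.

```python
def fit_hma_on_simplified_schema(data, metadata):
    """Hypothetical helper: simplify a multi-table schema, then fit HMA on it.

    data: dict mapping table name -> pandas.DataFrame
    metadata: SDV multi-table metadata describing those tables
    """
    # Imported lazily so this sketch can be loaded without SDV installed.
    from sdv.multi_table import HMASynthesizer
    from sdv.utils import poc

    # Assumption: simplify_schema returns (simplified_data, simplified_metadata).
    simplified_data, simplified_metadata = poc.simplify_schema(
        data=data,
        metadata=metadata,
    )

    # Fit on the simplified inputs, not the originals, so the reduced
    # schema is the one HMA flattens.
    synthesizer = HMASynthesizer(simplified_metadata)
    synthesizer.fit(simplified_data)
    return synthesizer, simplified_metadata
```

The synthesizer returned here only knows the simplified schema, so sampled data will contain the retained tables and columns rather than the full original schema.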
Reasoning
Each distribution type in HMA generates a different number of parameter columns per source column:
| Distribution | Parameter Columns |
|---|---|
| beta | 4 |
| truncnorm | 4 |
| gamma | 3 |
| norm | 2 |
| uniform | 2 |
For a child table with 50 columns modeled with the `beta` distribution, this adds 50 × 4 = 200 parameter columns to the parent. The default extended-column distribution, `truncnorm`, also uses 4 parameters per column. With multiple levels of hierarchy and many tables, column counts can reach hundreds of thousands, making GaussianCopula fitting intractable.
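The arithmetic can be checked with a short back-of-the-envelope script. This is an illustrative sketch, not SDV's own estimator: `estimate_parameter_columns` and the table names are hypothetical, and the function counts only the per-column distribution parameters listed in the table above.

```python
# Parameter-column counts and threshold as quoted on this page
# from sdv/multi_table/hma.py.
DISTRIBUTIONS_TO_NUM_PARAMETER_COLUMNS = {
    'beta': 4,
    'truncnorm': 4,
    'gamma': 3,
    'norm': 2,
    'uniform': 2,
}
MAX_NUMBER_OF_COLUMNS = 1000


def estimate_parameter_columns(child_column_counts, distribution='truncnorm'):
    """Hypothetical helper: count distribution-parameter columns added to a parent.

    child_column_counts maps a child table name to its number of modeled columns.
    Only the per-column distribution parameters are counted here.
    """
    per_column = DISTRIBUTIONS_TO_NUM_PARAMETER_COLUMNS[distribution]
    return sum(count * per_column for count in child_column_counts.values())


# One 50-column child table with beta: 50 * 4 parameter columns.
single_child = estimate_parameter_columns({'orders': 50}, distribution='beta')

# Six 50-column child tables (hypothetical names) push the total past the threshold.
many_children = {f'child_{i}': 50 for i in range(6)}
total = estimate_parameter_columns(many_children, distribution='beta')

print(single_child, total, total > MAX_NUMBER_OF_COLUMNS)
```

Switching the same six tables to `norm` halves the count, which shows how strongly the chosen distribution affects the flattened width.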
Code Evidence
HMA performance constants from `sdv/multi_table/hma.py:19-21`:
PERFORMANCE_ALERT_DISPLAY_CAP = 1_000_000
DEFAULT_EXTENDED_COLUMNS_DISTRIBUTION = 'truncnorm'
MAX_NUMBER_OF_COLUMNS = 1000
PerformanceAlert trigger from `sdv/multi_table/hma.py:277-298`:
if total_est_cols > MAX_NUMBER_OF_COLUMNS:
    self._print(
        'PerformanceAlert: Using the HMASynthesizer on this metadata '
        'schema is not recommended. To model this data, HMA will '
        f'generate a large number of columns. ({display_total} columns)\n\n'
    )
    self._print(
        'We recommend simplifying your metadata schema using '
        "'sdv.utils.poc.simplify_schema'.\nIf this is not possible, please visit "
        'datacebo.com and reach out to us for enterprise solutions.\n'
    )
Distribution-to-parameter mapping from `sdv/multi_table/hma.py:38-43`:
DISTRIBUTIONS_TO_NUM_PARAMETER_COLUMNS = {
    'beta': 4,
    'truncnorm': 4,
    'gamma': 3,
    'norm': 2,
    'uniform': 2,
}