Heuristic:Sdv dev SDV HMA Schema Simplification

Knowledge Sources	SDV DataCebo engineering
Domains	Optimization, Synthetic_Data, Multi_Table
Last Updated	2026-02-14 19:00 GMT

Overview

Performance optimization: use `simplify_schema` to reduce multi-table complexity when HMASynthesizer estimates more than 1000 columns for the flattened representation.

Description

The HMASynthesizer (Hierarchical Modeling Algorithm) models child table distributions by extending parent rows with statistical parameters derived from child data. Each child column generates multiple parameter columns (e.g., the `beta` distribution generates 4 parameter columns per child column). For schemas with many tables and columns, the flattened column count can grow very large. SDV provides `sdv.utils.poc.simplify_schema` to automatically reduce schema complexity by removing distant tables and non-essential columns.

Usage

Apply this heuristic when using HMASynthesizer on multi-table schemas. If SDV prints a PerformanceAlert indicating that the estimated column count exceeds 1000, simplify the schema before training. This is also useful for proof-of-concept work where full schema fidelity is not required.

The Insight (Rule of Thumb)

Action: Call `sdv.utils.poc.simplify_schema(data, metadata)` before creating the HMASynthesizer, then build and fit the synthesizer with the simplified data and metadata it returns.
Threshold: The alert triggers when the estimated total column count exceeds MAX_NUMBER_OF_COLUMNS = 1000. For very large schemas, the count shown in the alert is capped at 1,000,000+ (PERFORMANCE_ALERT_DISPLAY_CAP).
Trade-off: Schema simplification removes grandchild table columns and distant relationships, reducing fidelity for those tables. Root and direct child tables retain full modeling.
Alternative: For enterprise-scale schemas, DataCebo recommends contacting them for enterprise solutions.
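
The action above can be sketched as follows. This is a minimal sketch, not DataCebo's reference usage: it assumes `data` is a dictionary mapping table names to pandas DataFrames and `metadata` is the matching SDV multi-table metadata object, and the helper name `fit_hma_on_simplified_schema` is hypothetical. It also assumes that `simplify_schema` takes the data first and returns a `(data, metadata)` pair.

```python
def fit_hma_on_simplified_schema(data, metadata):
    """Hypothetical helper: simplify a multi-table schema, then fit HMA on it."""
    # Imported lazily so this sketch can be defined without SDV installed.
    from sdv.multi_table import HMASynthesizer
    from sdv.utils import poc

    # Assumption: simplify_schema returns reduced copies of data and metadata.
    simplified_data, simplified_metadata = poc.simplify_schema(data, metadata)

    # Build and fit the synthesizer on the reduced schema only.
    synthesizer = HMASynthesizer(simplified_metadata)
    synthesizer.fit(simplified_data)
    return synthesizer
```

Because the synthesizer only ever sees the simplified metadata, any tables or columns dropped by the simplification are absent from the synthetic output as well.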

Reasoning

Each distribution type in HMA generates a different number of parameter columns per source column:

Distribution	Parameter Columns
beta	4
truncnorm	4
gamma	3
norm	2
uniform	2

For a child table with 50 columns using the `beta` distribution, this generates 200 additional columns in the parent. With multiple levels of hierarchy and many tables, column counts can reach hundreds of thousands, making GaussianCopula fitting intractable.
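
The arithmetic above can be reproduced with a short calculation. This is a simplified sketch that counts only the per-column distribution parameters from the table above; it ignores any other columns HMA may add when extending a parent, so treat the results as lower bounds. The function name `parameter_columns` is hypothetical.

```python
# Parameter columns per source column, copied from the table above.
PARAMETER_COLUMNS = {'beta': 4, 'truncnorm': 4, 'gamma': 3, 'norm': 2, 'uniform': 2}


def parameter_columns(num_columns, distribution):
    """Columns added to a parent for a child with `num_columns` modeled columns."""
    return num_columns * PARAMETER_COLUMNS[distribution]


# One level: a 50-column child modeled with 'beta'.
child_only = parameter_columns(50, 'beta')  # 200

# Two levels: a 50-column grandchild first extends a 50-column child,
# and the extended child is then summarized into the parent.
extended_child = 50 + parameter_columns(50, 'beta')  # 250
from_extended_child = parameter_columns(extended_child, 'beta')  # 1000
```

Under these assumptions, a single additional level of hierarchy is enough to move the parent from 200 to 1000 extra columns, which is how deep schemas cross the alert threshold quickly.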

Code Evidence

HMA performance constants from `sdv/multi_table/hma.py:19-21`:

PERFORMANCE_ALERT_DISPLAY_CAP = 1_000_000
DEFAULT_EXTENDED_COLUMNS_DISTRIBUTION = 'truncnorm'
MAX_NUMBER_OF_COLUMNS = 1000

PerformanceAlert trigger from `sdv/multi_table/hma.py:277-298`:

if total_est_cols > MAX_NUMBER_OF_COLUMNS:
    self._print(
        'PerformanceAlert: Using the HMASynthesizer on this metadata '
        'schema is not recommended. To model this data, HMA will '
        f'generate a large number of columns. ({display_total} columns)\n\n'
    )
    self._print(
        'We recommend simplifying your metadata schema using '
        "'sdv.utils.poc.simplify_schema'.\nIf this is not possible, please visit "
        'datacebo.com and reach out to us for enterprise solutions.\n'
    )
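
The excerpt above does not show how `display_total` is derived. The following standalone sketch illustrates one plausible reading of the threshold and the display cap; the function name `describe_estimate` and the exact formatting are assumptions for illustration, not code taken from `hma.py`.

```python
MAX_NUMBER_OF_COLUMNS = 1000
PERFORMANCE_ALERT_DISPLAY_CAP = 1_000_000


def describe_estimate(total_est_cols):
    """Return (alert_triggered, display_total) for an estimated column count."""
    # The alert fires only when the estimate is strictly above the threshold.
    alert_triggered = total_est_cols > MAX_NUMBER_OF_COLUMNS

    # Assumption: counts above the display cap are shown as '1,000,000+'.
    if total_est_cols > PERFORMANCE_ALERT_DISPLAY_CAP:
        display_total = f'{PERFORMANCE_ALERT_DISPLAY_CAP:,}+'
    else:
        display_total = str(total_est_cols)
    return alert_triggered, display_total
```
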

Distribution-to-parameter mapping from `sdv/multi_table/hma.py:38-43`:

DISTRIBUTIONS_TO_NUM_PARAMETER_COLUMNS = {
    'beta': 4,
    'truncnorm': 4,
    'gamma': 3,
    'norm': 2,
    'uniform': 2,
}
