Principle:Sdv dev SDV Schema Simplification

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Synthetic_Data
Last Updated 2026-02-14 00:00 GMT

Overview

A data reduction technique that simplifies complex multi-table schemas by removing distant tables and excess columns to enable faster prototyping with hierarchical synthesizers.

Description

Schema simplification addresses the challenge of working with large, complex relational databases when using HMA (Hierarchical Modeling Algorithm) synthesis. During table augmentation, HMA adds extension columns to each parent table to summarize its child tables. In a schema with many tables and columns, the number of extension columns can become excessive, which slows fitting and degrades the quality of the synthetic data. The simplification process removes tables beyond the grandchild level, strips all modelable columns from grandchild tables (leaving only keys), reduces the modelable columns in child tables, and eliminates relationships that are not connected to the main root table.

A companion operation, random subsetting, reduces the number of rows while preserving referential integrity.
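The idea behind random subsetting can be illustrated with a minimal, library-independent sketch. It is not SDV's implementation: the function name `random_subset`, the row-dict table layout, and the `(parent, primary_key, child, foreign_key)` relationship tuples are all assumptions made for illustration. The sketch samples rows from a main table, then repeatedly drops child rows whose foreign key no longer points at a kept parent row.

```python
import random


def random_subset(tables, relationships, main_table, num_rows, seed=0):
    """Sample rows from main_table and drop orphaned descendant rows.

    tables: {table_name: [row_dict, ...]}  (hypothetical layout)
    relationships: [(parent, primary_key, child, foreign_key), ...]
    """
    rng = random.Random(seed)
    # Copy the row lists so the caller's tables are not modified.
    subset = {name: list(rows) for name, rows in tables.items()}
    main_rows = subset[main_table]
    subset[main_table] = rng.sample(main_rows, min(num_rows, len(main_rows)))

    # Cascade: keep removing child rows that reference a dropped parent row
    # until no relationship is violated.
    changed = True
    while changed:
        changed = False
        for parent, pk, child, fk in relationships:
            valid_keys = {row[pk] for row in subset[parent]}
            kept = [row for row in subset[child] if row[fk] in valid_keys]
            if len(kept) != len(subset[child]):
                subset[child] = kept
                changed = True
    return subset


tables = {
    "users": [{"user_id": i} for i in (1, 2, 3, 4)],
    "sessions": [
        {"session_id": 10, "user_id": 1},
        {"session_id": 11, "user_id": 1},
        {"session_id": 12, "user_id": 2},
        {"session_id": 13, "user_id": 3},
        {"session_id": 14, "user_id": 4},
    ],
    "transactions": [
        {"tx_id": 100, "session_id": 10},
        {"tx_id": 101, "session_id": 12},
        {"tx_id": 102, "session_id": 14},
    ],
}
relationships = [
    ("users", "user_id", "sessions", "user_id"),
    ("sessions", "session_id", "transactions", "session_id"),
]

subset = random_subset(tables, relationships, "users", num_rows=2)
```

This sketch only cascades downward from the sampled table. Tables that are parents of the main table keep all their rows, which still leaves every foreign key valid but does not shrink those tables.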

Usage

Use schema simplification as an optional preprocessing step before HMASynthesizer when the multi-table dataset has a complex schema with many tables or columns. It is particularly useful for proof-of-concept workflows where fast iteration is more important than complete fidelity.

Theoretical Basis

The simplification algorithm operates hierarchically:

  1. Identify root table: Find the table with no parent (or the largest root if multiple exist)
  2. Prune distant tables: Keep only children and grandchildren of the root
  3. Reduce grandchild columns: Remove all modelable columns from grandchild tables (keep only keys)
  4. Reduce child columns: Keep a subset of modelable columns in child tables
  5. Update metadata: Remove pruned relationships and columns from metadata
  6. Check estimated column count: Apply the steps above only if the estimated number of extension columns exceeds the threshold (1000); otherwise leave the schema unchanged
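The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the library's code: the function name, the `{"keys": [...], "columns": [...]}` table layout, the `(parent, child)` relationship pairs, and the `max_child_columns` parameter are hypothetical. The root table is assumed to have been chosen already, the extension-column estimate is assumed to be supplied by the caller (the source does not give the estimation formula), and "keep a subset" of child columns is approximated by keeping the first few.

```python
from collections import defaultdict


def simplify_schema(tables, relationships, root, estimated_columns,
                    threshold=1000, max_child_columns=2):
    """Prune a schema to root, children, and grandchildren.

    tables: {name: {"keys": [...], "columns": [...]}}  (hypothetical layout)
    relationships: [(parent, child), ...]
    estimated_columns: caller-supplied estimate of extension columns
    """
    # Threshold check: leave small schemas untouched.
    if estimated_columns <= threshold:
        return tables, relationships

    children_of = defaultdict(list)
    for parent, child in relationships:
        children_of[parent].append(child)

    # Prune distant tables: keep the root, its children, and its grandchildren.
    children = set(children_of[root])
    grandchildren = {g for c in children for g in children_of[c]}
    grandchildren -= children | {root}
    keep = {root} | children | grandchildren

    simplified = {}
    for name, spec in tables.items():
        if name not in keep:
            continue
        if name == root:
            columns = list(spec["columns"])  # root columns are untouched
        elif name in children:
            # Assumption: "a subset" means the first max_child_columns columns.
            columns = list(spec["columns"][:max_child_columns])
        else:
            columns = []  # grandchildren keep only their keys
        simplified[name] = {"keys": list(spec["keys"]), "columns": columns}

    # Update metadata: drop relationships that touch pruned tables.
    kept = [(p, c) for p, c in relationships
            if c in keep and (p == root or p in children)]
    return simplified, kept


tables = {
    "users": {"keys": ["user_id"], "columns": ["age", "country"]},
    "sessions": {"keys": ["session_id", "user_id"],
                 "columns": ["device", "os", "browser"]},
    "transactions": {"keys": ["tx_id", "session_id"], "columns": ["amount"]},
    "refunds": {"keys": ["refund_id", "tx_id"], "columns": ["reason"]},
    "servers": {"keys": ["server_id"], "columns": ["region"]},
    "logs": {"keys": ["log_id", "server_id"], "columns": ["level"]},
}
relationships = [
    ("users", "sessions"),
    ("sessions", "transactions"),
    ("transactions", "refunds"),
    ("servers", "logs"),
]

simplified, kept_relationships = simplify_schema(
    tables, relationships, root="users", estimated_columns=5000
)
```

In this example, `refunds` is dropped because it lies beyond the grandchild level, and `servers` and `logs` are dropped because they are not connected to the chosen root.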

Related Pages

Implemented By

Uses Heuristic
