
Workflow:Sdv dev SDV Constrained synthesis

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Constraints, Data_Quality, Business_Rules
Last Updated 2026-02-14 19:00 GMT

Overview

End-to-end process for generating synthetic data that respects business rules and domain constraints using the SDV constraint system, ensuring synthetic outputs satisfy logical relationships between columns.

Description

This workflow covers adding business-logic constraints to the synthesis process so that generated data satisfies domain-specific rules. SDV provides two constraint systems: the newer CAG (Constraint-Augmented Generation) system, which supports both single-table and multi-table synthesis, and the legacy constraint system, which is single-table only and integrated with the DataProcessor. CAG constraints include FixedCombinations, Inequality, Range, FixedIncrements, OneHotEncoding, and ProgrammableConstraint (for custom logic). Constraints transform the data before model fitting and reverse-transform it after sampling, so the model never sees invalid states. If constraints are not perfectly satisfied after the reverse transformation, a reject-sampling loop filters out invalid rows.

Usage

Execute this workflow when your synthetic data must satisfy logical rules such as: start dates must precede end dates, certain column combinations must match real-world categories, values must fall within specific ranges, or custom business rules defined in Python. This is essential for generating realistic data for systems that enforce data validation rules.

Execution Steps

Step 1: Identify business rules

Analyze the real data and domain requirements to identify which logical constraints must hold in the synthetic output. Common constraint types include column ordering (inequalities), fixed value combinations, numerical ranges, and custom validation functions.

Key considerations:

  • Inequality constraints enforce ordering between two columns (e.g., start_date < end_date)
  • FixedCombinations constraints ensure certain column value tuples only appear in combinations seen in real data
  • Range constraints enforce that a value falls between two other columns or fixed bounds
  • FixedIncrements constraints ensure numerical values are multiples of a specified increment
  • OneHotEncoding constraints ensure exactly one column in a group is 1 and the rest are 0
  • ProgrammableConstraint allows arbitrary Python validation and transformation logic
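Before encoding any constraints, it helps to audit the candidate rules against the real data to see which ones actually hold. The sketch below is a minimal, pure-Python audit; the column names, rows, and rule predicates are hypothetical stand-ins, not SDV API.

```python
# Audit candidate business rules against real data before choosing constraints.
# Rows, column names, and rules here are hypothetical.

rows = [
    {"start": "2024-01-01", "end": "2024-02-01", "plan": "pro", "region": "EU"},
    {"start": "2024-03-05", "end": "2024-03-01", "plan": "free", "region": "US"},  # ordering violated
]

# One predicate per candidate rule (ISO dates compare correctly as strings).
rules = {
    "start_before_end": lambda r: r["start"] < r["end"],
    "known_plan_region": lambda r: (r["plan"], r["region"]) in {("pro", "EU"), ("free", "US")},
}

# Count violations per rule to decide which constraints are worth encoding.
violations = {
    name: sum(1 for r in rows if not check(r))
    for name, check in rules.items()
}
print(violations)
```

A rule that the real data already violates is a sign it should be relaxed or modeled differently, rather than forced onto the synthesizer.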

Step 2: Define constraint objects

Instantiate constraint objects from the CAG module, specifying the relevant columns and parameters. Each constraint class has a specific constructor signature that defines which columns participate and the relationship between them.

Key considerations:

  • Constraints are instantiated with column name arguments and optional configuration
  • Inequality takes low_column_name and high_column_name, with an optional strict flag controlling whether the inequality is strict (<) or allows equality (<=)
  • FixedCombinations takes a list of column_names whose joint values are preserved
  • Range takes low_column_name, middle_column_name, and high_column_name
  • ProgrammableConstraint requires defining is_valid, transform, and reverse_transform methods
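The three methods a ProgrammableConstraint-style class must define can be illustrated with a plain-Python sketch; this does not reproduce the SDV base class or its exact signatures, and the rule, class name, and row format are illustrative.

```python
# A plain-Python sketch of the is_valid / transform / reverse_transform trio
# described above. Data is a list of dicts; in SDV it would be a DataFrame.

class StartBeforeEnd:
    """Custom rule: `start` must be strictly less than `end`."""

    def is_valid(self, rows):
        # One boolean per row: True when the rule holds.
        return [r["start"] < r["end"] for r in rows]

    def transform(self, rows):
        # Re-encode `end` as a gap so the model cannot emit an ordering violation.
        return [{**r, "end": r["end"] - r["start"]} for r in rows]

    def reverse_transform(self, rows):
        # Invert the encoding exactly: end = start + gap.
        return [{**r, "end": r["start"] + r["end"]} for r in rows]

rows = [{"start": 1, "end": 5}, {"start": 3, "end": 10}]
c = StartBeforeEnd()
assert c.is_valid(rows) == [True, True]
assert c.reverse_transform(c.transform(rows)) == rows  # invertible round-trip
```

The round-trip assertion matters: if transform and reverse_transform are not exact inverses, the synthesizer cannot reconstruct the original column semantics after sampling.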

Step 3: Add constraints to synthesizer

Attach constraint objects to the synthesizer before fitting. For single-table synthesizers, use add_constraints with a list of constraint instances. For multi-table synthesizers, constraints are similarly added and apply across the data dictionary.

Key considerations:

  • Constraints must be added before calling fit
  • Multiple constraints can be combined on the same synthesizer
  • The synthesizer validates that constraint columns exist in the metadata
  • Adding constraints after fitting requires refitting the model
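The add-before-fit contract in the bullets above can be sketched with a toy synthesizer; this is illustrative plumbing in plain Python, not the SDV synthesizer API, and all names are hypothetical.

```python
# A toy sketch of the add-before-fit contract: constraints must reference
# columns known to the metadata and must be attached before fitting.

class ToySynthesizer:
    def __init__(self, columns):
        self.columns = set(columns)   # stand-in for metadata
        self.constraints = []
        self.fitted = False

    def add_constraints(self, constraints):
        if self.fitted:
            raise RuntimeError("add constraints before fit(); otherwise refit the model")
        for c in constraints:
            missing = set(c["columns"]) - self.columns
            if missing:
                raise ValueError(f"unknown constraint columns: {missing}")
        self.constraints.extend(constraints)

    def fit(self, data):
        self.fitted = True

syn = ToySynthesizer(columns=["start", "end", "amount"])
syn.add_constraints([{"name": "start_before_end", "columns": ["start", "end"]}])
syn.fit(data=[])
```

Validating column names at add time, rather than at fit time, surfaces typos early, which mirrors the metadata check the synthesizer performs.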

Step 4: Fit synthesizer with constraints

Call fit with the real data. The constraint pipeline transforms the data (e.g., replacing a high column with the difference between high and low) before the model sees it, then reverse-transforms during sampling. The DataProcessor integrates legacy constraints while CAG constraints operate at the synthesizer level.

Key considerations:

  • Constraint transforms must be invertible (the reverse transform reconstructs original column semantics)
  • The model learns distributions on the transformed data, which is constraint-free
  • Invalid rows in the training data are flagged and can raise warnings
  • Constraints fit on the training data to learn parameters (e.g., valid combinations)
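Why the model "never sees invalid states" can be shown numerically: replace the high column with a non-negative gap before fitting, and any gap the model samples reverse-transforms into a valid pair. This sketch mirrors the Inequality-style transform described above with a trivial stand-in model, not SDV internals.

```python
import random

# (low, high) training pairs with low < high everywhere.
train = [(1.0, 4.0), (2.0, 7.0), (0.5, 0.9)]

# Forward transform: (low, high) -> (low, gap) with gap >= 0.
transformed = [(low, high - low) for low, high in train]

# Stand-in "model": sample gaps uniformly within the observed gap range.
gaps = [g for _, g in transformed]
random.seed(0)
samples = [(random.uniform(0, 3), random.uniform(min(gaps), max(gaps)))
           for _ in range(100)]

# Reverse transform: (low, gap) -> (low, low + gap). Ordering holds by construction.
synthetic = [(low, low + gap) for low, gap in samples]
assert all(low <= high for low, high in synthetic)
```

Because the transformed space contains no invalid configurations, the model's distribution fitting is unburdened by the constraint; the ordering is restored deterministically on the way back.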

Step 5: Sample constrained synthetic data

Generate synthetic rows. After the model produces raw samples and reverse-transforms are applied, a validation step checks each row against all constraints. Rows that violate constraints are rejected and regenerated in a reject-sampling loop until the requested count is met.

Key considerations:

  • Reject sampling may require multiple generation rounds if constraint satisfaction is difficult
  • The synthesizer tracks and reports constraint violation rates
  • Overly restrictive constraints can lead to slow sampling or failures
  • The ConstraintsNotMetError is raised if constraints cannot be satisfied after maximum attempts
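The reject-sampling loop can be sketched in a few lines of plain Python: draw a batch, keep rows passing validation, and stop once the requested count is met or a retry budget is exhausted. Function names and the error message are illustrative, not SDV's.

```python
import random

def is_valid(row):
    # Illustrative constraint: low must be strictly below high.
    return row["low"] < row["high"]

def sample_batch(n, rng):
    # Stand-in for the model's raw (unconstrained) samples.
    return [{"low": rng.random(), "high": rng.random()} for _ in range(n)]

def sample_constrained(num_rows, max_tries=10, rng=None):
    rng = rng or random.Random(0)
    kept = []
    for _ in range(max_tries):
        kept += [r for r in sample_batch(num_rows, rng) if is_valid(r)]
        if len(kept) >= num_rows:
            return kept[:num_rows]
    raise RuntimeError("constraints not met after maximum attempts")

rows = sample_constrained(50)
assert len(rows) == 50 and all(is_valid(r) for r in rows)
```

Here roughly half of each raw batch survives, so a couple of rounds suffice; a constraint with a very low acceptance rate would exhaust max_tries, which is the failure mode behind slow sampling and the ConstraintsNotMetError noted above.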

Step 6: Validate constraint satisfaction

Verify that all generated rows satisfy the defined constraints by checking the constraint is_valid method against the synthetic data. Report any violations and assess overall constraint satisfaction rate.

Key considerations:

  • Use each constraint's is_valid method to produce a boolean mask
  • The CAG utility _get_invalid_rows identifies which rows violate which constraints
  • 100% satisfaction is expected after successful sampling
  • For ProgrammableConstraints, test the custom validation logic thoroughly
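The final validation pass reduces to applying each constraint's validity predicate to every row and summarizing the resulting boolean masks. The sketch below uses hypothetical predicates and data in plain Python rather than SDV's own validation utilities.

```python
# Per-constraint satisfaction report over synthetic rows.
# Constraints and rows are illustrative.

synthetic = [
    {"start": 1, "end": 5, "qty": 10},
    {"start": 2, "end": 8, "qty": 15},
    {"start": 0, "end": 3, "qty": 7},   # qty is not a multiple of 5
]

constraints = {
    "start_before_end": lambda r: r["start"] < r["end"],
    "qty_increment_of_5": lambda r: r["qty"] % 5 == 0,
}

report = {}
for name, is_valid in constraints.items():
    mask = [is_valid(r) for r in synthetic]     # one boolean per row
    report[name] = sum(mask) / len(mask)        # satisfaction rate

print(report)
```

After a successful constrained sampling run, every rate should be 1.0; anything lower indicates a gap between the constraint's transform logic and its validation logic and should be investigated.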

Execution Diagram

GitHub URL

Workflow Repository