
Heuristic: SDV Sampling Retry Tuning

From Leeroopedia
Knowledge Sources
Domains Optimization, Debugging, Synthetic_Data
Last Updated 2026-02-14 19:00 GMT

Overview

Tune the `max_tries_per_batch` and `batch_size` parameters to resolve sampling failures when using conditional generation or constrained synthesis.

Description

SDV uses reject sampling to generate synthetic rows that satisfy conditions or constraints. When sampling with conditions, SDV generates a batch of candidate rows, keeps those that satisfy the conditions, and retries if not enough valid rows are produced. The retry logic uses an adaptive strategy: each retry generates `min(10 * batch_size, remaining / valid_rate)` rows, deliberately overshooting to compensate for the rejection rate. If the valid rate is very low (tight constraints or out-of-distribution conditions), the default parameters may be insufficient.
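The loop can be sketched in plain Python. This is a simplified standalone model of the retry strategy, not SDV's actual implementation; the function name and the uniform-random "rows" are illustrative:

```python
# Simplified standalone model of SDV-style reject sampling with adaptive
# overshoot (illustrative only, not SDV internals).
import random

def conditional_sample(num_rows, is_valid, batch_size=100, max_tries=100, seed=0):
    """Collect `num_rows` values that pass `is_valid`, retrying with overshoot."""
    rng = random.Random(seed)
    valid = []
    num_rows_to_sample = batch_size
    for _ in range(max_tries):
        candidates = [rng.random() for _ in range(num_rows_to_sample)]
        new_valid = [c for c in candidates if is_valid(c)]
        valid.extend(new_valid)
        if len(valid) >= num_rows:
            return valid[:num_rows]
        # Adaptive overshoot: scale the next batch by the observed valid rate,
        # capped at 10 * batch_size.
        remaining = num_rows - len(valid)
        valid_rate = max(len(new_valid), 1) / max(num_rows_to_sample, 1)
        num_rows_to_sample = min(10 * batch_size, int(remaining / valid_rate))
    raise ValueError('Unable to sample enough rows for the given conditions')

# A tight condition (~5% valid rate) still succeeds thanks to the overshoot.
rows = conditional_sample(50, lambda x: x < 0.05, batch_size=100, max_tries=100)
print(len(rows))  # 50
```

Lowering `max_tries` to 1 in this sketch reproduces the failure mode: one batch of 100 candidates yields only ~5 valid rows, and the loop gives up.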

Usage

Apply this heuristic when you encounter sampling failures with conditional generation:

  • `ValueError: Unable to sample any rows for the given conditions`
  • Extremely slow sampling with conditions
  • Multi-table sampling where `scale` parameter produces tables with only 1 row

The Insight (Rule of Thumb)

  • Action 1: Increase `max_tries_per_batch` (default: 100) in the `sample()` call to allow more retry attempts.
  • Action 2: Increase `batch_size` in the `sample()` call to generate more candidate rows per attempt.
  • Action 3: For multi-table sampling, increase the `scale` parameter if child tables produce only 1 row.
  • Trade-off: Increasing these values increases sampling time proportionally. Very large values may cause memory issues for large datasets.
  • Alternative: If conditions are out-of-bounds for the model, consider retraining with data that better covers the desired condition space.
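To pick a value for Action 2 rather than guessing, you can back out a batch size from an observed valid rate. This helper is hypothetical (not part of SDV); the `margin` factor is an assumed safety buffer for sampling variance:

```python
# Hypothetical helper (not part of SDV): estimate a batch_size from an
# observed valid rate, with a safety margin for sampling variance.
import math

def suggest_batch_size(num_rows, valid_rate, margin=2.0):
    """Batch size expected to yield `num_rows` valid rows in one attempt."""
    if not 0 < valid_rate <= 1:
        raise ValueError('valid_rate must be in (0, 1]')
    return math.ceil(num_rows * margin / valid_rate)

# With a 2% valid rate, sampling 100 rows suggests ~10,000 candidates per batch.
print(suggest_batch_size(100, 0.02))  # 10000
```

The trade-off from the list above applies directly: a 10,000-row batch takes roughly 100x the time and memory of the default.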

Reasoning

The retry logic allows each retry to generate up to `10 * batch_size` candidate rows, adaptively enlarging the pool of candidates. The valid rate formula `num_new_valid_rows / num_rows_to_sample` tracks what fraction of generated rows pass the condition filter. When conditions are narrow (e.g., very specific categorical values or tight ranges), the valid rate drops, requiring proportionally more candidates. The 10x cap is a heuristic that balances generating enough candidates against overshooting memory limits.
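Plugging concrete numbers into the formula shows why the cap forces multiple retries when the valid rate is low. The values below are illustrative:

```python
# Worked example of the adaptive retry formula with a 2% valid rate.
batch_size = 100
num_valid = 2            # valid rows collected so far
num_rows_to_sample = 100 # candidates generated in the last attempt
num_new_valid_rows = 2   # of which 2 passed the condition filter

remaining = batch_size - num_valid                                    # 98
valid_rate = max(num_new_valid_rows, 1) / max(num_rows_to_sample, 1)  # 0.02
next_batch = min(10 * batch_size, int(remaining / valid_rate))
print(next_batch)  # 1000 -- 98 / 0.02 = 4900 candidates needed, capped at 10x
```

Because the uncapped estimate (4,900) exceeds the cap (1,000), several retries are required, and each retry consumes one of the `max_tries_per_batch` attempts.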

For multi-table sampling, the `scale` parameter controls how many child rows are generated per parent. When `scale` is too small, some child tables may have only 1 row, producing low-quality data.
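The arithmetic below is an assumption used for illustration (child-table row counts shrinking roughly in proportion to `scale`), not SDV internals, but it shows how a small `scale` collapses child tables toward the 1-row floor that triggers the warning:

```python
# Illustrative arithmetic (an assumption, not SDV internals): child-table
# row counts shrink roughly in proportion to `scale`.
avg_child_rows_per_parent = 8
num_parents = 5

for scale in (1.0, 0.1, 0.02):
    total = max(1, round(scale * avg_child_rows_per_parent * num_parents))
    print(scale, total)  # 1.0 -> 40, 0.1 -> 4, 0.02 -> 1 (warning territory)
```

At `scale=0.02` the estimate rounds down to the 1-row floor, which is the low-quality regime the warning in the Code Evidence section flags.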

Code Evidence

Adaptive retry logic from `sdv/single_table/base.py:1019-1021`:

remaining = batch_size - num_valid
valid_rate = max(num_new_valid_rows, 1) / max(num_rows_to_sample, 1)
num_rows_to_sample = min(10 * batch_size, int(remaining / valid_rate))

Error guidance from `sdv/single_table/base.py:1134-1138`:

user_msg = user_msg + (
    f"Try increasing 'max_tries_per_batch' (currently: {max_tries_per_batch}) "
    f"or increasing 'batch_size' (currently: {batch_size}). Note that "
    'increasing these values will also increase the sampling time.'
)

Multi-table scale warning from `sdv/sampling/hierarchical_sampler.py:326-330`:

warn_msg = (
    "The 'scale' parameter is too small. Some tables may have 1 row."
    ' For better quality data, please choose a larger scale.'
)
warnings.warn(warn_msg)
