Implementation:NVIDIA NeMo Curator DataDesignerStage

From Leeroopedia
Knowledge Sources
Domains: Synthetic Data, LLM Integration, Data Augmentation
Last Updated: 2026-02-14 00:00 GMT

Overview

DataDesignerStage integrates NVIDIA's NeMo Data Designer (NDD) into NeMo Curator pipelines, enabling LLM-based synthetic data generation as a processing stage that takes seed documents and produces augmented data.

Description

DataDesignerStage extends ProcessingStage[DocumentBatch, DocumentBatch] as a Python dataclass. It wraps the NeMo Data Designer library to generate synthetic data from seed input records. The stage accepts configuration through either a DataDesignerConfigBuilder object or a config file path (exactly one must be provided).

During __post_init__, the stage validates that exactly one configuration source is set, initializes the DataDesigner instance (optionally with custom model providers), and sets default resources with zero GPUs (customizable via .with_(resources=Resources(gpus=X))).

The process() method sets the input DocumentBatch as the seed DataFrame on the config builder, then calls DataDesigner.preview() with num_records equal to the number of input records to generate synthetic records. It measures the NDD execution time, collects token statistics (median input and output tokens per record, aggregated across all LLM-generated columns), and logs these metrics together with the input and output record counts. When verbose is False (the default), NDD's own logger is raised to WARNING level for the duration of the generation call and restored afterwards.

Usage

Use DataDesignerStage when you need to augment an existing dataset with LLM-generated synthetic data within a Curator pipeline. Configure it with either an NDD config builder or an NDD config file, optionally supply custom model providers (for example, for testing), and insert it into the pipeline to generate synthetic records from seed data.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/synthetic/nemo_data_designer/data_designer.py
  • Lines: 1-143

Signature

@dataclass
class DataDesignerStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    config_builder: dd.DataDesignerConfigBuilder | None = None
    data_designer_config_file: str | None = None
    model_providers: list | None = None
    verbose: bool = False
    data_designer: DataDesigner = field(init=False)

    def __post_init__(self) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

I/O Contract

Inputs

Name | Type | Required | Description
config_builder | DataDesignerConfigBuilder | Conditional | NDD config builder object. Exactly one of this or data_designer_config_file must be set.
data_designer_config_file | str | Conditional | Path to an NDD config file. Exactly one of this or config_builder must be set.
model_providers | list | No | List of custom ModelProvider instances for custom or test LLM endpoints. Defaults to NDD's built-in providers.
verbose | bool | No | When True, NDD logging is shown in full. When False (default), NDD logs below WARNING level are suppressed.
batch | DocumentBatch | Yes | Argument to process(), not a constructor field. The input document batch used as seed data for synthetic generation.

Outputs

Name | Type | Description
result | DocumentBatch | A document batch containing the synthetically generated data, preserving task metadata from the input batch.

Key Implementation Details

Configuration Validation

The __post_init__ method enforces that exactly one configuration source is provided:

if self.config_builder is None and self.data_designer_config_file is None:
    msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
    raise ValueError(msg)
if self.config_builder is not None and self.data_designer_config_file is not None:
    msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
    raise ValueError(msg)
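To see the two checks in isolation, here is a minimal, self-contained sketch. The class ExactlyOneConfigSketch and the helper try_build are hypothetical stand-ins written for illustration; they are not part of NeMo Curator, and the real stage goes on to initialize a DataDesigner instance after validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExactlyOneConfigSketch:
    """Hypothetical stand-in for the stage; not the NeMo Curator class."""

    config_builder: Optional[object] = None
    data_designer_config_file: Optional[str] = None

    def __post_init__(self) -> None:
        # Same two checks as the excerpt above: reject "neither", then "both".
        if self.config_builder is None and self.data_designer_config_file is None:
            msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
            raise ValueError(msg)
        if self.config_builder is not None and self.data_designer_config_file is not None:
            msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
            raise ValueError(msg)


def try_build(**kwargs: object) -> str:
    """Return 'ok' on success, or the ValueError message on failure."""
    try:
        ExactlyOneConfigSketch(**kwargs)
    except ValueError as err:
        return str(err)
    return "ok"


# One source set: accepted. Neither or both: rejected with a ValueError.
print(try_build(data_designer_config_file="config.yaml"))
print(try_build())
print(try_build(config_builder=object(), data_designer_config_file="config.yaml"))
```

Because the checks run in __post_init__, a misconfigured stage fails at construction time rather than when the pipeline first calls process().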

Logging Suppression

When verbose=False, NDD log output (e.g., "Preview generation in progress") is suppressed by temporarily raising the NDD logger level to WARNING:

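# Excerpt from process(); num_input_records is defined earlier in the method.
ndd_logger = logging.getLogger("data_designer")
if not self.verbose:
    # Remember the current level so it can be restored afterwards.
    _old_ndd_level = ndd_logger.level
    ndd_logger.setLevel(logging.WARNING)
try:
    results = self.data_designer.preview(self.config_builder, num_records=num_input_records)
finally:
    # Restore the original level even if preview() raises.
    if not self.verbose:
        ndd_logger.setLevel(_old_ndd_level)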
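The save, raise, and restore pattern above can also be packaged as a context manager. The following is a sketch using only the standard library; quiet_logger and the demo logger name "data_designer.sketch_demo" are hypothetical and do not appear in the NeMo Curator source.

```python
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def quiet_logger(
    name: str, level: int = logging.WARNING, enabled: bool = True
) -> Iterator[logging.Logger]:
    """Temporarily set a logger's level, restoring the old level on exit."""
    logger = logging.getLogger(name)
    old_level = logger.level  # the logger's own level, not the effective one
    if enabled:
        logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs on normal exit and when the body raises.
        if enabled:
            logger.setLevel(old_level)


# Demonstration on a throwaway logger (name chosen for illustration only).
demo = logging.getLogger("data_designer.sketch_demo")
demo.setLevel(logging.INFO)
with quiet_logger("data_designer.sketch_demo"):
    level_inside = demo.level  # WARNING while the block runs
level_after = demo.level  # back to INFO afterwards
```

With such a helper, the stage's call could be wrapped as `with quiet_logger("data_designer", enabled=not self.verbose): ...`. Note that changing the level of a parent logger only affects child loggers that have not set a level of their own.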

Metrics Collection

The stage collects and logs performance metrics including NDD running time, input/output record counts, and median token usage per record aggregated across all LLM-generated columns:

self._log_metrics({
    "ndd_running_time": ndd_running_time,
    "num_input_records": float(num_input_records),
    "num_output_records": float(num_output_records),
    "input_tokens_median_per_record": float(input_tokens_median_per_record),
    "output_tokens_median_per_record": float(output_tokens_median_per_record),
})
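The excerpt does not show how the token medians are derived from NDD's results. As an illustration only, the sketch below assumes hypothetical per-column lists of per-record token counts, sums them across columns for each record, and takes the median of those totals. Both the input layout and the sum-then-median order are assumptions, not the stage's confirmed behavior.

```python
import time
from statistics import median


def median_tokens_per_record(per_column_tokens: dict[str, list[int]]) -> float:
    """Sum token counts across columns per record, then take the median.

    Assumes every column lists one count per record, in the same order.
    """
    if not per_column_tokens:
        return 0.0
    per_record_totals = [sum(counts) for counts in zip(*per_column_tokens.values())]
    return float(median(per_record_totals)) if per_record_totals else 0.0


# Hypothetical token counts for three records and two LLM-generated columns.
input_tokens = {"question": [10, 12, 30], "answer": [5, 8, 7]}
output_tokens = {"question": [40, 20, 60], "answer": [100, 80, 90]}

start = time.perf_counter()
# ... the generation call would run here in the real stage ...
ndd_running_time = time.perf_counter() - start

# Every value is a float, matching the shape of the metrics dict above.
metrics = {
    "ndd_running_time": ndd_running_time,
    "num_input_records": float(3),
    "num_output_records": float(3),
    "input_tokens_median_per_record": median_tokens_per_record(input_tokens),
    "output_tokens_median_per_record": median_tokens_per_record(output_tokens),
}
```

Here the per-record input totals are 15, 20, and 37, so the input median is 20.0. Converting counts to float keeps every metric value the same type, as in the dict passed to _log_metrics.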

Usage Examples

Using a Config Builder

import data_designer.config as dd
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

config_builder = dd.DataDesignerConfigBuilder()
# ... configure the builder ...

stage = DataDesignerStage(config_builder=config_builder)

Using a Config File

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

stage = DataDesignerStage(data_designer_config_file="/path/to/ndd_config.yaml")

With Custom Model Providers

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
from nemo_curator.stages.resources import Resources

# my_custom_provider is a placeholder for a ModelProvider instance defined elsewhere.
stage = DataDesignerStage(
    data_designer_config_file="/path/to/config.yaml",
    model_providers=[my_custom_provider],
    verbose=True,  # show NDD's own log output in full
).with_(resources=Resources(gpus=1))  # override the zero-GPU default
