Implementation:NVIDIA NeMo Curator DataDesignerStage

Knowledge Sources	NVIDIA NeMo Curator
Domains	Synthetic Data, LLM Integration, Data Augmentation
Last Updated	2026-02-14 00:00 GMT

Overview

DataDesignerStage integrates NVIDIA's NeMo Data Designer (NDD) into NeMo Curator pipelines, enabling LLM-based synthetic data generation as a processing stage that takes seed documents and produces augmented data.

Description

DataDesignerStage extends ProcessingStage[DocumentBatch, DocumentBatch] as a Python dataclass. It wraps the NeMo Data Designer library to generate synthetic data from seed input records. The stage accepts configuration through either a DataDesignerConfigBuilder object or a config file path (exactly one must be provided).

During __post_init__, the stage validates that exactly one configuration source is set, initializes the DataDesigner instance (optionally with custom model providers), and sets default resources with zero GPUs (customizable via .with_(resources=Resources(gpus=X))).

The process() method sets the input DocumentBatch as the seed DataFrame on the config builder and calls DataDesigner.preview() to generate synthetic records, requesting one record per input record. It then collects token statistics (median input and output tokens per record across all LLM columns), measures execution time, and logs these metrics. When verbose is False (the default), NDD's own logger is raised to WARNING level for the duration of generation and restored afterwards.

Usage

Use DataDesignerStage when you need to augment an existing dataset with LLM-generated synthetic data within a Curator pipeline. Configure it with an NDD config builder or an NDD config file (exactly one of the two), optionally pass custom model providers (for example, test endpoints), and insert it into the pipeline to generate synthetic records from seed data.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/synthetic/nemo_data_designer/data_designer.py
Lines: 1-143

Signature

@dataclass
class DataDesignerStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    config_builder: dd.DataDesignerConfigBuilder | None = None
    data_designer_config_file: str | None = None
    model_providers: list | None = None
    verbose: bool = False
    data_designer: DataDesigner = field(init=False)

    def __post_init__(self) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

I/O Contract

Inputs

Name	Type	Required	Description
config_builder	`DataDesignerConfigBuilder`	Conditional	NDD config builder object. Either this or `data_designer_config_file` must be set, but not both.
data_designer_config_file	`str`	Conditional	Path to an NDD config file. Either this or `config_builder` must be set, but not both.
model_providers	`list`	No	List of `ModelProvider` instances for custom or test LLM endpoints. Defaults to NDD's built-in providers.
verbose	`bool`	No	When True, NDD logging is shown in full. When False (default), NDD logs are suppressed to WARNING level.
batch	`DocumentBatch`	Yes	The input document batch used as seed data for synthetic generation

Outputs

Name	Type	Description
result	`DocumentBatch`	A document batch containing the synthetically generated data, preserving task metadata from the input batch

Key Implementation Details

Configuration Validation

The __post_init__ method enforces that exactly one configuration source is provided:

if self.config_builder is None and self.data_designer_config_file is None:
    msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
    raise ValueError(msg)
if self.config_builder is not None and self.data_designer_config_file is not None:
    msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
    raise ValueError(msg)
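
The behaviour of these two checks can be tried without installing NeMo Curator or NDD. The following is a minimal sketch under stated assumptions: ConfigSourceDemo and describe are hypothetical names introduced only for illustration and are not part of either library. Only the two checks and their messages are taken from the excerpt above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ConfigSourceDemo:
    """Hypothetical stand-in that mirrors only the two checks shown above."""

    config_builder: object | None = None
    data_designer_config_file: str | None = None

    def __post_init__(self) -> None:
        # Neither source given: nothing to build a configuration from.
        if self.config_builder is None and self.data_designer_config_file is None:
            msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
            raise ValueError(msg)
        # Both sources given: ambiguous, so reject rather than pick one.
        if self.config_builder is not None and self.data_designer_config_file is not None:
            msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
            raise ValueError(msg)


def describe(**kwargs: object) -> str:
    """Return 'accepted' or the rejection message for a set of arguments."""
    try:
        ConfigSourceDemo(**kwargs)
    except ValueError as err:
        return f"rejected: {err}"
    return "accepted"


if __name__ == "__main__":
    print(describe(config_builder=object()))
    print(describe(data_designer_config_file="/path/to/ndd_config.yaml"))
    print(describe())
    print(describe(config_builder=object(), data_designer_config_file="/path/to/ndd_config.yaml"))
```

Because the checks run in __post_init__, a misconfigured stage fails at construction time rather than when the first batch is processed.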

Logging Suppression

When verbose=False, NDD log output (e.g., "Preview generation in progress") is suppressed by temporarily raising the NDD logger level to WARNING:

ndd_logger = logging.getLogger("data_designer")
if not self.verbose:
    _old_ndd_level = ndd_logger.level
    ndd_logger.setLevel(logging.WARNING)
try:
    results = self.data_designer.preview(self.config_builder, num_records=num_input_records)
finally:
    if not self.verbose:
        ndd_logger.setLevel(_old_ndd_level)
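
The same save, raise, and restore pattern can be packaged as a context manager. This is a sketch of the pattern only, not the stage's actual code: quiet_logger is a hypothetical helper name, and it relies solely on the standard library logging and contextlib modules.

```python
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def quiet_logger(
    name: str,
    level: int = logging.WARNING,
    enabled: bool = True,
) -> Iterator[logging.Logger]:
    """Temporarily set a logger's level, restoring it even if the body raises."""
    logger = logging.getLogger(name)
    old_level = logger.level  # remember the level in effect before the change
    if enabled:
        logger.setLevel(level)
    try:
        yield logger
    finally:
        if enabled:
            logger.setLevel(old_level)


if __name__ == "__main__":
    demo = logging.getLogger("data_designer_demo")  # demo name, not the real NDD logger
    demo.setLevel(logging.INFO)
    with quiet_logger("data_designer_demo"):
        print(demo.level == logging.WARNING)
    print(demo.level == logging.INFO)
```

Restoring the level in a finally block matters because generation can fail; without it, an exception would leave the logger silenced for the rest of the process.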

Metrics Collection

The stage collects and logs performance metrics including NDD running time, input/output record counts, and median token usage per record aggregated across all LLM-generated columns:

self._log_metrics({
    "ndd_running_time": ndd_running_time,
    "num_input_records": float(num_input_records),
    "num_output_records": float(num_output_records),
    "input_tokens_median_per_record": float(input_tokens_median_per_record),
    "output_tokens_median_per_record": float(output_tokens_median_per_record),
})
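
The excerpt does not show how the two median values are derived. One plausible reading, offered here purely as an assumption, is that token counts are summed across the LLM columns for each record and the median of those per-record totals is reported. The sketch below illustrates that reading; median_tokens_per_record and the column names are hypothetical, and the stage may aggregate differently.

```python
from statistics import median


def median_tokens_per_record(tokens_by_column: dict[str, list[int]]) -> float:
    """Median of per-record token totals, summed across all columns.

    Assumption: every column lists one token count per record, in the
    same record order. zip() stops at the shortest column.
    """
    if not tokens_by_column:
        return 0.0
    # Sum across columns so each record contributes a single total.
    per_record_totals = [sum(counts) for counts in zip(*tokens_by_column.values())]
    if not per_record_totals:
        return 0.0
    return float(median(per_record_totals))


if __name__ == "__main__":
    # Hypothetical output-token counts for two LLM columns over three records.
    output_tokens = {"summary": [10, 20, 30], "title": [1, 2, 3]}
    print(median_tokens_per_record(output_tokens))
```

A median is less sensitive than a mean to a few unusually long generations, which is one reason it suits a per-record token summary.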

Usage Examples

Using a Config Builder

import data_designer.config as dd
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

config_builder = dd.DataDesignerConfigBuilder()
# ... configure the builder ...

stage = DataDesignerStage(config_builder=config_builder)

Using a Config File

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

stage = DataDesignerStage(data_designer_config_file="/path/to/ndd_config.yaml")

With Custom Model Providers

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
from nemo_curator.stages.resources import Resources

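# my_custom_provider is a placeholder for a ModelProvider instance defined elsewhere.
stage = DataDesignerStage(
    data_designer_config_file="/path/to/config.yaml",
    model_providers=[my_custom_provider],
    verbose=True,  # show NDD logging in full
).with_(resources=Resources(gpus=1))  # override the zero-GPU default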

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_BaseSyntheticStage - Alternative base class for Nemotron-CC synthetic stages
NVIDIA_NeMo_Curator_NemotronCC_Stages - Concrete Nemotron-CC synthetic stages
