Implementation:NVIDIA NeMo Curator DataDesignerStage
| Knowledge Sources | |
|---|---|
| Domains | Synthetic Data, LLM Integration, Data Augmentation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
DataDesignerStage integrates NVIDIA's NeMo Data Designer (NDD) into NeMo Curator pipelines, enabling LLM-based synthetic data generation as a processing stage that takes seed documents and produces augmented data.
Description
DataDesignerStage extends ProcessingStage[DocumentBatch, DocumentBatch] as a Python dataclass. It wraps the NeMo Data Designer library to generate synthetic data from seed input records. The stage accepts configuration through either a DataDesignerConfigBuilder object or a config file path (exactly one must be provided).
During __post_init__, the stage validates that exactly one configuration source is set, initializes the DataDesigner instance (optionally with custom model providers), and sets default resources with zero GPUs (customizable via .with_(resources=Resources(gpus=X))).
The process() method sets the input DocumentBatch as a seed DataFrame in the config builder, calls DataDesigner.preview() to generate synthetic records, collects token statistics (input/output token medians per record across all LLM columns), measures execution time, and logs all metrics. When verbose is False (default), NDD's own logging is suppressed to WARNING level during generation.
Usage
Use DataDesignerStage when you need to augment an existing dataset with LLM-generated synthetic data within a Curator pipeline. Configure it with an NDD config builder or a config file, optionally specify custom model providers (for example, for testing), and insert it into the pipeline to generate synthetic records from seed data.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/synthetic/nemo_data_designer/data_designer.py
- Lines: 1-143
Signature
@dataclass
class DataDesignerStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    config_builder: dd.DataDesignerConfigBuilder | None = None
    data_designer_config_file: str | None = None
    model_providers: list | None = None
    verbose: bool = False
    data_designer: DataDesigner = field(init=False)

    def __post_init__(self) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...
Import
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config_builder | DataDesignerConfigBuilder | Conditional | NDD config builder object. Either this or data_designer_config_file must be set, but not both. |
| data_designer_config_file | str | Conditional | Path to an NDD config file. Either this or config_builder must be set, but not both. |
| model_providers | list | No | List of custom ModelProvider instances for custom or test LLM endpoints. Defaults to NDD's built-in providers. |
| verbose | bool | No | When True, NDD logging is shown in full. When False (default), NDD logs are suppressed to WARNING level. |
| batch | DocumentBatch | Yes | The input document batch passed to process() and used as seed data for synthetic generation. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | DocumentBatch | A document batch containing the synthetically generated data, preserving task metadata from the input batch. |
Key Implementation Details
Configuration Validation
The __post_init__ method enforces that exactly one configuration source is provided:
if self.config_builder is None and self.data_designer_config_file is None:
    msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
    raise ValueError(msg)
if self.config_builder is not None and self.data_designer_config_file is not None:
    msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
    raise ValueError(msg)
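The exactly-one-of check is easy to exercise in isolation. The sketch below reproduces the same two checks on a stand-in dataclass so the behavior can be tried without installing NeMo Curator or NDD. `ExactlyOneConfigSource` is a hypothetical name used only for illustration; it is not part of either library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExactlyOneConfigSource:
    """Hypothetical stand-in that mirrors the stage's two config fields."""

    config_builder: Optional[object] = None
    data_designer_config_file: Optional[str] = None

    def __post_init__(self) -> None:
        # Neither source given: nothing to configure the generator with.
        if self.config_builder is None and self.data_designer_config_file is None:
            msg = "Either 'config_builder' or 'data_designer_config_file' must be set."
            raise ValueError(msg)
        # Both sources given: ambiguous, so reject rather than pick one silently.
        if self.config_builder is not None and self.data_designer_config_file is not None:
            msg = "Only one of 'config_builder' or 'data_designer_config_file' can be set, not both."
            raise ValueError(msg)
```

Constructing the stand-in with no arguments, or with both arguments, raises ValueError at construction time, so a misconfigured stage would fail before any pipeline work starts.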
Logging Suppression
When verbose=False, NDD log output (e.g., "Preview generation in progress") is suppressed by temporarily raising the NDD logger level to WARNING:
ndd_logger = logging.getLogger("data_designer")
if not self.verbose:
    _old_ndd_level = ndd_logger.level
    ndd_logger.setLevel(logging.WARNING)
try:
    results = self.data_designer.preview(self.config_builder, num_records=num_input_records)
finally:
    if not self.verbose:
        ndd_logger.setLevel(_old_ndd_level)
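The same save, raise, and restore pattern can be packaged as a context manager. This is a minimal sketch using only the standard library, not the stage's actual code; `quiet_logger` and its `enabled` flag are hypothetical names introduced here for illustration.

```python
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def quiet_logger(name: str, level: int = logging.WARNING, enabled: bool = True) -> Iterator[None]:
    """Temporarily raise a logger's level, restoring it on exit."""
    logger = logging.getLogger(name)
    if not enabled:
        # Equivalent to verbose=True: leave the logger untouched.
        yield
        return
    old_level = logger.level
    logger.setLevel(level)
    try:
        yield
    finally:
        # Runs even if the wrapped call raises, so the level is never left changed.
        logger.setLevel(old_level)
```

With this helper, the generation call could be wrapped as `with quiet_logger("data_designer", enabled=not verbose): ...`. The `finally` clause matters for the same reason as in the stage: a failed generation call should not leave the logger muted for the rest of the process.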
Metrics Collection
The stage collects and logs performance metrics including NDD running time, input/output record counts, and median token usage per record aggregated across all LLM-generated columns:
self._log_metrics({
    "ndd_running_time": ndd_running_time,
    "num_input_records": float(num_input_records),
    "num_output_records": float(num_output_records),
    "input_tokens_median_per_record": float(input_tokens_median_per_record),
    "output_tokens_median_per_record": float(output_tokens_median_per_record),
})
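To make the shape of these metrics concrete, the sketch below builds the same dictionary from plain Python values. The aggregation across columns is an assumption: here the per-column medians are summed, which is one plausible reading of "aggregated across all LLM-generated columns"; the stage's real computation may differ. `sum_of_column_medians` and `build_metrics` are hypothetical helper names, not NeMo Curator or NDD functions.

```python
from statistics import median


def sum_of_column_medians(per_column_tokens: dict) -> float:
    """Sum each column's median token count (assumed aggregation, see above).

    per_column_tokens maps a column name to a list of per-record token counts.
    Columns with no counts are skipped.
    """
    return float(sum(median(counts) for counts in per_column_tokens.values() if counts))


def build_metrics(
    ndd_running_time: float,
    num_input_records: int,
    num_output_records: int,
    input_tokens: dict,
    output_tokens: dict,
) -> dict:
    """Assemble a metrics dict with the same keys the stage logs, all as floats."""
    return {
        "ndd_running_time": float(ndd_running_time),
        "num_input_records": float(num_input_records),
        "num_output_records": float(num_output_records),
        "input_tokens_median_per_record": sum_of_column_medians(input_tokens),
        "output_tokens_median_per_record": sum_of_column_medians(output_tokens),
    }
```

Casting every value to float, as the stage does, keeps the metric values uniformly typed regardless of whether they started as counts or durations.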
Usage Examples
Using a Config Builder
import data_designer.config as dd
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
config_builder = dd.DataDesignerConfigBuilder()
# ... configure the builder ...
stage = DataDesignerStage(config_builder=config_builder)
Using a Config File
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
stage = DataDesignerStage(data_designer_config_file="/path/to/ndd_config.yaml")
With Custom Model Providers
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
from nemo_curator.stages.resources import Resources
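# my_custom_provider is a ModelProvider instance defined elsewhere.
stage = DataDesignerStage(
    data_designer_config_file="/path/to/config.yaml",
    model_providers=[my_custom_provider],
    verbose=True,  # show full NDD logging
).with_(resources=Resources(gpus=1))  # override the zero-GPU default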
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_BaseSyntheticStage - Alternative base class for Nemotron-CC synthetic stages
- NVIDIA_NeMo_Curator_NemotronCC_Stages - Concrete Nemotron-CC synthetic stages