Implementation:NVIDIA NeMo Curator AddId

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Curation, Text Processing, Pipeline Stages
Last Updated	2026-02-14 00:00 GMT

Overview

The AddId stage assigns a unique identifier string to each document record in a DocumentBatch, so that downstream stages such as splitting and joining can track documents by ID.

Description

AddId is a dataclass-based ProcessingStage that generates a globally unique identifier for every document in a batch. Each ID combines the batch's UUID with the document's zero-based sequential index within the batch, producing the format {prefix}_{uuid}_{i} (when a prefix is specified) or {uuid}_{i} (when no prefix is provided). The IDs are deterministic for a given batch UUID and row order.
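
The ID format described above can be sketched in plain Python. This is a minimal illustration, not the library's code: build_ids is a hypothetical helper, and the batch UUID is passed in as a plain string rather than read from a DocumentBatch.

```python
from typing import List, Optional


def build_ids(batch_uuid: str, num_docs: int, id_prefix: Optional[str] = None) -> List[str]:
    """Hypothetical helper mirroring the documented ID format.

    Produces "{prefix}_{uuid}_{i}" when a prefix is given,
    otherwise "{uuid}_{i}", with i counting rows from 0.
    """
    base = f"{id_prefix}_{batch_uuid}" if id_prefix else batch_uuid
    return [f"{base}_{i}" for i in range(num_docs)]


# The same UUID and row count always yield the same IDs.
ids_plain = build_ids("abc123", 2)
ids_prefixed = build_ids("abc123", 2, id_prefix="wiki")
```

Because the index restarts at 0 in every batch, uniqueness across batches comes entirely from the batch UUID.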

The stage is configurable via three parameters: id_field (the column name where the ID will be stored), id_prefix (an optional string prepended to each ID), and overwrite (a boolean controlling whether an existing ID column should be replaced). If the target column already exists and overwrite is False, the stage raises a ValueError. If overwrite is True, the stage logs a warning and replaces the column.
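
The overwrite rule can be illustrated with a small stand-alone sketch. It assumes a batch can be modeled as a dict mapping column names to lists; assign_id_column is a hypothetical function written for this page, not part of NeMo Curator, and the message texts are invented for illustration.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def assign_id_column(
    columns: Dict[str, List[str]],
    id_field: str,
    ids: List[str],
    overwrite: bool = False,
) -> Dict[str, List[str]]:
    """Hypothetical stand-in for the documented overwrite behavior."""
    if id_field in columns:
        if not overwrite:
            # Existing column and overwrite disabled: refuse to continue.
            raise ValueError(f"Column '{id_field}' already exists and overwrite is False")
        # Existing column and overwrite enabled: warn, then replace.
        logger.warning("Column '%s' already exists and will be overwritten", id_field)
    updated = dict(columns)  # shallow copy so the caller's mapping is left untouched
    updated[id_field] = list(ids)
    return updated
```

Raising by default means an accidental second AddId stage fails fast instead of silently replacing IDs that other stages may already reference.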

The stage declares ["data"] as its required input artifact and produces ["data"] with the configured id_field as an additional output column.
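
A rough sketch of how these declarations might look follows. It assumes that the two lists returned by inputs() and outputs() are, in order, required task attributes and data columns; AddIdSketch is a hypothetical stand-alone dataclass that does not subclass ProcessingStage.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AddIdSketch:
    """Hypothetical stand-in showing only the input/output declarations."""

    id_field: str
    id_prefix: Optional[str] = None
    overwrite: bool = False

    def inputs(self) -> Tuple[List[str], List[str]]:
        # Requires the "data" attribute; no specific input column is assumed.
        return ["data"], []

    def outputs(self) -> Tuple[List[str], List[str]]:
        # Produces "data" again, now carrying the configured ID column.
        return ["data"], [self.id_field]


stage = AddIdSketch(id_field="doc_id")
declared_outputs = stage.outputs()
```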

Usage

Use AddId whenever documents need unique identifiers before downstream processing. This is especially important as a prerequisite for DocumentSplitter and DocumentJoiner, which rely on document IDs to split and reconstruct documents. It is also useful for general traceability and deduplication tracking.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/modules/add_id.py
Lines: 1-82

Signature

@dataclass
class AddId(ProcessingStage[DocumentBatch, DocumentBatch]):
    id_field: str
    id_prefix: str | None = None
    overwrite: bool = False
    name: str = "add_id"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch | None: ...

Import

from nemo_curator.stages.text.modules.add_id import AddId

I/O Contract

Inputs

Name	Type	Required	Description
id_field	str	Yes	The column name where the generated ID will be stored
id_prefix	str or None	No	An optional prefix prepended to each generated ID (default: None)
overwrite	bool	No	Whether to overwrite an existing ID column (default: False)
batch	DocumentBatch	Yes	The input batch of documents to process

Outputs

Name	Type	Description
DocumentBatch	DocumentBatch	A new batch with the ID column added to the DataFrame, containing IDs in the format {prefix}_{uuid}_{index}, or {uuid}_{index} when no prefix is set

Usage Examples

Basic Usage

from nemo_curator.stages.text.modules.add_id import AddId

# Create a stage that adds IDs to the "doc_id" column
add_id_stage = AddId(id_field="doc_id")

# input_batch is assumed to be an existing DocumentBatch without a "doc_id" column.
# Each document gets a unique ID such as "{uuid}_0", "{uuid}_1", and so on.
output_batch = add_id_stage.process(input_batch)

With Prefix and Overwrite

from nemo_curator.stages.text.modules.add_id import AddId

# Create a stage with a custom prefix, allowing overwrite of existing IDs
add_id_stage = AddId(
    id_field="doc_id",
    id_prefix="wiki",
    overwrite=True,
)

# IDs will be in the format "wiki_{uuid}_0", "wiki_{uuid}_1", etc.
output_batch = add_id_stage.process(input_batch)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_DocumentSplitter - Splits documents into segments; requires IDs assigned by AddId
NVIDIA_NeMo_Curator_DocumentJoiner - Joins segments back using document IDs
NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage
