Implementation:NVIDIA NeMo Curator AddId
| Knowledge Sources | |
|---|---|
| Domains | Data Curation, Text Processing, Pipeline Stages |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The AddId stage assigns a unique identifier string to each document record in a DocumentBatch, so that downstream stages such as splitting and joining can track documents by ID.
Description
AddId is a dataclass-based ProcessingStage that generates deterministic, globally unique identifiers for every document in a batch. Each ID is constructed by combining the batch's UUID with a sequential index, producing the format {prefix}_{uuid}_{i} (when a prefix is specified) or {uuid}_{i} (when no prefix is provided).
The stage is configurable via three parameters: id_field (the column name where the ID will be stored), id_prefix (an optional string prepended to each ID), and overwrite (a boolean controlling whether an existing ID column should be replaced). If the target column already exists and overwrite is False, the stage raises a ValueError. If overwrite is True, the stage logs a warning and replaces the column.
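The overwrite rule can be sketched as a small guard function. This is an illustration of the documented behavior, not the library's code: `check_id_column` is a hypothetical name, the column list stands in for the DataFrame's columns, and the message wording is assumed.

```python
import logging

logger = logging.getLogger(__name__)


def check_id_column(columns: list[str], id_field: str, overwrite: bool) -> None:
    """Raise or warn when id_field is already present in columns."""
    if id_field not in columns:
        return  # nothing to overwrite
    if not overwrite:
        msg = f"Column '{id_field}' already exists; set overwrite=True to replace it"
        raise ValueError(msg)
    logger.warning("Column '%s' already exists and will be overwritten", id_field)


check_id_column(["text"], "doc_id", overwrite=False)           # no conflict
check_id_column(["text", "doc_id"], "doc_id", overwrite=True)  # logs a warning

try:
    check_id_column(["text", "doc_id"], "doc_id", overwrite=False)
except ValueError as err:
    print(f"rejected: {err}")
```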
The stage declares ["data"] as its required input artifact and produces ["data"] with the configured id_field as an additional output column.
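One plausible reading of this contract, given the `tuple[list[str], list[str]]` return type in the signature, is sketched below. `AddIdContractSketch` is a hypothetical stand-in, and the assumption that the second list holds column names (empty for inputs) is an inference from the description rather than something confirmed by the source file.

```python
from dataclasses import dataclass


@dataclass
class AddIdContractSketch:
    """Hypothetical stand-in that mirrors only the inputs()/outputs() declarations."""

    id_field: str

    def inputs(self) -> tuple[list[str], list[str]]:
        # Requires the "data" artifact; no specific input columns are assumed.
        return ["data"], []

    def outputs(self) -> tuple[list[str], list[str]]:
        # Produces "data" plus the configured ID column.
        return ["data"], [self.id_field]


stage = AddIdContractSketch(id_field="doc_id")
print(stage.inputs())
print(stage.outputs())
```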
Usage
Use AddId whenever documents need unique identifiers before downstream processing. This is especially important as a prerequisite for DocumentSplitter and DocumentJoiner, which rely on document IDs to split and reconstruct documents. It is also useful for general traceability and deduplication tracking.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modules/add_id.py
- Lines: 1-82
Signature
@dataclass
class AddId(ProcessingStage[DocumentBatch, DocumentBatch]):
    id_field: str
    id_prefix: str | None = None
    overwrite: bool = False
    name: str = "add_id"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch | None: ...
Import
from nemo_curator.stages.text.modules.add_id import AddId
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| id_field | str | Yes | The column name where the generated ID will be stored |
| id_prefix | str or None | No | An optional prefix prepended to each generated ID |
| overwrite | bool | No | Whether to overwrite an existing ID column (default: False) |
| batch | DocumentBatch | Yes | The input batch of documents to process |
Outputs
| Name | Type | Description |
|---|---|---|
| DocumentBatch | DocumentBatch | A new batch with the ID column added to the DataFrame, containing IDs in the format {prefix}_{uuid}_{index}, or {uuid}_{index} when no prefix is set |
Usage Examples
Basic Usage
from nemo_curator.stages.text.modules.add_id import AddId
# Create a stage that adds IDs to the "doc_id" column
add_id_stage = AddId(id_field="doc_id")
# Process a batch - each document gets a unique ID like "{uuid}_0", "{uuid}_1", etc.
output_batch = add_id_stage.process(input_batch)
With Prefix and Overwrite
from nemo_curator.stages.text.modules.add_id import AddId
# Create a stage with a custom prefix, allowing overwrite of existing IDs
add_id_stage = AddId(
    id_field="doc_id",
    id_prefix="wiki",
    overwrite=True,
)
# IDs will be in the format "wiki_{uuid}_0", "wiki_{uuid}_1", etc.
output_batch = add_id_stage.process(input_batch)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_DocumentSplitter - Splits documents into segments; requires IDs assigned by AddId
- NVIDIA_NeMo_Curator_DocumentJoiner - Joins segments back using document IDs
- NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage