Implementation:NVIDIA NeMo Curator AddId

From Leeroopedia
Knowledge Sources
Domains Data Curation, Text Processing, Pipeline Stages
Last Updated 2026-02-14 00:00 GMT

Overview

The AddId stage assigns a unique identifier string to each document record in a DocumentBatch, so that downstream stages such as splitting and joining can track documents by ID.

Description

AddId is a dataclass-based ProcessingStage that generates deterministic, globally unique identifiers for every document in a batch. Each ID is constructed by combining the batch's UUID with a sequential, zero-based index i, producing the format {prefix}_{uuid}_{i} (when a prefix is specified) or {uuid}_{i} (when no prefix is provided).
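The ID format described above can be illustrated with a minimal sketch in plain Python. This is not the NeMo Curator source: build_ids is a hypothetical helper, and the batch UUID is generated locally with the standard uuid module purely as a stand-in for the UUID carried by a real batch.

```python
import uuid
from typing import Optional


def build_ids(batch_uuid: str, num_docs: int, prefix: Optional[str] = None) -> list:
    """Hypothetical helper: build one ID per document, in row order."""
    # With a prefix the base is "{prefix}_{uuid}", otherwise just "{uuid}".
    base = f"{prefix}_{batch_uuid}" if prefix else batch_uuid
    # Append a zero-based sequential index to the shared base.
    return [f"{base}_{i}" for i in range(num_docs)]


# Stand-in for the batch's UUID (assumption: any UUID string works here).
batch_uuid = str(uuid.uuid4())

ids_without_prefix = build_ids(batch_uuid, 3)
ids_with_prefix = build_ids(batch_uuid, 3, prefix="wiki")
```

Because every ID in a batch shares the same UUID and differs only in its index, IDs are unique within the batch, and uniqueness across batches rests on each batch having its own UUID.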

The stage is configurable via three parameters: id_field (the column name where the ID will be stored), id_prefix (an optional string prepended to each ID), and overwrite (a boolean controlling whether an existing ID column should be replaced). If the target column already exists and overwrite is False, the stage raises a ValueError. If overwrite is True, the stage logs a warning and replaces the column.
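The overwrite rule can be sketched as follows. This is an illustration under stated assumptions, not the library's code: add_id_column is a hypothetical helper, a plain dict of lists stands in for the batch's DataFrame, and the error and warning messages are invented for the example.

```python
import logging

logger = logging.getLogger(__name__)


def add_id_column(columns: dict, id_field: str, ids: list, overwrite: bool = False) -> dict:
    """Hypothetical helper mirroring the described overwrite behavior."""
    if id_field in columns:
        if not overwrite:
            # Existing column and overwrite disabled: refuse to continue.
            raise ValueError(
                f"Column '{id_field}' already exists; set overwrite=True to replace it"
            )
        # Existing column and overwrite enabled: warn, then replace.
        logger.warning("Column '%s' already exists and will be overwritten", id_field)
    # Return a new mapping so the caller's input is left untouched.
    return {**columns, id_field: list(ids)}
```

Raising by default guards against silently discarding IDs that an earlier stage assigned, so replacing them has to be an explicit choice.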

The stage declares ["data"] as its required input artifact and produces ["data"] with the configured id_field as an additional output column.
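As a rough sketch of this declaration, the two methods might return tuples like the ones below. AddIdSketch is a stand-in class written for illustration, and the reading of each tuple as (task attributes, DataFrame columns) is an assumption drawn from the method signatures rather than something this page states.

```python
from dataclasses import dataclass


@dataclass
class AddIdSketch:
    """Stand-in for illustration only; not the real AddId class."""

    id_field: str

    def inputs(self) -> tuple:
        # Assumed layout: (required attributes, required columns).
        # Only the "data" attribute is required; no specific column is.
        return ["data"], []

    def outputs(self) -> tuple:
        # The "data" attribute is produced again, now with the ID column added.
        return ["data"], [self.id_field]


stage = AddIdSketch(id_field="doc_id")
```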

Usage

Use AddId whenever documents need unique identifiers before downstream processing. This is especially important as a prerequisite for DocumentSplitter and DocumentJoiner, which rely on document IDs to split and reconstruct documents. It is also useful for general traceability and deduplication tracking.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modules/add_id.py
  • Lines: 1-82

Signature

@dataclass
class AddId(ProcessingStage[DocumentBatch, DocumentBatch]):
    id_field: str
    id_prefix: str | None = None
    overwrite: bool = False
    name: str = "add_id"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch | None: ...

Import

from nemo_curator.stages.text.modules.add_id import AddId

I/O Contract

Inputs

Name | Type | Required | Description
id_field | str | Yes | The column name where the generated ID will be stored
id_prefix | str or None | No | An optional prefix prepended to each generated ID
overwrite | bool | No | Whether to overwrite an existing ID column (default: False)
batch | DocumentBatch | Yes | The input batch of documents to process

Outputs

Name | Type | Description
DocumentBatch | DocumentBatch | A new batch with the ID column added to the DataFrame, containing IDs in the format {prefix}_{uuid}_{i} when a prefix is set, or {uuid}_{i} otherwise

Usage Examples

Basic Usage

from nemo_curator.stages.text.modules.add_id import AddId

# Create a stage that adds IDs to the "doc_id" column
add_id_stage = AddId(id_field="doc_id")

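# input_batch is assumed to be an existing DocumentBatch, for example one produced by an upstream stage.
# Process the batch: each document gets a unique ID like "{uuid}_0", "{uuid}_1", etc.
output_batch = add_id_stage.process(input_batch)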

With Prefix and Overwrite

from nemo_curator.stages.text.modules.add_id import AddId

# Create a stage with a custom prefix, allowing overwrite of existing IDs
add_id_stage = AddId(
    id_field="doc_id",
    id_prefix="wiki",
    overwrite=True,
)

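# input_batch is assumed to be an existing DocumentBatch.
# If "doc_id" already exists, it is replaced and a warning is logged.
# IDs will be in the format "wiki_{uuid}_0", "wiki_{uuid}_1", etc.
output_batch = add_id_stage.process(input_batch)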
