Implementation:NVIDIA NeMo Curator DocumentJoiner

From Leeroopedia
Knowledge Sources
Domains: Data Curation, Text Processing, Pipeline Stages
Last Updated: 2026-02-14 00:00 GMT

Overview

The DocumentJoiner stage reconstructs documents from their segments by joining rows that share a common document ID, serving as the inverse operation of DocumentSplitter.

Description

DocumentJoiner is a dataclass-based ProcessingStage that groups document segments by their document_id_field, sorts them by segment_id_field, and concatenates the text using a configurable separator. For non-text columns, the first occurrence within each group is preserved.
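The group, sort, and concatenate behavior described above can be sketched in plain pandas. This is an illustration under assumptions, not the stage's actual source: simple_join is a hypothetical helper, and its parameter names simply mirror the DocumentJoiner field names and defaults.

```python
import pandas as pd


def simple_join(df, text_field="text", segment_id_field="segment_id",
                document_id_field="id", separator="\n\n"):
    """Hypothetical helper: join segment rows into one row per document."""
    # Sort so that text is concatenated in segment order within each document.
    ordered = df.sort_values([document_id_field, segment_id_field])
    # Join the text column; keep the first value of every other column.
    agg = {
        col: (separator.join if col == text_field else "first")
        for col in ordered.columns
        if col != document_id_field
    }
    return ordered.groupby(document_id_field, as_index=False).agg(agg)


segments = pd.DataFrame({
    "id": [1, 1, 2, 1],
    "segment_id": [1, 0, 0, 2],
    "text": ["b", "a", "x", "c"],
    "source": ["web", "web", "news", "web"],
})
joined = simple_join(segments, separator=" ")
print(joined[["id", "text", "source"]])
```

In this sketch the segment ID column survives with its first value per group; the real stage removes it when drop_segment_id_field is True.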

The stage supports two joining modes:

Simple join (no max_length): Groups segments by document ID, sorts by segment ID, joins text fields with the separator using pandas groupby().agg(), and takes the first value for all other columns.

Length-constrained join (with max_length): Uses the _join_segments() method to greedily accumulate segments up to a maximum length. When adding the next segment would exceed max_length (accounting for separator length), the current accumulation is committed as a new joined segment and a fresh accumulation begins. This produces multiple output rows per document when the total content exceeds the length limit.
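The greedy accumulation in the length-constrained mode can be sketched in pure Python. The body of _join_segments() is not reproduced on this page, so this is only a sketch of the behavior as described: greedy_join is a hypothetical function, segments are modeled as (text, length) pairs already sorted by segment ID, and emitting an oversized single segment on its own is an assumption.

```python
def greedy_join(segments, max_length, separator="\n\n"):
    """Hypothetical sketch: segments is a list of (text, length) pairs in segment order."""
    joined = []       # committed output rows as (text, length)
    current = []      # texts in the current accumulation
    current_len = 0   # length of the current accumulation, separators included
    for text, length in segments:
        # Cost of appending this segment: its length plus a separator if not first.
        extra = length + (len(separator) if current else 0)
        if current and current_len + extra > max_length:
            # Adding the segment would exceed the limit: commit and start fresh.
            joined.append((separator.join(current), current_len))
            current, current_len = [text], length
        else:
            # Assumption: a lone segment longer than max_length is kept as-is.
            current.append(text)
            current_len += extra
    if current:
        joined.append((separator.join(current), current_len))
    return joined


parts = [("aaaa", 4), ("bbb", 3), ("cc", 2), ("dddddd", 6)]
rows = greedy_join(parts, max_length=9, separator="-")
print(rows)  # [('aaaa-bbb', 8), ('cc-dddddd', 9)]
```

Here the third segment would push the first accumulation to 11, so "aaaa-bbb" is committed and a new accumulation starts with "cc".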

Validation in __post_init__ requires max_length and length_field to be specified together or omitted together; setting only one of them is rejected. The drop_segment_id_field parameter (default: True) controls whether the segment ID column is removed from the output.
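The both-or-neither rule can be sketched with a small dataclass. JoinerConfig is a hypothetical stand-in, not the real class, and the exception type and message are assumptions; the page does not state what DocumentJoiner actually raises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinerConfig:
    """Hypothetical stand-in for the two length-related fields."""
    max_length: Optional[int] = None
    length_field: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two being set is the invalid case.
        if (self.max_length is None) != (self.length_field is None):
            # ValueError is an assumed choice for this sketch.
            raise ValueError(
                "max_length and length_field must be specified together "
                "or both omitted"
            )


JoinerConfig()                                          # valid: both omitted
JoinerConfig(max_length=4096, length_field="char_count")  # valid: both set
try:
    JoinerConfig(max_length=4096)                       # invalid: only one set
except ValueError as err:
    print(f"rejected: {err}")
```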

Important: All segments belonging to the same document must be present within a single DocumentBatch. Segments from the same document split across multiple batches will not be joined together.
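One way to catch this situation before joining is to check whether any document ID appears in more than one batch. The sketch below is a hypothetical pre-flight check, not part of NeMo Curator: ids_spanning_batches is an invented helper, and each batch is modeled simply as a list of document IDs.

```python
from collections import Counter


def ids_spanning_batches(batches_ids):
    """Hypothetical check: return document IDs that occur in more than one batch."""
    counts = Counter()
    for ids in batches_ids:
        counts.update(set(ids))  # count each ID at most once per batch
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


batch_a = ["d1", "d1", "d2"]   # document IDs of the segments in the first batch
batch_b = ["d2", "d3"]         # document IDs of the segments in the second batch
spanning = ids_spanning_batches([batch_a, batch_b])
print(spanning)  # ['d2']
```

Any ID reported here would be joined into one partial document per batch rather than a single complete document.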

Usage

Use DocumentJoiner after performing segment-level operations (such as per-paragraph filtering, scoring, or modification) to reassemble the processed segments back into complete documents. It is designed to work in tandem with DocumentSplitter. The max_length option is useful when downstream stages require documents to stay within a certain size limit.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modules/joiner.py
  • Lines: 1-193

Signature

@dataclass
class DocumentJoiner(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str = "\n\n"
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    document_id_field: str = "id"
    drop_segment_id_field: bool = True
    max_length: int | None = None
    length_field: str | None = None
    name: str = "document_joiner"

    def __post_init__(self): ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def _join_segments(self, group: pd.DataFrame) -> pd.DataFrame: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

I/O Contract

Inputs

Name | Type | Required | Description
separator | str | No | The string used to join text segments (default: "\n\n")
text_field | str | No | Column name containing text to join (default: "text")
segment_id_field | str | No | Column name containing segment ordering IDs (default: "segment_id")
document_id_field | str | No | Column name containing the document grouping ID (default: "id")
drop_segment_id_field | bool | No | Whether to remove the segment_id column from output (default: True)
max_length | int or None | No | Maximum length for joined documents; requires length_field
length_field | str or None | No | Column name containing segment lengths; requires max_length
batch | DocumentBatch | Yes | Input batch containing document segments to join

Outputs

Name | Type | Description
DocumentBatch | DocumentBatch | A batch with segments joined back into complete documents, with text concatenated by the separator

Usage Examples

Basic Usage

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

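# segmented_batch is assumed to be an existing DocumentBatch whose rows are
# segments with the default "id", "segment_id", and "text" columns.
# Join segments back into documents using a double-newline separator.
joiner = DocumentJoiner(separator="\n\n")
output_batch = joiner.process(segmented_batch)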

With Maximum Length Constraint

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

# Join segments with a max length of 4096 characters
joiner = DocumentJoiner(
    separator="\n\n",
    max_length=4096,
    length_field="char_count",
    document_id_field="doc_id",
    drop_segment_id_field=False,
)
output_batch = joiner.process(segmented_batch)

Split-Process-Join Pattern

from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner

# Split documents into paragraphs
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch)

# ... perform per-segment processing ...

# Rejoin processed segments
joiner = DocumentJoiner(separator="\n\n")
rejoined = joiner.process(segments)
