Implementation:NVIDIA NeMo Curator DocumentJoiner

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Curation, Text Processing, Pipeline Stages
Last Updated	2026-02-14 00:00 GMT

Overview

The DocumentJoiner stage reconstructs documents from their segments by joining rows that share a common document ID, serving as the inverse operation of DocumentSplitter.

Description

DocumentJoiner is a dataclass-based ProcessingStage that groups document segments by their document_id_field, sorts them by segment_id_field, and concatenates the text using a configurable separator. For non-text columns, the first occurrence within each group is preserved.
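The group, sort, and concatenate steps described above can be sketched with plain pandas. This is a minimal illustration rather than the library's code: the helper name simple_join and the sample "source" column are hypothetical, and only the default field names are taken from this page.

```python
import pandas as pd


def simple_join(df, separator="\n\n", text_field="text",
                segment_id_field="segment_id", document_id_field="id"):
    """Illustrative stand-in for the join without a length limit."""
    # Order rows so each document's segments are concatenated in sequence.
    ordered = df.sort_values([document_id_field, segment_id_field])
    # Non-text columns keep their first value within each document group.
    agg = {
        col: "first"
        for col in ordered.columns
        if col not in (document_id_field, text_field)
    }
    agg[text_field] = lambda s: separator.join(s)
    joined = ordered.groupby(document_id_field, as_index=False).agg(agg)
    # Mirror drop_segment_id_field=True by removing the ordering column.
    return joined.drop(columns=[segment_id_field])


# Hypothetical sample data: two documents, two segments each, out of order.
segments = pd.DataFrame(
    {
        "id": [1, 0, 0, 1],
        "segment_id": [1, 1, 0, 0],
        "text": ["d", "b", "a", "c"],
        "source": ["y", "x", "x", "y"],
    }
)
joined = simple_join(segments)
```

Sorting before grouping is what makes the concatenation order depend on the segment ID rather than on the row order of the incoming batch.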

The stage supports two joining modes:

Simple join (no max_length): Groups segments by document ID, sorts by segment ID, joins text fields with the separator using pandas groupby().agg(), and takes the first value for all other columns.

Length-constrained join (with max_length): Uses the _join_segments() method to greedily accumulate segments up to a maximum length. When adding the next segment would exceed max_length (accounting for separator length), the current accumulation is committed as a new joined segment and a fresh accumulation begins. This produces multiple output rows per document when the total content exceeds the length limit.
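The greedy accumulation used by the length-constrained mode can be sketched as follows. This is a simplified illustration on plain lists, not the body of _join_segments(): the function name greedy_join is hypothetical, and emitting a single oversized segment on its own is an assumption about edge-case behavior.

```python
def greedy_join(texts, lengths, separator, max_length):
    """Greedily pack ordered segments into joined rows of bounded length."""
    joined = []
    current, current_len = [], 0
    for text, length in zip(texts, lengths):
        # Every segment after the first in a row also costs one separator.
        extra = length if not current else length + len(separator)
        if current and current_len + extra > max_length:
            # Adding this segment would overflow: commit and start afresh.
            joined.append(separator.join(current))
            current, current_len = [text], length
        else:
            current.append(text)
            current_len += extra
    if current:
        # Commit whatever is left once the segments run out.
        joined.append(separator.join(current))
    return joined


rows = greedy_join(["aaaa", "bbbb", "cc", "dddddd"], [4, 4, 2, 6], "--", 10)
```

With these inputs the first two segments fill one row exactly (4 + 2 + 4 = 10), so the third segment opens a second row, which is why one document can yield several output rows.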

Validation in __post_init__ enforces that max_length and length_field must both be specified or both be omitted. The drop_segment_id_field parameter (default: True) controls whether the segment ID column is removed from the output.
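The both-or-neither rule can be expressed compactly in a dataclass. The sketch below uses a hypothetical class name, and the exception type and message are assumptions rather than what the library necessarily raises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinerOptionsSketch:
    """Hypothetical stand-in showing only the paired-parameter check."""

    max_length: Optional[int] = None
    length_field: Optional[str] = None

    def __post_init__(self):
        # The two options are valid only when both are set or both are None.
        if (self.max_length is None) != (self.length_field is None):
            # ValueError is an assumed choice of exception type.
            msg = "max_length and length_field must be set together"
            raise ValueError(msg)


ok_plain = JoinerOptionsSketch()
ok_limited = JoinerOptionsSketch(max_length=4096, length_field="char_count")
```

Checking in __post_init__ means a misconfigured stage fails at construction time instead of partway through a pipeline run.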

Important: All segments belonging to the same document must be present within a single DocumentBatch. Segments from the same document split across multiple batches will not be joined together.
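The effect of this constraint can be seen in a small pure-Python sketch, where each list of tuples stands in for one batch. The helper join_rows and the sample IDs are hypothetical and only imitate per-batch joining.

```python
from collections import defaultdict


def join_rows(rows, separator="\n\n"):
    """Join (doc_id, segment_id, text) tuples that arrive in one batch."""
    grouped = defaultdict(list)
    for doc_id, segment_id, text in rows:
        grouped[doc_id].append((segment_id, text))
    # Sort each document's parts by segment ID before concatenating.
    return {
        doc_id: separator.join(text for _, text in sorted(parts))
        for doc_id, parts in grouped.items()
    }


# The same document's segments arrive in two separate batches.
batch_a = [("doc-0", 0, "a"), ("doc-0", 1, "b")]
batch_b = [("doc-0", 2, "c")]

# Joining per batch leaves two partial rows for doc-0.
per_batch = [join_rows(batch_a), join_rows(batch_b)]
# Only when all segments share a batch is the document fully rebuilt.
single_batch = join_rows(batch_a + batch_b)
```

In practice this means any repartitioning between the split and join steps needs to keep each document's segments together.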

Usage

Use DocumentJoiner after performing segment-level operations (such as per-paragraph filtering, scoring, or modification) to reassemble the processed segments back into complete documents. It is designed to work in tandem with DocumentSplitter. The max_length option is useful when downstream stages require documents to stay within a certain size limit.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/modules/joiner.py
Lines: 1-193

Signature

@dataclass
class DocumentJoiner(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str = "\n\n"
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    document_id_field: str = "id"
    drop_segment_id_field: bool = True
    max_length: int | None = None
    length_field: str | None = None
    name: str = "document_joiner"

    def __post_init__(self): ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def _join_segments(self, group: pd.DataFrame) -> pd.DataFrame: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

I/O Contract

Inputs

Name	Type	Required	Description
separator	str	No	The string used to join text segments (default: `"\n\n"`)
text_field	str	No	Column name containing text to join (default: `"text"`)
segment_id_field	str	No	Column name containing segment ordering IDs (default: `"segment_id"`)
document_id_field	str	No	Column name containing the document grouping ID (default: `"id"`)
drop_segment_id_field	bool	No	Whether to remove the segment ID column from the output (default: `True`)
max_length	int or None	No	Maximum length of each joined row, measured in the units of length_field (default: `None`); requires length_field
length_field	str or None	No	Column name containing per-segment lengths (default: `None`); requires max_length
batch	DocumentBatch	Yes	Input batch containing document segments to join

Outputs

Name	Type	Description
DocumentBatch	DocumentBatch	A batch in which segments sharing a document ID are joined back into documents, with text concatenated by the separator; when max_length is set, a document may span several output rows

Usage Examples

Basic Usage

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

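# segmented_batch is assumed to be an existing DocumentBatch whose rows
# carry the default "id" and "segment_id" columns.
# Join segments back into documents using a double-newline separator
joiner = DocumentJoiner(separator="\n\n")
output_batch = joiner.process(segmented_batch)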

With Maximum Length Constraint

from nemo_curator.stages.text.modules.joiner import DocumentJoiner

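# Join segments into rows of at most 4096 characters each.
# Assumes segmented_batch has a "doc_id" column and a "char_count" column
# holding each segment's length in characters.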
joiner = DocumentJoiner(
    separator="\n\n",
    max_length=4096,
    length_field="char_count",
    document_id_field="doc_id",
    drop_segment_id_field=False,
)
output_batch = joiner.process(segmented_batch)

Split-Process-Join Pattern

from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner

# batch is assumed to be an existing DocumentBatch that already has an "id" column.
# Split documents into paragraphs
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch)

# ... perform per-segment processing ...

# Rejoin processed segments
joiner = DocumentJoiner(separator="\n\n")
rejoined = joiner.process(segments)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_DocumentSplitter - The inverse operation that splits documents into segments
NVIDIA_NeMo_Curator_AddId - Prerequisite stage for assigning document IDs before splitting
NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage
