Implementation:NVIDIA NeMo Curator DocumentJoiner
| Knowledge Sources | |
|---|---|
| Domains | Data Curation, Text Processing, Pipeline Stages |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The DocumentJoiner stage reconstructs documents from their segments by joining rows that share a common document ID, serving as the inverse operation of DocumentSplitter.
Description
DocumentJoiner is a dataclass-based ProcessingStage that groups document segments by their document_id_field, sorts them by segment_id_field, and concatenates the text using a configurable separator. For non-text columns, the first occurrence within each group is preserved.
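As a rough illustration of the group, sort, and concatenate steps described above, the sketch below reproduces them with plain pandas. The function name simple_join and its keyword arguments are hypothetical stand-ins that mirror the stage's fields. It is a minimal sketch of the technique, not the library's own code.

```python
import pandas as pd


def simple_join(df, text_field="text", segment_id_field="segment_id",
                document_id_field="id", separator="\n\n",
                drop_segment_id_field=True):
    """Hypothetical helper: join segment rows back into one row per document."""
    # Order segments within each document before concatenating.
    ordered = df.sort_values([document_id_field, segment_id_field])

    # Non-text columns keep the first value seen in each group.
    agg = {
        col: "first"
        for col in ordered.columns
        if col not in (text_field, document_id_field)
    }
    # The text column is concatenated with the separator.
    agg[text_field] = lambda s: separator.join(s.astype(str))

    joined = ordered.groupby(document_id_field, as_index=False, sort=False).agg(agg)
    if drop_segment_id_field:
        joined = joined.drop(columns=[segment_id_field])
    return joined
```

Calling simple_join on a frame whose rows arrive out of order still yields text in segment order, because the sort happens before the groupby.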
The stage supports two joining modes:
- Simple join (no max_length): Groups segments by document ID, sorts by segment ID, joins text fields with the separator using pandas groupby().agg(), and takes the first value for all other columns.
- Length-constrained join (with max_length): Uses the _join_segments() method to greedily accumulate segments up to a maximum length. When adding the next segment would exceed max_length (accounting for separator length), the current accumulation is committed as a new joined segment and a fresh accumulation begins. This produces multiple output rows per document when the total content exceeds the length limit.
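The greedy accumulation used by the length-constrained mode can be sketched for a single document's segments as follows. The function greedy_join, its parameters, and the renumbering of segment IDs in the output are assumptions made for illustration. The actual _join_segments() method may differ in its details.

```python
import pandas as pd


def greedy_join(group, max_length, text_field="text", length_field="length",
                segment_id_field="segment_id", separator="\n\n"):
    """Hypothetical helper: pack one document's segments into rows <= max_length."""
    rows = []
    current = None  # row currently being accumulated, kept as a plain dict

    for _, seg in group.sort_values(segment_id_field).iterrows():
        seg_len = int(seg[length_field])
        if current is None:
            # First segment starts the first accumulation.
            current = seg.to_dict()
            current[length_field] = seg_len
            continue

        # Length if this segment were appended, including the separator.
        projected = current[length_field] + len(separator) + seg_len
        if projected > max_length:
            # Commit what has been accumulated and start over with this segment.
            rows.append(current)
            current = seg.to_dict()
            current[length_field] = seg_len
        else:
            current[text_field] = current[text_field] + separator + seg[text_field]
            current[length_field] = projected

    if current is not None:
        rows.append(current)

    out = pd.DataFrame(rows)
    # Assumption of this sketch: joined rows are renumbered from 0.
    out[segment_id_field] = range(len(out))
    return out
```

In this sketch a single segment that is already longer than max_length is emitted unchanged as its own row, since the greedy loop only decides whether to append, never to cut.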
Validation in __post_init__ enforces that max_length and length_field must both be specified or both be omitted. The drop_segment_id_field parameter (default: True) controls whether the segment ID column is removed from the output.
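The both-or-neither rule can be expressed as a single comparison in a dataclass __post_init__. The class JoinerConfig below is a hypothetical stand-in, and the choice of ValueError and the message text are assumptions. Only the rule itself comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinerConfig:
    """Hypothetical stand-in showing only the paired-parameter check."""

    max_length: Optional[int] = None
    length_field: Optional[str] = None

    def __post_init__(self):
        # True when exactly one of the two parameters was provided.
        if (self.max_length is None) != (self.length_field is None):
            raise ValueError(
                "max_length and length_field must both be specified or both be omitted"
            )
```

Constructing JoinerConfig(max_length=4096) without a length_field raises at construction time, so a misconfigured stage fails before any batch is processed.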
Important: All segments belonging to the same document must be present within a single DocumentBatch. Segments from the same document split across multiple batches will not be joined together.
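One way to guard against this is to check, before joining, whether any document ID appears in more than one batch. The helper below is a hypothetical sketch that uses plain pandas DataFrames as stand-ins for the data carried by each batch. It is not part of NeMo Curator.

```python
import pandas as pd


def find_straddling_ids(frames, document_id_field="id"):
    """Hypothetical helper: return document IDs that occur in more than one frame."""
    first_seen = {}  # document ID -> index of the first frame containing it
    straddling = set()
    for index, frame in enumerate(frames):
        for doc_id in frame[document_id_field].unique().tolist():
            if doc_id in first_seen and first_seen[doc_id] != index:
                straddling.add(doc_id)
            first_seen.setdefault(doc_id, index)
    return straddling
```

A non-empty result means at least one document would be joined only partially within each batch, so the data should be regrouped by document ID first.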
Usage
Use DocumentJoiner after performing segment-level operations (such as per-paragraph filtering, scoring, or modification) to reassemble the processed segments back into complete documents. It is designed to work in tandem with DocumentSplitter. The max_length option is useful when downstream stages require documents to stay within a certain size limit.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modules/joiner.py
- Lines: 1-193
Signature
@dataclass
class DocumentJoiner(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str = "\n\n"
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    document_id_field: str = "id"
    drop_segment_id_field: bool = True
    max_length: int | None = None
    length_field: str | None = None
    name: str = "document_joiner"

    def __post_init__(self): ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def _join_segments(self, group: pd.DataFrame) -> pd.DataFrame: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...
Import
from nemo_curator.stages.text.modules.joiner import DocumentJoiner
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| separator | str | No | The string used to join text segments (default: "\n\n") |
| text_field | str | No | Column name containing text to join (default: "text") |
| segment_id_field | str | No | Column name containing segment ordering IDs (default: "segment_id") |
| document_id_field | str | No | Column name containing the document grouping ID (default: "id") |
| drop_segment_id_field | bool | No | Whether to remove the segment_id column from output (default: True) |
| max_length | int or None | No | Maximum length of each joined row, measured in the units of length_field (default: None); requires length_field |
| length_field | str or None | No | Column name containing segment lengths (default: None); requires max_length |
| batch | DocumentBatch | Yes | Input batch containing document segments to join; passed to process() rather than to the constructor |
Outputs
| Name | Type | Description |
|---|---|---|
| DocumentBatch | DocumentBatch | A batch with segments rejoined using the separator: one row per document in simple mode, or several rows per document when max_length is set and the content exceeds it |
Usage Examples
Basic Usage
from nemo_curator.stages.text.modules.joiner import DocumentJoiner
# Join segments back into documents using double-newline separator
joiner = DocumentJoiner(separator="\n\n")
output_batch = joiner.process(segmented_batch)
With Maximum Length Constraint
from nemo_curator.stages.text.modules.joiner import DocumentJoiner
# Join segments into rows of at most 4096 characters each.
# Each segment row must already carry its length in the "char_count" column.
joiner = DocumentJoiner(
    separator="\n\n",
    max_length=4096,
    length_field="char_count",
    document_id_field="doc_id",
    drop_segment_id_field=False,
)
output_batch = joiner.process(segmented_batch)
Split-Process-Join Pattern
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner
# Split documents into paragraphs
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch)
# ... perform per-segment processing ...
# Rejoin processed segments
joiner = DocumentJoiner(separator="\n\n")
rejoined = joiner.process(segments)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_DocumentSplitter - The inverse operation that splits documents into segments
- NVIDIA_NeMo_Curator_AddId - Prerequisite stage for assigning document IDs before splitting
- NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage