Implementation:NVIDIA NeMo Curator DocumentSplitter

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Curation, Text Processing, Pipeline Stages
Last Updated	2026-02-14 00:00 GMT

Overview

The DocumentSplitter stage splits documents into segments based on a separator string, with each segment becoming a new row in the batch and a segment_id column tracking segment order.

Description

DocumentSplitter is a dataclass-based ProcessingStage that takes a batch of documents and breaks each document's text into multiple segments based on a configurable separator. Each segment becomes its own row in the output DataFrame, with all other columns from the original document replicated for each segment.

The splitting process works in several steps:

The text column is split with pandas str.split(separator), producing a list of segments per row that is stored in a temporary column.
The lists are exploded into separate rows with pandas explode(); every segment row keeps the index value of its source document.
Sequential segment IDs, starting at 0, are assigned within each original document using groupby(level=0).cumcount().
The original text column is overwritten with the segment text.
The temporary column is dropped and the index is reset to a sequential range.
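
The steps above can be sketched with plain pandas. This is a minimal illustration rather than the stage's actual source: the function name split_documents and the temporary column name _segments are assumptions made for this example, and it works on a DataFrame directly instead of a DocumentBatch.

```python
import pandas as pd


def split_documents(df, separator, text_field="text", segment_id_field="segment_id"):
    """Pandas-only sketch of the five steps; `_segments` is a hypothetical temp column."""
    df = df.copy()
    # Step 1: split each document's text into a list of segments.
    df["_segments"] = df[text_field].str.split(separator)
    # Step 2: one row per segment; the source row's index value repeats per segment.
    df = df.explode("_segments")
    # Step 3: number the segments within each source document (index level 0).
    df[segment_id_field] = df.groupby(level=0).cumcount()
    # Step 4: overwrite the text column with the segment text.
    df[text_field] = df["_segments"]
    # Step 5: drop the temporary column and reset to a sequential index.
    return df.drop(columns="_segments").reset_index(drop=True)


docs = pd.DataFrame(
    {"text": ["Hello\n\nWorld", "Single paragraph"], "source": ["a", "b"]}
)
segments = split_documents(docs, "\n\n")
print(segments)
```

Note that the source column is copied onto every segment row, which is what later allows segments to be traced back to their document.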

To restore the original document after splitting, ensure each document has a unique ID prior to splitting (see AddId). The DocumentJoiner stage performs the inverse operation.
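
Because the index is reset after splitting, only a replicated ID column ties a segment to its source document. The sketch below shows why that column is enough to reverse the split. It is a pandas illustration of the idea, not DocumentJoiner's implementation; the doc_id column and the join_segments helper are assumed names.

```python
import pandas as pd

# Segments as they might look after splitting. `doc_id` is an assumed ID column
# that was added before the split and therefore copied onto every segment row.
segments = pd.DataFrame(
    {
        "doc_id": ["d0", "d0", "d1", "d1", "d1"],
        "text": ["Hello", "World", "One", "Two", "Three"],
        "segment_id": [0, 1, 0, 1, 2],
    }
)


def join_segments(df, separator, id_field="doc_id",
                  text_field="text", segment_id_field="segment_id"):
    """Illustrative inverse of splitting: order segments, then concatenate per document."""
    ordered = df.sort_values([id_field, segment_id_field])
    joined = ordered.groupby(id_field, sort=False)[text_field].agg(
        lambda parts: separator.join(parts)
    )
    return joined.reset_index()


# Shuffle the rows first to show that doc_id plus segment_id recover the order.
shuffled = segments.sample(frac=1, random_state=0)
restored = join_segments(shuffled, "\n\n")
print(restored)
```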

Usage

Use DocumentSplitter when you need to process documents per segment, for example paragraph-level filtering, scoring, or modification. A common pattern is to split by paragraph ("\n\n") or by line ("\n"), process each segment independently, and then reassemble the documents with DocumentJoiner.
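
As an illustration of per-segment processing, the following sketch filters out short paragraphs from already-split data. It uses plain pandas on a hand-built DataFrame; the doc_id column, the MIN_WORDS threshold, and the word-count heuristic are all assumptions chosen for the example, not part of the stage.

```python
import pandas as pd

# Hand-built segment rows, shaped like the output of a paragraph split.
segments = pd.DataFrame(
    {
        "doc_id": ["d0", "d0", "d0", "d1"],
        "text": [
            "A full paragraph with several words.",
            "ok",
            "Another reasonably long paragraph here.",
            "tiny",
        ],
        "segment_id": [0, 1, 2, 0],
    }
)

MIN_WORDS = 3  # hypothetical threshold for this example

# Count whitespace-separated words in each segment and keep the long ones.
word_counts = segments["text"].str.split().str.len()
kept = segments[word_counts >= MIN_WORDS].reset_index(drop=True)
print(kept)
```

After filtering, the surviving segment IDs may have gaps (here 0 and 2), but their relative order is preserved. A document whose segments are all removed, such as d1 here, has no rows left to rejoin.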

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/modules/splitter.py
Lines: 1-94

Signature

@dataclass
class DocumentSplitter(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    name: str = "document_splitter"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

I/O Contract

Inputs

Name	Type	Required	Description
separator	str	Yes	The string to split documents on (e.g., `"\n\n"` for paragraphs)
text_field	str	No	Column name containing text to split (default: `"text"`)
segment_id_field	str	No	Column name for the assigned segment IDs (default: `"segment_id"`)
batch	DocumentBatch	Yes	The input batch of documents to split

Outputs

Name	Type	Description
DocumentBatch	DocumentBatch	A batch where each original document row has been expanded into multiple rows (one per segment), with a segment_id column indicating segment order
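
The output shape can be checked on a few edge cases. The sketch below assumes the stage follows the pandas steps listed in the Description and reproduces them directly on a DataFrame. The lang column is an arbitrary extra column added to show replication.

```python
import pandas as pd

docs = pd.DataFrame(
    {
        "text": ["a\n\nb", "no separator", "trailing\n\n"],
        "lang": ["en", "en", "de"],
    }
)

# Number of output rows equals the total number of segments across documents.
parts = docs["text"].str.split("\n\n")
expected_rows = int(parts.str.len().sum())

# Same split / explode / cumcount sequence, written compactly.
out = docs.assign(text=parts).explode("text")
out["segment_id"] = out.groupby(level=0).cumcount()
out = out.reset_index(drop=True)
print(out)
```

Two behaviors are worth noting under these assumptions: a document that does not contain the separator yields a single row with segment_id 0, and a trailing separator yields a final empty-string segment.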

Usage Examples

Split by Paragraph

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

# Split documents into paragraphs (separated by double newlines).
# input_batch is assumed to be an existing DocumentBatch with a "text" column.
splitter = DocumentSplitter(separator="\n\n")
segmented_batch = splitter.process(input_batch)

# A document with text="Hello\n\nWorld" becomes two rows:
#   text="Hello", segment_id=0
#   text="World", segment_id=1

Split by Custom Separator

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

# Split on a custom separator, reading text from the "content" column
# and writing segment IDs to a "part_id" column
splitter = DocumentSplitter(
    separator="---",
    text_field="content",
    segment_id_field="part_id",
)
segmented_batch = splitter.process(input_batch)

Full Split-Process-Join Workflow

from nemo_curator.stages.text.modules.add_id import AddId
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner

# Step 1: Assign unique IDs
add_id = AddId(id_field="doc_id")
batch_with_ids = add_id.process(input_batch)

# Step 2: Split into segments
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch_with_ids)

# Step 3: Process segments (e.g., filtering, scoring)
# ... per-segment processing here ...

# Step 4: Rejoin segments
joiner = DocumentJoiner(separator="\n\n", document_id_field="doc_id")
output_batch = joiner.process(segments)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_DocumentJoiner - The inverse operation that joins segments back into documents
NVIDIA_NeMo_Curator_AddId - Prerequisite stage for assigning document IDs before splitting
NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage
