Implementation:NVIDIA NeMo Curator DocumentSplitter

From Leeroopedia
Knowledge Sources
Domains: Data Curation, Text Processing, Pipeline Stages
Last Updated: 2026-02-14 00:00 GMT

Overview

The DocumentSplitter stage splits documents into segments based on a separator string, with each segment becoming a new row in the batch and a segment_id column tracking segment order.

Description

DocumentSplitter is a dataclass-based ProcessingStage that takes a batch of documents and breaks each document's text into multiple segments based on a configurable separator. Each segment becomes its own row in the output DataFrame, with all other columns from the original document replicated for each segment.

The splitting process works in several steps:

  1. The text column is split using pandas str.split(separator), producing a list of segments per row that is held in a temporary column.
  2. The lists are exploded into separate rows using pandas explode(), so each segment row keeps the index label of its original document.
  3. Sequential segment IDs, starting at 0, are assigned per original document using groupby(level=0).cumcount().
  4. The original text column is overwritten with the segment text from the temporary column.
  5. The temporary column is dropped and the index is reset to a sequential range.
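
The steps above can be sketched in plain pandas. This is a minimal illustration of the described technique, not the library's source: the function name split_documents, the temporary column name _segments, and the sample doc_id column are all assumptions made for the example.

```python
import pandas as pd


def split_documents(df, separator, text_field="text", segment_id_field="segment_id"):
    """Illustrative re-creation of the five steps (hypothetical helper)."""
    out = df.copy()
    # Step 1: split each text into a list of segments, kept in a temporary column.
    out["_segments"] = out[text_field].str.split(separator)
    # Step 2: one row per segment; the original index label is repeated per segment.
    out = out.explode("_segments")
    # Step 3: number the segments within each original document (0, 1, 2, ...).
    out[segment_id_field] = out.groupby(level=0).cumcount()
    # Step 4: overwrite the text column with the segment text.
    out[text_field] = out["_segments"]
    # Step 5: drop the temporary column and reset to a sequential index.
    return out.drop(columns="_segments").reset_index(drop=True)


docs = pd.DataFrame({"doc_id": ["a", "b"], "text": ["Hello\n\nWorld", "Solo"]})
segments = split_documents(docs, "\n\n")
print(segments)
```

A document that does not contain the separator yields a single segment with ID 0, and every non-text column (doc_id here) is repeated on each segment row.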

To restore the original document after splitting, ensure each document has a unique ID prior to splitting (see AddId). The DocumentJoiner stage performs the inverse operation.
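
The unique ID matters because the index is reset after splitting, so nothing else records which segments came from the same document. The sketch below shows the idea behind rejoining in plain pandas. It is an assumption-laden illustration, not the DocumentJoiner implementation: join_segments is a hypothetical helper and doc_id is an example column name.

```python
import pandas as pd

# Segment rows as they might look after splitting (example data).
segments = pd.DataFrame(
    {
        "doc_id": ["a", "a", "b"],
        "text": ["Hello", "World", "Solo"],
        "segment_id": [0, 1, 0],
    }
)


def join_segments(df, separator, document_id_field="doc_id",
                  text_field="text", segment_id_field="segment_id"):
    """Concatenate segments per document in segment order (hypothetical helper)."""
    # Order segments within each document before concatenating them.
    ordered = df.sort_values([document_id_field, segment_id_field])
    return (
        ordered.groupby(document_id_field, sort=False)[text_field]
        .agg(lambda parts: separator.join(parts))
        .reset_index()
    )


restored = join_segments(segments, "\n\n")
print(restored)
```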

Usage

Use DocumentSplitter when you need to process documents segment by segment, for example to filter, score, or modify individual paragraphs. A common pattern is to split by paragraph ("\n\n") or by line ("\n"), process each segment independently, and then reassemble the documents with DocumentJoiner.
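
As a rough illustration of per-segment processing, the following pandas sketch drops short paragraphs from an already split table. The MIN_CHARS threshold, the doc_id column, and the sample rows are assumptions made for the example and are not part of NeMo Curator.

```python
import pandas as pd

# Segment rows as they might look after splitting by paragraph (example data).
segments = pd.DataFrame(
    {
        "doc_id": ["a", "a", "a", "b"],
        "text": [
            "A long enough opening paragraph.",
            "ok",
            "Another paragraph worth keeping.",
            "Short",
        ],
        "segment_id": [0, 1, 2, 0],
    }
)

MIN_CHARS = 10  # hypothetical threshold for this sketch

# Keep only segments whose text has at least MIN_CHARS characters.
kept = segments[segments["text"].str.len() >= MIN_CHARS].reset_index(drop=True)
print(kept)
```

Surviving segments keep their original segment IDs (0 and 2 here), so their relative order is still known when they are rejoined. A document whose segments are all filtered out, such as "b" in this example, has no rows left to rejoin.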

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modules/splitter.py
  • Lines: 1-94

Signature

@dataclass
class DocumentSplitter(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    name: str = "document_splitter"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

I/O Contract

Inputs

Name | Type | Required | Description
separator | str | Yes | The string to split documents on (e.g., "\n\n" for paragraphs)
text_field | str | No | Column name containing text to split (default: "text")
segment_id_field | str | No | Column name for the assigned segment IDs (default: "segment_id")
batch | DocumentBatch | Yes | The input batch of documents to split

Outputs

Name | Type | Description
DocumentBatch | DocumentBatch | A batch where each original document row has been expanded into multiple rows (one per segment), with a segment_id column indicating segment order

Usage Examples

Split by Paragraph

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

# Split documents into paragraphs (separated by double newlines).
# input_batch is assumed to be an existing DocumentBatch with a "text" column.
splitter = DocumentSplitter(separator="\n\n")
segmented_batch = splitter.process(input_batch)

# A document with text="Hello\n\nWorld" becomes two rows:
#   text="Hello", segment_id=0
#   text="World", segment_id=1

Split by Custom Separator

from nemo_curator.stages.text.modules.splitter import DocumentSplitter

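# Split on "---", reading text from the "content" column and
# writing segment IDs to a "part_id" column.
# input_batch is assumed to be an existing DocumentBatch with a "content" column.
splitter = DocumentSplitter(
    separator="---",
    text_field="content",
    segment_id_field="part_id",
)
segmented_batch = splitter.process(input_batch)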

Full Split-Process-Join Workflow

from nemo_curator.stages.text.modules.add_id import AddId
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner

# Step 1: Assign unique IDs
add_id = AddId(id_field="doc_id")
batch_with_ids = add_id.process(input_batch)

# Step 2: Split into segments
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch_with_ids)

# Step 3: Process segments (e.g., filtering, scoring)
# ... per-segment processing here ...

# Step 4: Rejoin segments
joiner = DocumentJoiner(separator="\n\n", document_id_field="doc_id")
output_batch = joiner.process(segments)
