Implementation:NVIDIA NeMo Curator DocumentSplitter
| Knowledge Sources | |
|---|---|
| Domains | Data Curation, Text Processing, Pipeline Stages |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The DocumentSplitter stage splits documents into segments based on a separator string, with each segment becoming a new row in the batch and a segment_id column tracking segment order.
Description
DocumentSplitter is a dataclass-based ProcessingStage that takes a batch of documents and breaks each document's text into multiple segments based on a configurable separator. Each segment becomes its own row in the output DataFrame, with all other columns from the original document replicated for each segment.
The splitting process works in several steps:
- The text column is split using pandas str.split(separator), producing a list of segments per row.
- The resulting lists are exploded into separate rows using pandas explode(), preserving the original index.
- Sequential segment IDs are assigned per original document using groupby(level=0).cumcount().
- The original text column is replaced with the split segment text.
- The temporary column is dropped and the index is reset to sequential.
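The steps above can be sketched with plain pandas. This is a minimal illustration operating on a bare DataFrame rather than a DocumentBatch, not the stage's actual source: the function name split_documents and the temporary column name _split_text are assumptions made for the example.

```python
import pandas as pd


def split_documents(
    df: pd.DataFrame,
    separator: str,
    text_field: str = "text",
    segment_id_field: str = "segment_id",
) -> pd.DataFrame:
    """Illustrative re-creation of the split steps on a plain DataFrame."""
    tmp = "_split_text"  # hypothetical temporary column name
    out = df.copy()
    # 1. Split each text into a list of segments.
    out[tmp] = out[text_field].str.split(separator)
    # 2. One row per segment; the original index value is repeated.
    out = out.explode(tmp)
    # 3. Number the segments within each original row (index level 0).
    out[segment_id_field] = out.groupby(level=0).cumcount()
    # 4. Replace the full text with the segment text.
    out[text_field] = out[tmp]
    # 5. Drop the temporary column and renumber rows sequentially.
    return out.drop(columns=tmp).reset_index(drop=True)


df = pd.DataFrame({"doc_id": ["a", "b"], "text": ["Hello\n\nWorld", "Single"]})
result = split_documents(df, "\n\n")
print(result)
```

Document "a" yields two rows with segment IDs 0 and 1, while document "b", which contains no separator, yields a single row with segment ID 0. The doc_id value is replicated onto every segment row.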
To restore the original document after splitting, ensure each document has a unique ID prior to splitting (see AddId). The DocumentJoiner stage performs the inverse operation.
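To see why a unique document ID is needed, consider a rough pandas sketch of the inverse operation. It is not the DocumentJoiner implementation; the doc_id column and the sample data are assumptions for illustration. Because the index is reset after splitting, the ID column is the only remaining link between a segment and its source document.

```python
import pandas as pd

# Segment rows as they might look after splitting (sample data, assumed).
segments = pd.DataFrame(
    {
        "doc_id": ["a", "a", "b"],
        "text": ["Hello", "World", "Single"],
        "segment_id": [0, 1, 0],
    }
)

# Order segments within each document, then concatenate them again.
rejoined = (
    segments.sort_values(["doc_id", "segment_id"])
    .groupby("doc_id", sort=False)["text"]
    .agg("\n\n".join)
    .reset_index()
)
print(rejoined)
```

Without the doc_id column there would be no reliable key to group on, since segment_id values restart at 0 for every document.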
Usage
Use DocumentSplitter when you need to perform per-segment processing on documents, such as paragraph-level filtering, scoring, or modification. Common patterns include splitting by paragraph ("\n\n") or by line ("\n"), processing each segment independently, and then reassembling with DocumentJoiner.
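As an illustration of per-segment processing, the sketch below filters out very short paragraphs from a DataFrame shaped like the splitter's output. The MIN_CHARS threshold, the doc_id column, and the sample rows are hypothetical; a real pipeline would typically apply a dedicated filter stage instead.

```python
import pandas as pd

# Segment rows in the shape produced by splitting (sample data, assumed).
segments = pd.DataFrame(
    {
        "doc_id": ["a", "a", "a"],
        "text": [
            "Intro paragraph with enough words.",
            "ok",
            "Closing paragraph with enough words.",
        ],
        "segment_id": [0, 1, 2],
    }
)

MIN_CHARS = 10  # hypothetical threshold for keeping a paragraph

# Keep only segments that meet the minimum length.
kept = segments[segments["text"].str.len() >= MIN_CHARS]
print(kept)
```

After filtering, the remaining segment IDs are no longer contiguous (0 and 2 here), but they still preserve the relative order needed to reassemble the document.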
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modules/splitter.py
- Lines: 1-94
Signature
@dataclass
class DocumentSplitter(ProcessingStage[DocumentBatch, DocumentBatch]):
    separator: str
    text_field: str = "text"
    segment_id_field: str = "segment_id"
    name: str = "document_splitter"

    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def process(self, batch: DocumentBatch) -> DocumentBatch: ...
Import
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| separator | str | Yes | The string to split documents on (e.g., "\n\n" for paragraphs) |
| text_field | str | No | Column name containing text to split (default: "text") |
| segment_id_field | str | No | Column name for the assigned segment IDs (default: "segment_id") |
| batch | DocumentBatch | Yes | The input batch of documents to split |
Outputs
| Name | Type | Description |
|---|---|---|
| DocumentBatch | DocumentBatch | A batch where each original document row has been expanded into multiple rows (one per segment), with a segment_id column indicating segment order |
Usage Examples
Split by Paragraph
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
# Split documents into paragraphs (separated by double newlines)
splitter = DocumentSplitter(separator="\n\n")
segmented_batch = splitter.process(input_batch)
# A document with text="Hello\n\nWorld" becomes two rows:
# text="Hello", segment_id=0
# text="World", segment_id=1
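Because the split relies on pandas str.split, two edge cases are worth checking: a trailing separator produces an empty final segment, and a text with no separator stays as a single segment. The sketch below demonstrates this with pandas directly; the sample strings and the follow-up filter on empty segments are assumptions for illustration, not behavior built into the stage.

```python
import pandas as pd

texts = pd.Series(["Hello\n\nWorld\n\n", "No separator here"])

# A trailing separator yields an empty string as the last segment.
parts = texts.str.split("\n\n")
print(parts.tolist())

# After exploding, empty segments can be dropped in a later step if unwanted.
exploded = parts.explode()
non_empty = exploded[exploded.str.strip() != ""]
print(non_empty.tolist())
```

If empty segments are undesirable, removing them after the split (as in the last step) keeps the remaining segments in their original order.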
Split by Custom Separator
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
# Split on a custom separator with a custom text column
splitter = DocumentSplitter(
    separator="---",
    text_field="content",
    segment_id_field="part_id",
)
segmented_batch = splitter.process(input_batch)
Full Split-Process-Join Workflow
from nemo_curator.stages.text.modules.add_id import AddId
from nemo_curator.stages.text.modules.splitter import DocumentSplitter
from nemo_curator.stages.text.modules.joiner import DocumentJoiner
# Step 1: Assign unique IDs
add_id = AddId(id_field="doc_id")
batch_with_ids = add_id.process(input_batch)
# Step 2: Split into segments
splitter = DocumentSplitter(separator="\n\n")
segments = splitter.process(batch_with_ids)
# Step 3: Process segments (e.g., filtering, scoring)
# ... per-segment processing here ...
# Step 4: Rejoin segments
joiner = DocumentJoiner(separator="\n\n", document_id_field="doc_id")
output_batch = joiner.process(segments)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_DocumentJoiner - The inverse operation that joins segments back into documents
- NVIDIA_NeMo_Curator_AddId - Prerequisite stage for assigning document IDs before splitting
- NVIDIA_NeMo_Curator_DocumentBatch - The task type consumed and produced by this stage