Principle:Ucbepic Docetl Document Splitting
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A text segmentation principle that divides long documents into smaller chunks suitable for LLM context windows, maintaining document identity through unique identifiers and ordering.
Description
Document Splitting partitions long text fields into manageable chunks using either token-based or delimiter-based methods. Each chunk preserves a link to its parent document through a unique document ID and sequential chunk numbering, enabling downstream operations to reassemble results.
Two splitting methods are supported:
- Token-based: Split at fixed token boundaries (e.g., every 2000 tokens)
- Delimiter-based: Split at natural text boundaries (e.g., paragraphs, sections)
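The two methods can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names `split_by_tokens` and `split_by_delimiter` are hypothetical, and whitespace-separated words stand in for real tokenizer tokens, so the counts will differ from those of an actual LLM tokenizer.

```python
def split_by_tokens(text: str, num_tokens: int) -> list[str]:
    """Split text into chunks of at most `num_tokens` tokens.

    Assumption: a "token" here is a whitespace-separated word, used as a
    stand-in for a real model tokenizer.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + num_tokens])
        for i in range(0, len(tokens), num_tokens)
    ]


def split_by_delimiter(text: str, delimiter: str) -> list[str]:
    """Split text at each occurrence of `delimiter`, dropping empty pieces."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

Token-based chunks have a bounded size but may cut a sentence in half; delimiter-based chunks follow the text's structure but can vary widely in length.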
Usage
Apply this principle when document text exceeds the LLM context window. Choose token-based splitting for uniform chunk sizes or delimiter-based splitting when natural text boundaries should be preserved.
Theoretical Basis
Document splitting combines segmentation with three metadata steps that preserve document identity:
- Segmentation: Divide text into chunks by tokens or delimiters
- Identity Preservation: Assign a UUID to each source document
- Ordering: Number chunks sequentially within each document
- Metadata Propagation: Copy original document fields to each chunk
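The four steps above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: the function names and the output field names (`doc_id`, `chunk_num`, and the `_chunk` suffix) are hypothetical, chunk numbering is assumed to start at 1, and the original long text field is left out of each chunk record to avoid duplicating it.

```python
import uuid
from typing import Any, Callable


def split_document(
    doc: dict[str, Any],
    split_key: str,
    splitter: Callable[[str], list[str]],
) -> list[dict[str, Any]]:
    """Turn one document into a list of chunk records."""
    doc_id = str(uuid.uuid4())          # identity preservation
    pieces = splitter(doc[split_key])   # segmentation
    # Metadata propagation: every field except the long text itself
    # (hypothetical choice made for this sketch).
    metadata = {k: v for k, v in doc.items() if k != split_key}

    chunks = []
    for chunk_num, piece in enumerate(pieces, start=1):  # ordering
        record = dict(metadata)
        record[f"{split_key}_chunk"] = piece
        record["doc_id"] = doc_id
        record["chunk_num"] = chunk_num
        chunks.append(record)
    return chunks


def reassemble(chunks: list[dict[str, Any]], split_key: str) -> dict[str, str]:
    """Rebuild each document's text from its chunks, keyed by doc_id."""
    grouped: dict[str, list[str]] = {}
    for record in sorted(chunks, key=lambda r: r["chunk_num"]):
        grouped.setdefault(record["doc_id"], []).append(record[f"{split_key}_chunk"])
    return {doc_id: " ".join(parts) for doc_id, parts in grouped.items()}
```

Because every chunk carries both the document ID and its position, a downstream step can process chunks in any order and still group and sort them back into the original sequence.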