Principle:Ucbepic Docetl Document Splitting

From Leeroopedia


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-08 01:40 GMT

Overview

A text segmentation principle that divides long documents into smaller chunks suitable for LLM context windows, maintaining document identity through unique identifiers and ordering.

Description

Document Splitting partitions a long text field into smaller chunks using either a token-based or a delimiter-based method. Each chunk keeps a link to its parent document through a unique document ID and a sequential chunk number, so downstream operations can group per-chunk results by document and restore their original order.
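The reassembly that this metadata makes possible can be sketched as follows. This is a minimal illustration, not the library's implementation: the field names `doc_id`, `chunk_num`, and `chunk_text` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def reassemble(chunks: list[dict], text_key: str = "chunk_text") -> dict[str, str]:
    """Group chunks by parent document and join them in chunk order.

    Assumes each chunk dict carries the hypothetical keys "doc_id" and
    "chunk_num" in addition to its text under `text_key`.
    """
    grouped: dict[str, list[dict]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["doc_id"]].append(chunk)

    return {
        doc_id: " ".join(
            part[text_key] for part in sorted(parts, key=lambda c: c["chunk_num"])
        )
        for doc_id, parts in grouped.items()
    }
```

Because ordering lives in the chunk number rather than in list position, the chunks can be processed in parallel or arrive out of order and still be restored correctly.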

Two splitting methods are supported:

  • Token-based: Split at fixed token boundaries (e.g., every 2000 tokens)
  • Delimiter-based: Split at natural text boundaries (e.g., paragraphs, sections)
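The two methods can be sketched as below. This is an illustrative approximation under stated assumptions: whitespace-separated words stand in for real model tokens (a production splitter would count tokens with the target model's tokenizer), and the function names are hypothetical.

```python
def split_by_tokens(text: str, num_tokens: int) -> list[str]:
    """Split text into chunks of at most `num_tokens` tokens.

    Assumption: whitespace-separated words approximate tokens.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + num_tokens])
        for start in range(0, len(tokens), num_tokens)
    ]


def split_by_delimiter(text: str, delimiter: str = "\n\n") -> list[str]:
    """Split text at each occurrence of `delimiter`, dropping empty pieces."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

Token-based splitting bounds the size of every chunk but may cut a sentence in half; delimiter-based splitting keeps paragraphs intact but yields chunks of uneven length, some of which may still exceed the context window.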

Usage

Apply this principle when document text exceeds the LLM context window. Choose token-based splitting for uniform chunk sizes or delimiter-based splitting when natural text boundaries should be preserved.

Theoretical Basis

Document splitting preserves document identity by attaching metadata to every chunk. The process has four steps:

  1. Segmentation: Divide the text into chunks by token count or by delimiter
  2. Identity Preservation: Assign a UUID to each source document and record it on every chunk produced from that document
  3. Ordering: Number the chunks sequentially within each document
  4. Metadata Propagation: Copy the original document's fields to each chunk
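The four steps can be put together in a short sketch. It assumes token-based segmentation with whitespace-separated words as a stand-in for real tokens, and the output keys `chunk_text`, `doc_id`, and `chunk_num` are hypothetical names used only for this example. Dropping the full text field from each chunk is likewise a choice made here to avoid duplicating the long text.

```python
import uuid


def split_document(doc: dict, split_key: str, num_tokens: int) -> list[dict]:
    """Split doc[split_key] into chunks that carry identity and order metadata."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    # 1. Segmentation (assumption: whitespace words approximate tokens).
    tokens = doc[split_key].split()

    # 2. Identity preservation: one UUID per source document.
    doc_id = str(uuid.uuid4())

    chunks = []
    starts = range(0, len(tokens), num_tokens)
    # 3. Ordering: chunk numbers start at 1 within each document.
    for chunk_num, start in enumerate(starts, start=1):
        # 4. Metadata propagation: copy every field except the full text.
        chunk = {key: value for key, value in doc.items() if key != split_key}
        chunk["chunk_text"] = " ".join(tokens[start:start + num_tokens])
        chunk["doc_id"] = doc_id
        chunk["chunk_num"] = chunk_num
        chunks.append(chunk)
    return chunks
```

Generating the identifier once per document, outside the loop, is what ties the chunks together: all chunks of one document share the same `doc_id`, while two documents with identical text still receive different ones.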
