Principle:Run llama Llama index Text Chunking
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, RAG, NLP |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Text chunking (also called text splitting) is the process of dividing large documents into smaller, semantically coherent pieces for embedding and retrieval in RAG systems.
Description
Raw documents are typically too long for embedding models and LLM context windows. Text chunking addresses this by splitting documents into nodes (LlamaIndex's term for document chunks) that:
- Fit within embedding model token limits
- Preserve semantic coherence by splitting at natural boundaries (sentences, paragraphs)
- Maintain optional overlap between consecutive chunks to prevent information loss at boundaries
LlamaIndex provides multiple splitting strategies, each with different tradeoffs:
- Sentence-aware splitting: Splits at sentence boundaries using NLP tokenizers, preserving complete thoughts. This is the recommended default approach.
- Fixed-size splitting: Splits at exact token or character counts regardless of content boundaries. Simpler but may break mid-sentence.
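- Semantic splitting: Groups adjacent sentences by embedding similarity. Chunks tend to be more topically coherent, but every sentence (or sentence group) must be embedded at split time, which makes this slower and more costly.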
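To make the semantic strategy concrete, here is a minimal sketch of grouping adjacent sentences by similarity. It is not LlamaIndex's implementation: `toy_embed`, `cosine`, `semantic_groups`, and the `threshold` value are hypothetical names chosen for illustration, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    # Assumption: word counts stand in for a real embedding vector.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_groups(sentences, threshold=0.15):
    """Start a new group wherever adjacent sentences are less similar than threshold."""
    groups = []
    for sentence in sentences:
        if groups and cosine(toy_embed(groups[-1][-1]), toy_embed(sentence)) >= threshold:
            groups[-1].append(sentence)  # similar enough: extend the current group
        else:
            groups.append([sentence])  # similarity dropped: open a new group
    return groups


sentences = [
    "Cats sleep most of the day.",
    "Cats hunt at night.",
    "Stock markets fell sharply.",
    "Markets recovered the next morning.",
]
for group in semantic_groups(sentences):
    print(group)
```

With a real embedding model the comparison would capture meaning rather than shared words, and the threshold would usually be tuned on sample documents rather than fixed by hand.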
Usage
Choose a splitting strategy based on your content type and quality requirements. For most use cases, sentence-aware splitting (SentenceSplitter) provides the best balance of quality and performance.
Theoretical Basis
Chunk Size Tradeoffs
The chunk_size parameter controls the maximum size of each chunk. Choosing its value involves a tradeoff between retrieval precision and the amount of context each chunk carries:
- Smaller chunks (128-256 tokens): More precise retrieval but may lose surrounding context. Better for fact-based QA.
- Larger chunks (512-1024 tokens): More context per chunk but less precise retrieval. Better for summarization tasks.
Chunk Overlap
The chunk_overlap parameter controls how many tokens are shared between consecutive chunks, so each new chunk starts chunk_size - chunk_overlap tokens after the previous one:
# Conceptual illustration of overlap
# chunk_size=100, chunk_overlap=20
# Chunk 1: tokens[0:100]
# Chunk 2: tokens[80:180] <- overlaps with chunk 1 by 20 tokens
# Chunk 3: tokens[160:260] <- overlaps with chunk 2 by 20 tokens
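Overlap reduces the risk that a fact straddling a chunk boundary is cut in two and becomes hard to retrieve from either chunk. It does not eliminate that risk, and it increases the total number of tokens stored and embedded. A typical overlap is 10-20% of chunk size.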
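The window arithmetic in the illustration above can be checked with a short sketch. `chunk_spans` is a hypothetical helper written for this page, not a LlamaIndex function; it assumes purely fixed-size windows over token indices and ignores sentence boundaries.

```python
def chunk_spans(n_tokens, chunk_size, chunk_overlap):
    """Return (start, end) token index pairs for fixed-size chunks with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end of the text
        start += stride
    return spans


print(chunk_spans(260, chunk_size=100, chunk_overlap=20))
# [(0, 100), (80, 180), (160, 260)]
```

The guard on chunk_overlap matters: if the overlap were equal to or larger than the chunk size, the stride would be zero or negative and the loop would never advance.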
Sentence-Aware Splitting Algorithm
Sentence-aware splitters follow a hierarchical approach:
- Split text into sentences using an NLP tokenizer
- Combine consecutive sentences into chunks up to the chunk_size limit
- If a single sentence exceeds chunk_size, fall back to finer-grained secondary splitting (e.g., by a secondary regex, then by individual words)
- Apply overlap by including trailing sentences from the previous chunk
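The approach listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the SentenceSplitter source: whitespace-separated words stand in for model tokens, a simple regex stands in for an NLP sentence tokenizer, and all function names are hypothetical.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_sentences(text):
    # Naive stand-in for an NLP sentence tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_long_sentence(sentence, chunk_size):
    # Fallback for oversized sentences: cut into windows of chunk_size words.
    words = sentence.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def sentence_chunks(text, chunk_size, chunk_overlap):
    # Build sentence units, pre-splitting any sentence larger than chunk_size.
    units = []
    for sentence in split_sentences(text):
        if count_tokens(sentence) > chunk_size:
            units.extend(split_long_sentence(sentence, chunk_size))
        else:
            units.append(sentence)

    chunks, current, current_len = [], [], 0
    for unit in units:
        n = count_tokens(unit)
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing units into the next chunk as overlap, as long as
            # they fit in chunk_overlap and still leave room for the new unit.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if carried_len + m > chunk_overlap or carried_len + m + n > chunk_size:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(unit)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in sentence_chunks(text, chunk_size=6, chunk_overlap=3):
    print(chunk)
```

Because overlap is applied in whole sentences, the number of shared tokens can be smaller than chunk_overlap: a trailing sentence is carried over only if it fits entirely within the overlap budget.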