Principle:Run llama Llama index Text Chunking

From Leeroopedia
Knowledge Sources
Domains: Data_Preprocessing, RAG, NLP
Last Updated: 2026-02-11 00:00 GMT

Overview

Text chunking (also called text splitting) is the process of dividing large documents into smaller, semantically coherent pieces for embedding and retrieval in RAG systems.

Description

Raw documents often exceed the input limits of embedding models and the context windows of LLMs. Text chunking addresses this by splitting documents into nodes (LlamaIndex's term for document chunks) that:

  • Fit within embedding model token limits
  • Preserve semantic coherence by splitting at natural boundaries (sentences, paragraphs)
  • Maintain optional overlap between consecutive chunks to prevent information loss at boundaries

LlamaIndex provides multiple splitting strategies, each with different tradeoffs:

  • Sentence-aware splitting: Splits at sentence boundaries using NLP tokenizers, preserving complete thoughts. This is the recommended default approach.
  • Fixed-size splitting: Splits at exact token or character counts regardless of content boundaries. Simpler but may break mid-sentence.
  • Semantic splitting: Groups sentences by embedding similarity. Higher quality but more expensive.
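
As a rough illustration of the semantic strategy, the sketch below groups adjacent sentences whose vectors are similar and starts a new group when similarity drops. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real embedding model. The names `semantic_groups` and `threshold` are illustrative and are not LlamaIndex API.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_groups(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new group whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) >= threshold:
            groups[-1].append(cur)   # similar enough: same chunk
        else:
            groups.append([cur])     # topic shift: new chunk
    return groups


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Bond yields rose sharply.",
    "Rising bond yields hurt stocks.",
]
groups = semantic_groups(sentences)
# The two cat sentences form one group, the two bond sentences another.
```

The extra cost of this strategy comes from embedding every sentence before any chunk is produced. A real embedding model would replace `toy_embed` here.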

Usage

Choose a splitting strategy based on your content type and quality requirements. For most use cases, sentence-aware splitting (SentenceSplitter) provides the best balance of quality and performance.

Theoretical Basis

Chunk Size Tradeoffs

The chunk_size parameter controls the maximum size of each chunk. This involves a fundamental tradeoff:

  • Smaller chunks (128-256 tokens): More precise retrieval but may lose surrounding context. Better for fact-based QA.
  • Larger chunks (512-1024 tokens): More context per chunk but less precise retrieval. Better for summarization tasks.

Chunk Overlap

The chunk_overlap parameter controls how many tokens are shared between consecutive chunks:

# Conceptual illustration of overlap
# chunk_size=100, chunk_overlap=20

# Chunk 1: tokens[0:100]
# Chunk 2: tokens[80:180]   <- overlaps with chunk 1 by 20 tokens
# Chunk 3: tokens[160:260]  <- overlaps with chunk 2 by 20 tokens

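Overlap reduces the risk that information spanning a chunk boundary is cut off in both neighboring chunks, because the boundary region appears intact in at least one of them. A typical overlap is 10-20% of chunk size.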
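
The index ranges in the illustration can be reproduced with a short sliding-window sketch. The function name `window_spans` and the 260-token document length are assumptions made for this example. The sketch covers fixed-size windows only, not sentence-aware splitting.

```python
def window_spans(n_tokens: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for fixed-size chunks with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last chunk reached the end of the document
            break
        start += step
    return spans


spans = window_spans(n_tokens=260, chunk_size=100, chunk_overlap=20)
# spans == [(0, 100), (80, 180), (160, 260)]
```

Each chunk advances by `chunk_size - chunk_overlap` tokens, so a larger overlap means more chunks and more duplicated text to embed and store.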

Sentence-Aware Splitting Algorithm

Sentence-aware splitters follow a hierarchical approach:

  1. Split text into sentences using an NLP tokenizer
  2. Combine consecutive sentences into chunks up to the chunk_size limit
  3. If a single sentence exceeds chunk_size, fall back to finer-grained secondary splitting (e.g., by a punctuation regex, then by word separator)
  4. Apply overlap by including trailing sentences from the previous chunk
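
The four steps above can be sketched in plain Python. This is a simplified illustration, not the LlamaIndex implementation. It assumes whitespace-separated words as a proxy for tokens, a naive regex in place of an NLP sentence tokenizer, and a word-window fallback for oversized sentences. All function names are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words stand in for model tokens."""
    return len(text.split())


def split_sentences(text: str) -> list[str]:
    """Naive regex splitter, a stand-in for an NLP sentence tokenizer."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def hard_split(sentence: str, chunk_size: int) -> list[str]:
    """Fallback for an oversized sentence: cut it into word windows."""
    words = sentence.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def sentence_chunks(text: str, chunk_size: int, overlap_sentences: int = 1) -> list[str]:
    # Steps 1 and 3: split into sentences, breaking up any that are too long.
    units: list[str] = []
    for s in split_sentences(text):
        units.extend([s] if count_tokens(s) <= chunk_size else hard_split(s, chunk_size))

    chunks: list[str] = []
    current: list[str] = []
    for unit in units:
        # Step 2: pack sentences until the next one would exceed chunk_size.
        if current and count_tokens(" ".join(current + [unit])) > chunk_size:
            chunks.append(" ".join(current))
            # Step 4: carry trailing sentences over as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            # Drop carried sentences that leave no room for the new unit.
            while current and count_tokens(" ".join(current + [unit])) > chunk_size:
                current.pop(0)
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = sentence_chunks(text, chunk_size=6, overlap_sentences=1)
# chunks == ["One two three. Four five six.",
#            "Four five six. Seven eight nine.",
#            "Seven eight nine. Ten eleven twelve."]
```

Here overlap is measured in whole sentences rather than tokens, so the shared region never cuts a sentence in half. A token-based overlap budget would be an equally reasonable design.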
