
Principle:Marker Inc Korea AutoRAG Document Chunking

From Leeroopedia
Knowledge Sources
Domains Natural Language Processing, Information Retrieval, Text Segmentation
Last Updated 2026-02-12 00:00 GMT

Overview

Document chunking is the process of splitting parsed documents into smaller, semantically coherent passages that are suitable for indexing and retrieval in a RAG pipeline.

Description

After documents have been parsed into raw text, the resulting content is typically too long to serve as effective retrieval units. Chunking addresses this by dividing documents into passages of manageable size while attempting to preserve semantic coherence. The choice of chunking strategy and its hyperparameters directly affects retrieval quality: chunks that are too large dilute relevance signals, while chunks that are too small lose contextual meaning.

Several established chunking strategies are commonly used. Token-based chunking splits text at fixed token counts, offering simplicity and predictability. Sentence-based chunking respects sentence boundaries, preserving grammatical completeness. Recursive character chunking attempts to split at progressively smaller structural boundaries (paragraphs, then sentences, then words) until the desired size is achieved. Semantic chunking uses embedding similarity to detect topic shifts and places chunk boundaries at points where the content changes significantly. Each approach offers different trade-offs between computational cost, boundary quality, and consistency of chunk sizes.
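As an illustration of the boundary-respecting approach, sentence-based chunking can be sketched in a few lines of Python. The regex sentence splitter and greedy packing below are a simplification for illustration, not any particular library's implementation:

```python
import re

def sentence_chunks(text, max_chars):
    """Sentence-based chunking sketch: split on sentence-ending punctuation,
    then greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        # start a new chunk if adding this sentence would exceed the budget
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks
```

Because boundaries always fall between sentences, each chunk is grammatically complete, at the cost of more variable chunk lengths than token-based splitting.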

Two critical hyperparameters govern chunking behavior: chunk size (the target number of tokens or characters per chunk) and overlap (the number of tokens or characters shared between consecutive chunks). Overlap ensures that information near chunk boundaries is not lost, improving retrieval recall at the cost of increased index size. The optimal values for these parameters depend on the nature of the documents, the embedding model being used, and the downstream retrieval method.
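The interaction of the two hyperparameters is easiest to see in a fixed-size sliding window. This is a minimal sketch; `sliding_chunks` is an illustrative helper, not an AutoRAG function:

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Fixed-size chunking with overlap: each window advances by
    chunk_size - overlap, so consecutive chunks share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# chunk_size=5, overlap=2 -> windows advance by 3 tokens
chunks = sliding_chunks(list(range(12)), chunk_size=5, overlap=2)
# -> [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9,10], [9,10,11]]
```

Note the trade-off in the output: every boundary token appears in two chunks (improving recall for boundary-straddling facts), but the index stores roughly `chunk_size / (chunk_size - overlap)` times the original text.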

Usage

Document chunking is applied as the second step of the evaluation data creation workflow, immediately after parsing. The chunker accepts a parsed DataFrame (with columns for texts, path, page, and last_modified_datetime) and produces a chunked DataFrame with unique document IDs, passage contents, source paths, start/end character indices, and metadata. The configuration is provided via YAML, allowing multiple chunking strategies to be tested.
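The multi-strategy sweep can be pictured as iterating a set of chunker settings over the same parsed rows. The configuration field names and strategy labels below are illustrative assumptions, not AutoRAG's actual YAML schema:

```python
# Illustrative only: keys and strategy names are assumptions,
# not AutoRAG's configuration format.
chunk_configs = [
    {"strategy": "token", "chunk_size": 256, "chunk_overlap": 32},
    {"strategy": "recursive_character", "chunk_size": 512, "chunk_overlap": 64},
]

def run_sweep(parsed_rows, chunkers, configs):
    """Apply each configured strategy to the same parsed rows, yielding one
    chunked corpus per configuration for side-by-side evaluation."""
    results = {}
    for cfg in configs:
        chunk_fn = chunkers[cfg["strategy"]]
        results[cfg["strategy"]] = [
            chunk
            for row in parsed_rows
            for chunk in chunk_fn(row["texts"], cfg["chunk_size"], cfg["chunk_overlap"])
        ]
    return results
```

Running every configuration against identical input is what makes the resulting corpora comparable when a downstream metric picks the winner.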

Theoretical Basis

The generic chunking algorithm can be expressed as follows:

INPUT:  Parsed DataFrame R with columns (texts, path, page, last_modified_datetime)
OUTPUT: Chunked DataFrame C with columns (doc_id, contents, path, start_end_idx, metadata)

For each row r_i in R:
    text = r_i.texts
    chunks = ChunkStrategy(text, chunk_size, overlap)
    For each chunk c_j in chunks:
        doc_id = generate_uuid()
        contents = c_j.text
        path = r_i.path
        start_end_idx = (c_j.start_char, c_j.end_char)  # relative to original text
        metadata = {"page": r_i.page, ...}
        Append row to C
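The loop above can be written as runnable Python, with a simple character window standing in for ChunkStrategy and plain dicts standing in for DataFrame rows (a sketch of the schema, not AutoRAG's implementation):

```python
import uuid

def chunk_rows(rows, chunk_size, overlap):
    """Generic chunking loop: one output record per chunk, carrying a fresh
    doc_id, the source path, character indices, and page metadata."""
    step = chunk_size - overlap
    out = []
    for r in rows:
        text = r["texts"]
        for start in range(0, len(text), step):
            end = min(start + chunk_size, len(text))
            out.append({
                "doc_id": str(uuid.uuid4()),
                "contents": text[start:end],
                "path": r["path"],
                "start_end_idx": (start, end),  # relative to the original text
                "metadata": {
                    "page": r["page"],
                    "last_modified_datetime": r["last_modified_datetime"],
                },
            })
            if end == len(text):
                break
    return out
```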

Recursive Character Splitting is one of the most commonly used strategies:

function RecursiveCharSplit(text, separators, chunk_size, overlap):
    if length(text) <= chunk_size:
        return [text]
    sep = first separator in separators that occurs in text
    segments = split(text, sep), keeping the separator attached to each segment
    chunks = []
    current = ""
    for segment in segments:
        if length(segment) > chunk_size:
            # segment is still too large: recurse with the remaining, finer separators
            if current is non-empty: chunks.append(current); current = ""
            chunks += RecursiveCharSplit(segment, remaining separators, chunk_size, overlap)
        else if length(current + segment) > chunk_size:
            chunks.append(current)
            current = (last 'overlap' characters of current) + segment
        else:
            current += segment
    if current is non-empty:
        chunks.append(current)
    return chunks
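A runnable Python version of recursive character splitting (an illustrative sketch, not a specific library's implementation), including recursion into finer separators for oversized segments and a hard character cut as a last resort:

```python
def recursive_char_split(text, separators, chunk_size, overlap):
    """Split text at the coarsest available separator, recursing to finer
    separators for segments that still exceed chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    # pick the first separator present; none found means a hard character cut
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    segments = [seg + sep for seg in text.split(sep)]
    segments[-1] = segments[-1][:-len(sep)]  # last segment has no trailing separator
    chunks, current = [], ""
    for seg in segments:
        if len(seg) > chunk_size:
            # segment still too large: recurse with the remaining, finer separators
            if current:
                chunks.append(current)
                current = ""
            rest = separators[separators.index(sep) + 1:]
            chunks.extend(recursive_char_split(seg, rest, chunk_size, overlap))
        elif len(current) + len(seg) > chunk_size:
            chunks.append(current)
            current = current[-overlap:] + seg  # carry overlap into the next chunk
        else:
            current += seg
    if current:
        chunks.append(current)
    return chunks
```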

Chunk quality metrics commonly considered include:

Metric               | Description
Average chunk length | Mean number of tokens per chunk; should match the retrieval model's optimal input length
Length variance      | Low variance indicates consistent chunk sizes
Boundary quality     | Fraction of chunk boundaries that align with sentence or paragraph boundaries
Overlap ratio        | Proportion of duplicated content across adjacent chunks
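Under the assumption that tokens are approximated by whitespace-separated words, the length-based metrics and the overlap ratio can be computed as follows (a sketch; the function names are illustrative):

```python
import statistics

def chunk_length_stats(chunks):
    """Average chunk length and length variance over a list of chunk strings,
    with tokens approximated by whitespace splitting."""
    lengths = [len(c.split()) for c in chunks]
    return {
        "avg_length": statistics.mean(lengths),
        "length_variance": statistics.pvariance(lengths),
    }

def overlap_ratio(spans):
    """Proportion of duplicated characters across adjacent chunks, given
    (start, end) index pairs into the original text."""
    total = sum(end - start for start, end in spans)
    dup = sum(max(0, spans[i][1] - spans[i + 1][0]) for i in range(len(spans) - 1))
    return dup / total if total else 0.0
```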

The start_end_idx field recorded for each chunk is critical for corpus remapping, as it enables the system to match chunks from a new chunking strategy back to the same raw text regions used to generate QA pairs.
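A minimal sketch of that matching, assuming spans are (start, end) character pairs as recorded per chunk (`remap` and its signature are hypothetical, not AutoRAG's API):

```python
def spans_overlap(a, b):
    """True if two half-open (start, end) character spans intersect."""
    return a[0] < b[1] and b[0] < a[1]

def remap(qa_span, new_chunks):
    """Find chunks from a new chunking run whose character spans cover the
    raw-text region a QA pair was generated from."""
    return [c["doc_id"] for c in new_chunks
            if spans_overlap(qa_span, c["start_end_idx"])]
```

Because both the old and new chunks index into the same raw text, span intersection is enough to reattach existing QA pairs to a re-chunked corpus without regenerating them.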

Related Pages

Implemented By
