Principle: Marker Inc Korea AutoRAG Document Chunking
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Information Retrieval, Text Segmentation |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Document chunking is the process of splitting parsed documents into smaller, semantically coherent passages that are suitable for indexing and retrieval in a RAG pipeline.
Description
After documents have been parsed into raw text, the resulting content is typically too long to serve as effective retrieval units. Chunking addresses this by dividing documents into passages of manageable size while attempting to preserve semantic coherence. The choice of chunking strategy and its hyperparameters directly affects retrieval quality: chunks that are too large dilute relevance signals, while chunks that are too small lose contextual meaning.
Several established chunking strategies are commonly used. Token-based chunking splits text at fixed token counts, offering simplicity and predictability. Sentence-based chunking respects sentence boundaries, preserving grammatical completeness. Recursive character chunking attempts to split at progressively smaller structural boundaries (paragraphs, then sentences, then words) until the desired size is achieved. Semantic chunking uses embedding similarity to detect topic shifts and places chunk boundaries at points where the content changes significantly. Each approach offers different trade-offs between computational cost, boundary quality, and consistency of chunk sizes.
Two critical hyperparameters govern chunking behavior: chunk size (the target number of tokens or characters per chunk) and overlap (the number of tokens or characters shared between consecutive chunks). Overlap ensures that information near chunk boundaries is not lost, improving retrieval recall at the cost of increased index size. The optimal values for these parameters depend on the nature of the documents, the embedding model being used, and the downstream retrieval method.
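The interaction of these two hyperparameters is easiest to see on a fixed-size (token-based) splitter. The sketch below is illustrative, not a specific library's implementation; it operates on a pre-tokenized list and steps the window forward by `chunk_size - overlap` so consecutive chunks share `overlap` tokens:

```python
def token_chunks(tokens, chunk_size, overlap):
    """Fixed-size windows over a token list; consecutive windows share
    `overlap` tokens so boundary-straddling content appears in both."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("require 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # the final window may be shorter than chunk_size
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# chunk_size=4, overlap=2 -> a new window starts every 2 tokens
chunks = token_chunks(list(range(10)), chunk_size=4, overlap=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [8, 9]]
```

Note how the overlap inflates the index: 10 tokens become 5 chunks totaling 18 tokens, which is the recall-versus-index-size trade-off described above.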
Usage
Document chunking is applied as the second step of the evaluation data creation workflow, immediately after parsing. The chunker accepts a parsed DataFrame (with columns for texts, path, page, and last_modified_datetime) and produces a chunked DataFrame with unique document IDs, passage contents, source paths, start/end character indices, and metadata. The configuration is provided via YAML, allowing multiple chunking strategies to be tested.
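A chunking run is therefore driven by a YAML file listing the strategies to evaluate. The fragment below is an illustrative sketch of that shape only; the exact module and option names vary by AutoRAG version and should be checked against its documentation:

```yaml
modules:
  - module_type: llama_index_chunk      # module name assumed; verify for your version
    chunk_method: [ Token, Sentence ]   # strategies to compare in one run
    chunk_size: [ 512, 1024 ]           # target tokens per chunk
    chunk_overlap: 128                  # tokens shared between consecutive chunks
```

Listing multiple values for a parameter expresses the grid of strategies to be tested against each other.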
Theoretical Basis
The generic chunking algorithm can be expressed as follows:
INPUT:  Parsed DataFrame R with columns (texts, path, page, last_modified_datetime)
OUTPUT: Chunked DataFrame C with columns (doc_id, contents, path, start_end_idx, metadata)

For each row r_i in R:
    text = r_i.texts
    chunks = ChunkStrategy(text, chunk_size, overlap)
    For each chunk c_j in chunks:
        doc_id = generate_uuid()
        contents = c_j.text
        path = r_i.path
        start_end_idx = (c_j.start_char, c_j.end_char)  # relative to original text
        metadata = {"page": r_i.page, ...}
        Append row to C
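The generic algorithm can be sketched in Python with pandas. The column names come from the schema above; the `window_splitter` helper is a hypothetical stand-in for any `ChunkStrategy` that returns `(text, start, end)` triples:

```python
import uuid
import pandas as pd

def chunk_dataframe(parsed: pd.DataFrame, splitter, chunk_size: int,
                    overlap: int) -> pd.DataFrame:
    """Apply `splitter` to each parsed row, emitting one row per chunk."""
    rows = []
    for r in parsed.itertuples():
        for text, start, end in splitter(r.texts, chunk_size, overlap):
            rows.append({
                "doc_id": str(uuid.uuid4()),
                "contents": text,
                "path": r.path,
                "start_end_idx": (start, end),  # relative to the original text
                "metadata": {"page": r.page,
                             "last_modified_datetime": r.last_modified_datetime},
            })
    return pd.DataFrame(rows)

def window_splitter(text, chunk_size, overlap):
    """Illustrative character-window strategy returning (text, start, end)."""
    step = chunk_size - overlap
    return [(text[i:i + chunk_size], i, min(i + chunk_size, len(text)))
            for i in range(0, len(text), step)]
```

Any strategy from the Description section can be plugged in as `splitter`, as long as it reports character offsets for `start_end_idx`.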
Recursive Character Splitting is one of the most commonly used strategies:
function RecursiveCharSplit(text, separators, chunk_size, overlap):
    if length(text) <= chunk_size:
        return [text]
    sep = first separator in separators that occurs in text
    segments = split(text, sep)
    chunks = []
    current = ""
    for segment in segments:
        if length(segment) > chunk_size:
            # segment is itself oversized: recurse with the remaining, finer separators
            flush current into chunks
            chunks += RecursiveCharSplit(segment, separators after sep, chunk_size, overlap)
        else if length(current + segment) > chunk_size:
            chunks.append(current)
            current = last 'overlap' characters of current + segment
        else:
            current += segment
    chunks.append(current)
    return chunks
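A runnable Python version of this pseudocode might look as follows. This is a sketch, not AutoRAG's or any particular library's implementation; the separator list and fallback behavior are illustrative choices:

```python
def recursive_char_split(text, separators, chunk_size, overlap):
    """Split at the coarsest separator present, recursing with the
    remaining finer separators on any segment still over chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for idx, sep in enumerate(separators):
        if sep in text:
            break
    else:
        # no separator occurs at all: fall back to hard character windows
        step = chunk_size - overlap
        return [text[j:j + chunk_size] for j in range(0, len(text), step)]

    chunks, current = [], ""
    for segment in text.split(sep):
        if len(segment) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_char_split(segment, separators[idx + 1:],
                                               chunk_size, overlap))
        elif current and len(current) + len(sep) + len(segment) > chunk_size:
            chunks.append(current)
            # carry the last `overlap` characters across the boundary
            current = (current[-overlap:] + sep + segment) if overlap else segment
        else:
            current = (current + sep + segment) if current else segment
    if current:
        chunks.append(current)
    return chunks

parts = recursive_char_split("aaaa. bbbb. cccc. dddd",
                             ["\n\n", "\n", ". "], chunk_size=10, overlap=2)
# -> ['aaaa. bbbb', 'bb. cccc', 'cc. dddd']
```

This assumes `overlap` is small relative to `chunk_size`; a carried-over tail plus a large segment can slightly exceed the target size, which production splitters guard against.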
Chunk quality metrics commonly considered include:
| Metric | Description |
|---|---|
| Average chunk length | Mean number of tokens per chunk; should match the retrieval model's optimal input length |
| Length variance | Low variance indicates consistent chunk sizes |
| Boundary quality | Fraction of chunk boundaries that align with sentence or paragraph boundaries |
| Overlap ratio | Proportion of duplicated content across adjacent chunks |
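These metrics are straightforward to compute over a produced chunk list. The sketch below uses character counts for simplicity (token counts in practice) and a hypothetical punctuation heuristic for boundary quality:

```python
from statistics import mean, pvariance

def chunk_stats(chunks, overlap):
    """Summary statistics for a list of chunk strings.
    `overlap` is the configured overlap used to produce the chunks."""
    lengths = [len(c) for c in chunks]
    total = sum(lengths)
    return {
        "avg_length": mean(lengths),
        "length_variance": pvariance(lengths),
        # heuristic: a boundary is "good" if the chunk ends at sentence punctuation
        "boundary_quality": sum(c.rstrip().endswith((".", "!", "?"))
                                for c in chunks) / len(chunks),
        # duplicated content across the len(chunks)-1 adjacent pairs,
        # as a fraction of all indexed characters
        "overlap_ratio": (len(chunks) - 1) * overlap / total if total else 0.0,
    }
```

Comparing these statistics across candidate YAML configurations gives a cheap first filter before running full retrieval evaluation.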
The start_end_idx field recorded for each chunk is critical for corpus remapping, as it enables the system to match chunks from a new chunking strategy back to the same raw text regions used to generate QA pairs.
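Remapping then reduces to interval intersection over these recorded spans. This is a minimal sketch of that idea, not AutoRAG's actual remapping code; the variable names are hypothetical:

```python
def overlapping_chunks(qa_span, chunk_spans):
    """Return doc_ids of chunks whose (start, end) character spans intersect
    the raw-text span a QA pair was generated from. Intervals are half-open."""
    qs, qe = qa_span
    return [doc_id for doc_id, (s, e) in chunk_spans.items() if s < qe and qs < e]

# spans from a NEW chunking strategy over the same raw text
new_corpus = {"c1": (0, 100), "c2": (80, 180), "c3": (160, 260)}
# a QA pair generated from raw characters 90..120 now maps to c1 and c2
matches = overlapping_chunks((90, 120), new_corpus)
```

Because the indices are relative to the original parsed text, QA pairs survive a re-chunk without regenerating them.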