Principle:Unstructured IO Unstructured Basic Chunking
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Text_Splitting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A sequential text splitting strategy that combines consecutive document elements into chunks of a target size without regard to document structure boundaries.
Description
Basic chunking is the simplest chunking strategy. It processes elements sequentially, accumulating them into chunks until a size threshold is reached, then starts a new chunk. Unlike section-aware chunking, basic chunking does not consider document structure (titles, sections) when deciding where to split.
This approach works well when document structure is flat or irrelevant to the downstream task. Because every chunk is filled toward the same limit and never exceeds the hard maximum, chunk sizes are bounded and fairly uniform, which is important for embedding models and retrieval systems that are sensitive to input length.
The strategy supports both character-based and token-based size limits, optional overlap between consecutive chunks for context continuity, and preservation of original element references in chunk metadata.
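
As a rough illustration of how a chunk can keep references to the elements it was built from, the sketch below merges element texts while storing the originals alongside the combined text. The `Element` and `Chunk` classes, the `merge` helper, and the `"\n\n"` separator are hypothetical names chosen for this example, not the library's own types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    text: str
    category: str


@dataclass
class Chunk:
    """Combined text plus references to the elements it came from."""
    text: str
    orig_elements: List[Element] = field(default_factory=list)


def merge(elements: List[Element], separator: str = "\n\n") -> Chunk:
    # Join the texts, but keep the source elements so downstream code
    # can still inspect their categories or other metadata.
    return Chunk(
        text=separator.join(e.text for e in elements),
        orig_elements=list(elements),
    )


parts = [Element("Q3 results", "Title"), Element("Revenue grew.", "NarrativeText")]
chunk = merge(parts)
```

Keeping the originals by reference means a retrieval hit on `chunk.text` can be traced back to the individual elements without re-parsing the document.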
Usage
Use this principle when you need uniform chunk sizes and document structure is not important for your retrieval task. It is appropriate for flat documents (plain text, transcripts, chat logs) or when the downstream embedding model has a fixed context window. For documents with clear section structure, prefer section-aware chunking (chunk_by_title).
Theoretical Basis
Basic chunking uses a greedy sequential fill algorithm:
# Abstract basic chunking algorithm
chunks = []
current_chunk = []
current_size = 0
for element in elements:
    element_size = len(str(element))
    if current_size + element_size > soft_max and current_chunk:
        chunks.append(merge(current_chunk))
        # Apply overlap from end of previous chunk
        current_chunk = get_overlap(current_chunk, overlap_size)
        current_size = size_of(current_chunk)
    current_chunk.append(element)
    current_size += element_size
if current_chunk:
    chunks.append(merge(current_chunk))
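
The greedy fill above can be made concrete with a minimal runnable sketch. It assumes elements are plain strings, that merged texts are joined with a `"\n\n"` separator, and that overlap is measured in characters and applied between every pair of consecutive chunks. The function name `basic_chunk` and these choices are assumptions for illustration and may differ from any particular library's behavior.

```python
from typing import List


def basic_chunk(
    texts: List[str], soft_max: int, overlap: int = 0, separator: str = "\n\n"
) -> List[str]:
    """Greedily pack consecutive texts into chunks of about soft_max characters."""
    chunks: List[str] = []
    current: List[str] = []  # pieces of the chunk being built
    for text in texts:
        candidate = separator.join(current + [text])
        if current and len(candidate) > soft_max:
            # Adding this element would overflow: close the current chunk.
            finished = separator.join(current)
            chunks.append(finished)
            # Seed the next chunk with the tail of the finished one.
            current = [finished[-overlap:]] if overlap > 0 else []
        current.append(text)
    if current:
        chunks.append(separator.join(current))
    return chunks


elements = [
    "Intro line.",
    "Second element here.",
    "Third.",
    "A much longer fourth element of text.",
]
chunks = basic_chunk(elements, soft_max=40)
chunks_with_overlap = basic_chunk(elements, soft_max=40, overlap=5)
```

With `soft_max=40`, the first two elements share a chunk and the remaining two each get their own, because adding either to its predecessor would exceed the target. This sketch does not enforce a hard maximum, so a single oversized element still becomes one oversized chunk.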
Key parameters:
- hard_max (max_characters): Absolute maximum chunk size. Elements exceeding this are split mid-text.
- soft_max (new_after_n_chars): Target size after which a new chunk starts at the next element boundary.
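- overlap: Number of trailing characters from the previous chunk prepended to the next chunk for context continuity. In unstructured, overlap by default applies only between chunks produced by splitting an oversized element; setting overlap_all=True extends it to chunks built from whole elements.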
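
To show how the hard maximum and overlap interact when one element is too large, here is a small sketch that splits a long string into pieces of at most `hard_max` characters, preferring to break at a space, and starts each new piece `overlap` characters before the previous break. The function `split_oversized` and its whitespace heuristic are assumptions made for this example, not the library's actual splitter.

```python
from typing import List


def split_oversized(text: str, hard_max: int, overlap: int = 0) -> List[str]:
    """Split text into pieces no longer than hard_max, with optional overlap."""
    if not 0 <= overlap < hard_max:
        raise ValueError("overlap must be non-negative and smaller than hard_max")
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + hard_max, len(text))
        # If the window would cut a word, back up to the last space in it.
        if end < len(text) and not text[end].isspace():
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        piece = text[start:end].strip()
        if piece:
            pieces.append(piece)
        if end == len(text):
            break
        # Step back by `overlap` so the next piece repeats the tail.
        start = end - overlap
    return pieces


sample = "alpha beta gamma delta epsilon"
pieces = split_oversized(sample, hard_max=12)
pieces_overlap = split_oversized(sample, hard_max=12, overlap=3)
```

Restricting the search for a break point to positions after `start + overlap` guarantees that each iteration moves forward, so the loop terminates even when the overlap is large relative to the window.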