Principle:Unstructured IO Unstructured Basic Chunking

Knowledge Sources	Unstructured, Unstructured Docs, Chunking for RAG
Domains	Document_Processing, RAG, Text_Splitting
Last Updated	2026-02-12 00:00 GMT

Overview

A sequential text splitting strategy that combines consecutive document elements into chunks of a target size without regard to document structure boundaries.

Description

Basic chunking is the simplest chunking strategy. It processes elements sequentially, accumulating them into chunks until a size threshold is reached, then starts a new chunk. Unlike section-aware chunking, basic chunking does not consider document structure (titles, sections) when deciding where to split.

This approach works well when document structure is flat or irrelevant to the downstream task. It keeps every chunk within a fixed maximum size and makes chunk sizes fairly uniform, which matters for embedding models and retrieval systems that are sensitive to input length.

The strategy supports both character-based and token-based size limits, optional overlap between consecutive chunks for context continuity, and preservation of original element references in chunk metadata.
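The features above can be illustrated with a minimal, self-contained sketch. It is not the Unstructured implementation: the `Chunk` record, `fill_chunks`, and the two measure functions are hypothetical names chosen for this example, and whitespace word counting is only a crude stand-in for a real tokenizer. Overlap and splitting of oversized elements are left out to keep the focus on the size measure and on retaining the original elements.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Chunk:
    """Hypothetical chunk record: merged text plus the elements it came from."""
    text: str
    orig_elements: List[str] = field(default_factory=list)


def count_chars(text: str) -> int:
    """Character-based size measure."""
    return len(text)


def count_words(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def fill_chunks(
    elements: List[str],
    limit: int,
    measure: Callable[[str], int] = count_chars,
) -> List[Chunk]:
    """Greedily pack whole elements into chunks of at most `limit` units.

    A single element larger than `limit` is emitted as its own chunk;
    splitting such elements is outside the scope of this sketch.
    """
    chunks: List[Chunk] = []
    pending: List[str] = []
    for element in elements:
        candidate = "\n\n".join(pending + [element])
        # Close the current chunk if this element would push it past the limit.
        if pending and measure(candidate) > limit:
            chunks.append(Chunk("\n\n".join(pending), list(pending)))
            pending = []
        pending.append(element)
    if pending:
        chunks.append(Chunk("\n\n".join(pending), list(pending)))
    return chunks


elements = ["Alpha beta.", "Gamma delta epsilon.", "Zeta.", "Eta theta iota kappa."]
by_chars = fill_chunks(elements, limit=35)                      # character budget
by_words = fill_chunks(elements, limit=6, measure=count_words)  # word budget

for chunk in by_chars:
    print(len(chunk.text), len(chunk.orig_elements))
```

Passing the size measure as a function keeps the packing loop identical for character and token budgets, and storing `orig_elements` on each chunk lets a downstream consumer trace a retrieved chunk back to the elements that produced it.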

Usage

Use this principle when you need uniform chunk sizes and document structure is not important for your retrieval task. It is appropriate for flat documents (plain text, transcripts, chat logs) or when the downstream embedding model has a fixed context window. For documents with clear section structure, prefer section-aware chunking (chunk_by_title).

Theoretical Basis

Basic chunking uses a greedy sequential fill algorithm:

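# Abstract basic chunking algorithm (illustrative sketch, not the library source)
# Splitting of elements larger than hard_max is omitted; separator length is
# ignored in the size count for simplicity.
def basic_chunk(elements, soft_max, overlap_size=0):
    chunks = []
    current_chunk = []  # text pieces accumulated for the chunk being built
    current_size = 0

    for element in elements:
        text = str(element)
        # Close the current chunk if adding this element would pass soft_max
        if current_chunk and current_size + len(text) > soft_max:
            merged = "\n\n".join(current_chunk)
            chunks.append(merged)
            # Apply overlap: seed the next chunk with the tail of the previous one
            tail = merged[-overlap_size:] if overlap_size > 0 else ""
            current_chunk = [tail] if tail else []
            current_size = len(tail)
        current_chunk.append(text)
        current_size += len(text)

    # Flush whatever remains after the last element
    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks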

Key parameters:

hard_max (max_characters): Absolute maximum chunk size. Elements exceeding this are split mid-text.
soft_max (new_after_n_chars): Target size after which a new chunk starts at the next element boundary.
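overlap: Number of trailing characters from the previous chunk prepended to the next chunk for context continuity. In the Unstructured library this applies by default only to chunks produced by splitting an oversized element; setting overlap_all extends it to every chunk boundary.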
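To make the interaction between hard_max and overlap concrete, here is a small sketch of how a single oversized element might be split. It is an assumption-laden illustration, not the library's splitter: `split_oversized` is a hypothetical helper, and it cuts at fixed character offsets, whereas a production splitter would likely prefer whitespace or sentence boundaries.

```python
from typing import List


def split_oversized(text: str, hard_max: int, overlap: int = 0) -> List[str]:
    """Split `text` into pieces of at most `hard_max` characters.

    Each piece after the first begins with the last `overlap` characters
    of the previous piece. Cuts fall at fixed offsets and may land mid-word.
    """
    if hard_max <= 0:
        raise ValueError("hard_max must be positive")
    if not 0 <= overlap < hard_max:
        raise ValueError("overlap must satisfy 0 <= overlap < hard_max")

    pieces: List[str] = []
    step = hard_max - overlap  # new characters contributed by each later piece
    start = 0
    while start < len(text):
        pieces.append(text[start:start + hard_max])
        if start + hard_max >= len(text):
            break  # this piece reached the end of the text
        start += step
    return pieces


pieces = split_oversized("abcdefghijklmnopqrstuvwxyz", hard_max=10, overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Because the overlap is counted inside the hard_max budget, no piece exceeds the limit; the cost is that each later piece carries only `hard_max - overlap` characters of new text.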

Related Pages

Implemented By

Implementation:Unstructured_IO_Unstructured_Chunk_Elements

Uses Heuristic

Heuristic:Unstructured_IO_Unstructured_Chunk_Size_Tuning
