Principle:Unstructured IO Unstructured Basic Chunking

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Text_Splitting
Last Updated 2026-02-12 00:00 GMT

Overview

A sequential text splitting strategy that combines consecutive document elements into chunks of a target size without regard to document structure boundaries.

Description

Basic chunking is the simplest chunking strategy. It processes elements sequentially, accumulating them into chunks until a size threshold is reached, then starts a new chunk. Unlike section-aware chunking, basic chunking does not consider document structure (titles, sections) when deciding where to split.

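This approach works well when document structure is flat or irrelevant to the downstream task. It keeps every chunk at or below a hard maximum and fills chunks toward a target size, which matters for embedding models and retrieval systems that are sensitive to input length. Chunk sizes are bounded rather than identical, because splits normally fall on element boundaries.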

The strategy supports both character-based and token-based size limits, optional overlap between consecutive chunks for context continuity, and preservation of original element references in chunk metadata.
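Two of the features above, a size limit measured in either characters or tokens and original-element references kept on each chunk, can be illustrated with a minimal sketch. Everything named here (`Element`, `Chunk`, `orig_elements`, `chunk_with_provenance`, `measure`) is a hypothetical stand-in rather than the library's API, and the whitespace word count is only a crude substitute for a real tokenizer. Overlap and mid-text splitting are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    text: str
    category: str = "NarrativeText"


@dataclass
class Chunk:
    """A merged chunk that remembers which elements it was built from."""
    text: str
    orig_elements: List[Element] = field(default_factory=list)


def chunk_with_provenance(
    elements: List[Element],
    max_size: int,
    measure: Callable[[str], int] = len,  # len -> characters; swap in a token counter
    sep: str = "\n\n",                    # assumed separator between merged elements
) -> List[Chunk]:
    chunks: List[Chunk] = []
    group: List[Element] = []
    for element in elements:
        candidate = sep.join(e.text for e in group + [element])
        # Close the open group when adding this element would exceed the limit.
        if group and measure(candidate) > max_size:
            chunks.append(Chunk(sep.join(e.text for e in group), list(group)))
            group = []
        group.append(element)
    if group:
        chunks.append(Chunk(sep.join(e.text for e in group), list(group)))
    return chunks


def count_words(text: str) -> int:
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())


elements = [Element("one two"), Element("three four five"), Element("six")]
by_tokens = chunk_with_provenance(elements, max_size=5, measure=count_words, sep=" ")
by_chars = chunk_with_provenance(elements, max_size=12, sep=" ")
print([c.text for c in by_tokens])
print([len(c.orig_elements) for c in by_tokens])
```

Because each `Chunk` keeps its source elements, a retrieval hit can be traced back to the exact elements that produced it. Note that this sketch keeps an element larger than `max_size` whole instead of splitting it.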

Usage

Use this principle when you need bounded, roughly uniform chunk sizes and document structure is not important for your retrieval task. It is appropriate for flat documents (plain text, transcripts, chat logs) or when the downstream embedding model has a fixed context window. For documents with clear section structure, prefer section-aware chunking (chunk_by_title).

Theoretical Basis

Basic chunking uses a greedy sequential fill algorithm:

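# Abstract basic chunking algorithm (pseudocode: merge, split_text,
# get_overlap and size_of are placeholder helpers, not library functions)
chunks = []
current_chunk = []   # elements accumulated for the chunk being built
current_size = 0     # text length of current_chunk, including any overlap

for element in elements:
    element_size = len(str(element))

    # An element larger than hard_max can never fit: close the open chunk,
    # then split the element mid-text into pieces of at most hard_max.
    if element_size > hard_max:
        if current_chunk:
            chunks.append(merge(current_chunk))
        chunks.extend(split_text(element, hard_max, overlap_size))
        current_chunk, current_size = [], 0
        continue

    # Close the open chunk once it has reached soft_max, or when adding
    # this element would push it past hard_max.
    if current_chunk and (current_size >= soft_max
                          or current_size + element_size > hard_max):
        chunks.append(merge(current_chunk))
        # Carry overlap from the end of the previous chunk into the next one
        current_chunk = get_overlap(current_chunk, overlap_size)
        current_size = size_of(current_chunk)

    current_chunk.append(element)
    current_size += element_size

if current_chunk:
    chunks.append(merge(current_chunk))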

Key parameters:

  • hard_max (max_characters): Absolute maximum chunk size. Elements exceeding this are split mid-text.
  • soft_max (new_after_n_chars): Target size after which a new chunk starts at the next element boundary.
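  • overlap: Number of trailing characters from the previous chunk prepended to the next chunk for context continuity. By default it applies only between the pieces of an oversized element that was split mid-text; setting overlap_all extends it to every pair of consecutive chunks.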
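How the three parameters interact can be seen in a small runnable sketch over plain strings. It is a simplified illustration, not the library's implementation: `basic_chunk` and `split_oversized` are hypothetical names, the `sep` separator is an assumption, and overlap is applied only to the pieces of a split oversized element.

```python
from typing import List, Optional


def split_oversized(text: str, hard_max: int, overlap: int) -> List[str]:
    """Cut one text into windows of at most hard_max characters.

    Every window after the first starts with the last `overlap`
    characters of the window before it.
    """
    step = hard_max - overlap  # new characters consumed per window
    pieces: List[str] = []
    start = 0
    while True:
        pieces.append(text[start:start + hard_max])
        if start + hard_max >= len(text):
            break
        start += step
    return pieces


def basic_chunk(
    texts: List[str],
    hard_max: int = 500,
    soft_max: Optional[int] = None,
    overlap: int = 0,
    sep: str = "\n\n",  # assumed separator between merged elements
) -> List[str]:
    if not 0 <= overlap < hard_max:
        raise ValueError("overlap must satisfy 0 <= overlap < hard_max")
    # Without an explicit soft limit, the hard limit is the only threshold.
    soft_max = hard_max if soft_max is None else min(soft_max, hard_max)

    chunks: List[str] = []
    current = ""  # merged text of the chunk being built
    for text in texts:
        if not text:
            continue
        if len(text) > hard_max:
            # Oversized element: flush the open chunk, then split mid-text.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_oversized(text, hard_max, overlap))
            continue
        if current:
            grown = len(current) + len(sep) + len(text)
            # Start a new chunk once soft_max is reached or hard_max would be exceeded.
            if len(current) >= soft_max or grown > hard_max:
                chunks.append(current)
                current = ""
        current = current + sep + text if current else text
    if current:
        chunks.append(current)
    return chunks


print(basic_chunk(["aaaa", "bbbb", "cccc"], hard_max=10, soft_max=6, sep=" "))
# ['aaaa bbbb', 'cccc']
print(basic_chunk(["ab", "abcdefghij", "xy"], hard_max=4, overlap=1, sep=" "))
# ['ab', 'abcd', 'defg', 'ghij', 'xy']
```

In the first call the open chunk reaches 9 characters, which is past soft_max, so the third element starts a new chunk even though hard_max still had room. In the second call the 10-character element exceeds hard_max and is split into windows that each repeat one trailing character of the previous window.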
