Implementation:Unstructured IO Unstructured Chunk Elements

From Leeroopedia
Knowledge Sources
Domains Document_Processing, RAG, Text_Splitting
Last Updated 2026-02-12 00:00 GMT

Overview

A concrete tool from the Unstructured library for basic sequential chunking of partitioned document elements.

Description

The chunk_elements function implements the basic chunking strategy. It processes elements sequentially, combining them into CompositeElement chunks based on character or token size limits. It supports inter-chunk overlap for context continuity and can preserve references to original source elements in chunk metadata.
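As a rough illustration of the sequential strategy described above, elements are combined in order, a soft limit closes the current chunk at the next element boundary, and a hard limit splits an oversized element mid-text. This is a simplified sketch, not the library's implementation; `sketch_chunk` is a hypothetical helper operating on plain strings:

```python
def sketch_chunk(texts, max_characters=500, new_after_n_chars=None):
    """Illustrative sketch of basic sequential chunking (NOT library code)."""
    # Soft limit defaults to the hard limit when not given.
    soft_max = new_after_n_chars if new_after_n_chars is not None else max_characters
    chunks, current = [], ""
    for text in texts:
        # Soft limit: close the chunk at the next element boundary.
        if current and len(current) >= soft_max:
            chunks.append(current)
            current = ""
        # Try to combine with the current chunk, separated by a blank line.
        candidate = f"{current}\n\n{text}" if current else text
        if len(candidate) <= max_characters:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Hard limit: split an oversized element mid-text.
            while len(text) > max_characters:
                chunks.append(text[:max_characters])
                text = text[max_characters:]
            current = text
    if current:
        chunks.append(current)
    return chunks
```

The real function operates on Element objects and emits CompositeElement chunks with metadata, but the size logic follows this shape.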

Usage

Import this function when you need to split partitioned elements into uniform-sized chunks for embedding or retrieval. Use this instead of chunk_by_title when document section structure is not relevant to your use case.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/chunking/basic.py
  • Lines: 24-92

Signature

def chunk_elements(
    elements: Iterable[Element],
    *,
    include_orig_elements: Optional[bool] = None,
    max_characters: Optional[int] = None,
    max_tokens: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
    new_after_n_tokens: Optional[int] = None,
    overlap: Optional[int] = None,
    overlap_all: Optional[bool] = None,
    tokenizer: Optional[str] = None,
) -> list[Element]:
    """Chunk elements using basic sequential strategy.

    Args:
        elements: Iterable of Element objects to chunk.
        include_orig_elements: Preserve original elements in chunk metadata.
        max_characters: Hard maximum chunk size in characters (default 500).
        max_tokens: Hard maximum chunk size in tokens.
        new_after_n_chars: Soft max to trigger new chunk at next element boundary.
        new_after_n_tokens: Soft max in tokens.
        overlap: Character overlap between consecutive chunks.
        overlap_all: Apply overlap between all chunks, not only the splits of oversized elements.
        tokenizer: Tokenizer name for token-based chunking.
    Returns:
        List of chunked elements (CompositeElement for text, TableChunk for tables).
    """

Import

from unstructured.chunking.basic import chunk_elements

I/O Contract

Inputs

Name                   Type               Required  Description
elements               Iterable[Element]  Yes       Elements from partitioning
max_characters         Optional[int]      No        Hard max chunk size in characters (default 500)
new_after_n_chars      Optional[int]      No        Soft max to start a new chunk at the next element boundary
max_tokens             Optional[int]      No        Hard max chunk size in tokens
new_after_n_tokens     Optional[int]      No        Soft max chunk size in tokens
overlap                Optional[int]      No        Character overlap between consecutive chunks
overlap_all            Optional[bool]     No        Apply overlap between all chunks, not only oversized-element splits
include_orig_elements  Optional[bool]     No        Store original elements in metadata.orig_elements
tokenizer              Optional[str]      No        Tokenizer name for token-based sizing
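The overlap parameter can be pictured as prepending the tail of each chunk to the start of its successor for context continuity. A minimal sketch of that idea, assuming chunks are plain strings (`apply_overlap` is a hypothetical helper, not part of the Unstructured API):

```python
def apply_overlap(chunks, overlap):
    """Prepend the last `overlap` characters of each chunk to the next one."""
    if overlap <= 0 or len(chunks) < 2:
        return list(chunks)
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        # Carry trailing context from the previous chunk forward.
        out.append(prev[-overlap:] + cur)
    return out
```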

Outputs

Name    Type           Description
return  list[Element]  Chunked elements: CompositeElement for merged text, TableChunk for split tables. Each chunk carries metadata.orig_elements when include_orig_elements is True.
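The orig_elements contract can be illustrated with stand-in classes (`FakeElement` and `FakeCompositeChunk` are hypothetical, not the library's types): a composite chunk merges element texts while retaining references to its source elements, mirroring metadata.orig_elements when include_orig_elements is True.

```python
from dataclasses import dataclass, field

@dataclass
class FakeElement:
    """Stand-in for a partitioned Element (illustrative only)."""
    text: str

@dataclass
class FakeCompositeChunk:
    """Stand-in for a CompositeElement with orig_elements metadata."""
    text: str
    orig_elements: list = field(default_factory=list)

def combine(elements):
    # Merge element texts and record the originals, mirroring
    # metadata.orig_elements when include_orig_elements=True.
    return FakeCompositeChunk(
        text="\n\n".join(e.text for e in elements),
        orig_elements=list(elements),
    )
```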

Usage Examples

Basic Character-Based Chunking

from unstructured.partition.auto import partition
from unstructured.chunking.basic import chunk_elements

elements = partition(filename="report.pdf")

chunks = chunk_elements(
    elements,
    max_characters=1000,
    new_after_n_chars=800,
    overlap=100,
)

for chunk in chunks:
    print(f"Type: {type(chunk).__name__}, Length: {len(str(chunk))}")

Token-Based Chunking

from unstructured.chunking.basic import chunk_elements

# `elements` comes from a prior partition(...) call, as in the example above
chunks = chunk_elements(
    elements,
    max_tokens=256,
    new_after_n_tokens=200,
    tokenizer="cl100k_base",
)

Via Dispatch Function

from unstructured.chunking.dispatch import chunk

# `elements` comes from a prior partition(...) call, as in the example above
chunks = chunk(
    elements,
    chunking_strategy="basic",
    max_characters=500,
    overlap=50,
    include_orig_elements=True,
)

Related Pages

Implements Principle

Uses Heuristic
