Implementation: Unstructured IO chunk_elements
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, RAG, Text_Splitting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Concrete tool from the Unstructured library for basic sequential chunking of partitioned document elements.
Description
The chunk_elements function implements the basic chunking strategy. It processes elements sequentially, combining them into CompositeElement chunks based on character or token size limits. It supports inter-chunk overlap for context continuity and can preserve references to original source elements in chunk metadata.
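The greedy sequential packing described above can be sketched in plain Python. This is an illustrative reimplementation, not the library's code; the parameter names `max_characters`, `new_after_n_chars`, and `overlap` mirror the real ones, but the real function operates on Element objects and also splits oversized single elements, which this sketch omits:

```python
def pack_elements(texts, max_characters=500, new_after_n_chars=None, overlap=0):
    """Greedy sequential packing sketch: append element texts to the current
    chunk until a size limit is hit, then start a new chunk."""
    # Soft limit defaults to the hard limit when not given.
    soft_max = new_after_n_chars if new_after_n_chars is not None else max_characters
    chunks, current = [], ""
    for text in texts:
        candidate = (current + "\n" + text) if current else text
        # Start a new chunk when the hard max would be exceeded,
        # or when the current chunk has already reached the soft max.
        if current and (len(candidate) > max_characters or len(current) >= soft_max):
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = (current[-overlap:] + "\n" + text) if overlap else text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With `overlap` set, each new chunk begins with the last few characters of the previous one, which is the "context continuity" the real `overlap` parameter provides.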
Usage
Import this function when you need to split partitioned elements into uniform-sized chunks for embedding or retrieval. Use this instead of chunk_by_title when document section structure is not relevant to your use case.
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/chunking/basic.py
- Lines: 24-92
Signature
```python
def chunk_elements(
    elements: Iterable[Element],
    *,
    include_orig_elements: Optional[bool] = None,
    max_characters: Optional[int] = None,
    max_tokens: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
    new_after_n_tokens: Optional[int] = None,
    overlap: Optional[int] = None,
    overlap_all: Optional[bool] = None,
    tokenizer: Optional[str] = None,
) -> list[Element]:
    """Chunk elements using basic sequential strategy.

    Args:
        elements: Iterable of Element objects to chunk.
        include_orig_elements: Preserve original elements in chunk metadata.
        max_characters: Hard maximum chunk size in characters (default 500).
        max_tokens: Hard maximum chunk size in tokens.
        new_after_n_chars: Soft max to trigger new chunk at next element boundary.
        new_after_n_tokens: Soft max in tokens.
        overlap: Character overlap between consecutive chunks.
        overlap_all: Apply overlap to all chunks, not only those formed by
            splitting oversized elements.
        tokenizer: Tokenizer name for token-based chunking.

    Returns:
        List of chunked elements (CompositeElement for text, TableChunk for tables).
    """
```
Import
```python
from unstructured.chunking.basic import chunk_elements
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| elements | Iterable[Element] | Yes | Elements from partitioning |
| max_characters | int | No | Hard max chunk size in characters (default 500) |
| new_after_n_chars | int | No | Soft max to start new chunk at next element boundary |
| max_tokens | int | No | Hard max chunk size in tokens |
| new_after_n_tokens | int | No | Soft max in tokens |
| overlap | int | No | Character overlap between consecutive chunks |
| overlap_all | bool | No | Apply overlap to all chunks |
| include_orig_elements | bool | No | Store original elements in metadata.orig_elements |
| tokenizer | str | No | Tokenizer for token-based sizing |
Outputs
| Name | Type | Description |
|---|---|---|
| return | list[Element] | Chunked elements: CompositeElement for merged text, TableChunk for split tables. Each has metadata.orig_elements if include_orig_elements is True. |
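The provenance bookkeeping behind `include_orig_elements` can be sketched with plain dictionaries. This is a hypothetical structure for illustration only; the real library stores the source Element objects on `chunk.metadata.orig_elements` rather than plain ids:

```python
def chunk_with_provenance(elements, max_characters=500):
    """Pack (text, element_id) pairs into chunks, recording which source
    elements each chunk came from -- a sketch of include_orig_elements."""
    chunks, current_text, current_src = [], "", []
    for text, elem_id in elements:
        candidate = (current_text + "\n" + text) if current_text else text
        if current_text and len(candidate) > max_characters:
            # Close the current chunk along with its source-element ids.
            chunks.append({"text": current_text, "orig_elements": current_src})
            current_text, current_src = text, [elem_id]
        else:
            current_text, current_src = candidate, current_src + [elem_id]
    if current_text:
        chunks.append({"text": current_text, "orig_elements": current_src})
    return chunks
```

Keeping the source ids alongside each chunk is what lets downstream retrieval code map a matched chunk back to the pages or elements it was built from.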
Usage Examples
Basic Character-Based Chunking
```python
from unstructured.partition.auto import partition
from unstructured.chunking.basic import chunk_elements

elements = partition(filename="report.pdf")
chunks = chunk_elements(
    elements,
    max_characters=1000,
    new_after_n_chars=800,
    overlap=100,
)

for chunk in chunks:
    print(f"Type: {type(chunk).__name__}, Length: {len(str(chunk))}")
```
Token-Based Chunking
```python
from unstructured.chunking.basic import chunk_elements

chunks = chunk_elements(
    elements,
    max_tokens=256,
    new_after_n_tokens=200,
    tokenizer="cl100k_base",
)
```
Via Dispatch Function
```python
from unstructured.chunking.dispatch import chunk

chunks = chunk(
    elements,
    chunking_strategy="basic",
    max_characters=500,
    overlap=50,
    include_orig_elements=True,
)
```
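The dispatch pattern itself can be sketched as a registry that maps strategy names to chunking functions. The registry and stand-in chunkers below are hypothetical, shown only to illustrate how a `chunking_strategy` string selects an implementation:

```python
def chunk_basic(elements, **kwargs):
    # Stand-in for the basic chunker.
    return ["basic:" + e for e in elements]

def chunk_by_title_like(elements, **kwargs):
    # Stand-in for a title-aware chunker.
    return ["by_title:" + e for e in elements]

# Hypothetical registry keyed by strategy name.
CHUNKERS = {"basic": chunk_basic, "by_title": chunk_by_title_like}

def chunk(elements, chunking_strategy, **kwargs):
    """Resolve a strategy name to a chunker and delegate to it."""
    try:
        chunker = CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}")
    return chunker(elements, **kwargs)
```

This indirection is why callers can keep `chunking_strategy` as configuration (for example, read from a pipeline config file) rather than hard-coding an import of a specific chunking module.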