Principle: LangChain Document Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, NLP |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
A preprocessing step that splits large documents into smaller, overlapping chunks suitable for embedding and retrieval.
Description
Raw documents are often too long to embed as single vectors (embedding models have context limits) and too long to include entirely in LLM context windows. Document preparation splits them into chunks that:
- Fit within embedding model context limits
- Preserve semantic coherence (split at natural boundaries like paragraphs)
- Overlap slightly to avoid losing context at chunk boundaries
LangChain's Document class wraps text content with metadata, and text splitters produce lists of Document objects ready for vector store ingestion.
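The Document-plus-splitter pattern can be sketched in plain Python. The `Document` class and `split_into_documents` function below are minimal illustrative stand-ins (not LangChain's actual implementation) that show overlapping chunks carrying metadata, as a splitter's output would:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for a text-plus-metadata document wrapper."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_into_documents(text, chunk_size, overlap, metadata=None):
    """Produce overlapping Document chunks; a sketch of splitter output."""
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    return [Document(text[i:i + chunk_size], dict(metadata or {}, start=i))
            for i in range(0, len(text), step)]
```

Each chunk records its start offset in metadata, so retrieved chunks can be traced back to their position in the source document.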
Usage
Apply text splitting after loading documents and before adding them to a vector store. Choose chunk size based on the embedding model's context window and the retrieval granularity needed.
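One rough way to derive a character-level chunk size from a model's token limit is sketched below. The ~4 characters-per-token ratio is a common English-text heuristic, not a guarantee, and the headroom factor is an assumption chosen here for illustration:

```python
def pick_chunk_size(model_token_limit, chars_per_token=4, headroom=0.8):
    """Estimate a character chunk size that fits an embedding model's window.

    chars_per_token ~4 is a rough English-text heuristic; headroom leaves
    slack so chunks comfortably fit after tokenization variance.
    """
    return int(model_token_limit * chars_per_token * headroom)

pick_chunk_size(512)  # -> 1638 characters
```

For precise limits, count tokens with the embedding model's own tokenizer rather than relying on a character heuristic.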
Theoretical Basis
Recursive character splitting attempts to split at the most semantically meaningful boundary:
```python
# Abstract algorithm (not real code)
separators = ["\n\n", "\n", " ", ""]  # ordered by preference
for separator in separators:
    chunks = text.split(separator)
    if all(len(chunk) <= chunk_size for chunk in chunks):
        return chunks
# If chunks are still too large, recurse with the next separator
```
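The recursion step above can be made concrete. The sketch below is a self-contained, runnable version of the idea, not LangChain's actual implementation; in particular it omits the merging of small pieces back up toward the target chunk size that a production splitter performs:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Sketch of recursive character splitting (illustrative, simplified)."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    # The empty-string separator means "split into individual characters"
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size or not rest:
            chunks.append(piece)
        else:
            # Piece is still too large: recurse with the next separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

Splitting on paragraph breaks first, then sentences/lines, then words, means the algorithm only falls back to coarser cuts when a piece cannot fit at the preferred boundary.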