Principle:PacktPublishing LLM Engineers Handbook Chunking And Embedding
| Concept | Document chunking and vector embedding generation |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Feature Transformation |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_ChunkingDispatcher_And_EmbeddingDispatcher |
Overview
Chunking and Embedding is a two-stage transformation process that converts cleaned documents into vector-searchable representations for RAG (Retrieval-Augmented Generation) systems. The first stage — chunking — splits documents into smaller, semantically coherent segments. The second stage — embedding — converts each chunk into a dense vector representation using a neural embedding model.
Theory
Text Chunking
Text chunking is the process of splitting a document into smaller segments (chunks) that are suitable for retrieval. The goal is to produce chunks that are:
- Semantically coherent — Each chunk should contain a self-contained piece of information rather than cutting across topic boundaries
- Appropriately sized — Chunks must be small enough to fit within embedding model token limits and retrieval context windows, but large enough to carry meaningful information
- Consistently structured — Chunks from the same document type should follow similar size and formatting conventions
Chunking strategies vary by content type:
- Articles — Typically chunked by paragraph or section boundaries, respecting the document's natural structure
- Social media posts — May be kept as single chunks if short, or split by sentence for longer posts
- Code repositories — Chunked by logical units such as functions, classes, or file boundaries
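The article strategy above can be sketched in plain Python. This is an illustrative helper, not the handbook's actual chunker: it splits on paragraph boundaries, merges short paragraphs, and falls back to sentence splitting for oversized paragraphs so every chunk respects a size cap.

```python
import re

def chunk_article(text: str, max_chars: int = 1000) -> list[str]:
    """Split an article on paragraph boundaries, merging short paragraphs
    and splitting oversized ones so every chunk stays under max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}" if current else para
            continue
        if current:
            chunks.append(current)
        if len(para) <= max_chars:
            current = para
            continue
        # Oversized paragraph: fall back to sentence boundaries
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if len(current) + len(sentence) + 1 <= max_chars:
                current = f"{current} {sentence}".strip()
            else:
                if current:
                    chunks.append(current)
                current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A real pipeline would typically measure size in tokens rather than characters, using the embedding model's own tokenizer.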
Dense Vector Embedding
Dense vector embedding maps text into a high-dimensional continuous vector space where semantic similarity corresponds to geometric proximity. Given a text chunk t, an embedding model f produces a vector:
v = f(t) ∈ ℝ^d
where d is the embedding dimensionality (commonly 384, 768, or 1536 depending on the model).
The key property of dense embeddings is that semantically similar texts produce vectors that are close together:
cosine_similarity(v₁, v₂) ≈ semantic_similarity(t₁, t₂)
This property enables dense retrieval — given a query, we embed it into the same vector space and find the chunks whose vectors are nearest to the query vector.
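Dense retrieval can be demonstrated with a few lines of plain Python (illustrative helper names; a production system delegates this nearest-neighbour search to a vector database such as Qdrant):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec: list[float], chunk_vecs: list[list[float]], top_k: int = 2) -> list[int]:
    """Return the indices of the top_k chunks nearest to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

This brute-force scan is O(n·d) per query; approximate nearest-neighbour indexes trade a little recall for sublinear lookup.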
Sentence-Transformers Models
The embedding models used in this pipeline are from the sentence-transformers family, which are neural models fine-tuned on sentence-pair similarity tasks. These models:
- Accept variable-length text input (up to a maximum token limit)
- Produce fixed-dimensional dense vectors
- Are trained to maximize similarity between semantically related texts and minimize it between unrelated texts
- Support efficient batch inference for processing many chunks at once
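The interface shape these models expose — variable-length text in, fixed-dimensional unit vector out, with batch support — can be mimicked with a deterministic toy. To be clear, this hash-based stand-in carries no semantics at all (unlike a real sentence-transformers model, nearby vectors do not mean similar texts); it only illustrates the contract:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy 'embedding': hash-derived, L2-normalised vector.
    A stand-in for a real embedding model; it preserves no semantics."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i % len(digest)] - 128 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

def embed_batch(texts: list[str], dim: int = 8) -> list[list[float]]:
    # A real model would run one batched forward pass here; we just loop.
    return [toy_embed(t, dim) for t in texts]
```

With an actual sentence-transformers model, the equivalent call is `model.encode(list_of_texts)`, which returns one fixed-width vector per input.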
Dispatcher Pattern for Both Stages
Both chunking and embedding use the Dispatcher (Factory) pattern, routing documents to category-specific handlers. This ensures that:
- Articles are chunked differently from code repositories
- Embedding preprocessing can be customized per document type
- New document types can be added by implementing a handler and registering it in the dispatcher
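A minimal sketch of this registration mechanism follows. The handler class names here are illustrative, not the handbook's actual classes (those live in the linked Implementation page), and documents are simplified to dicts:

```python
class ChunkingDispatcher:
    """Routes a document to the chunker registered for its category."""
    _handlers: dict[str, type] = {}

    @classmethod
    def register(cls, category: str):
        def decorator(handler_cls):
            cls._handlers[category] = handler_cls
            return handler_cls
        return decorator

    @classmethod
    def dispatch(cls, document: dict) -> list[str]:
        handler_cls = cls._handlers.get(document["category"])
        if handler_cls is None:
            raise ValueError(f"No chunker registered for {document['category']!r}")
        return handler_cls().chunk(document["content"])

@ChunkingDispatcher.register("article")
class ArticleChunker:
    def chunk(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p]  # paragraph boundaries

@ChunkingDispatcher.register("repository")
class RepositoryChunker:
    def chunk(self, text: str) -> list[str]:
        return [text]  # e.g. one chunk per file or logical unit
```

Adding a new document type requires only a new handler class plus one `register` call; no dispatch code changes.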
How It Fits in Feature Engineering
Chunking and Embedding occupy the central transformation stages in the feature engineering pipeline:
- Query — Raw documents loaded from MongoDB
- Clean — Documents normalized and sanitized
- Chunk (this principle) — Cleaned documents split into segments
- Embed (this principle) — Chunks converted to vector representations
- Store — Embedded chunks persisted to Qdrant
These two stages are where raw text is transformed into the mathematical representations that power similarity search in the RAG system.
Design Considerations
- Chunk overlap — Some chunking strategies use overlapping windows to ensure that information at chunk boundaries is not lost. The overlap size is a tunable hyperparameter.
- Chunk metadata — Each chunk retains metadata from its parent document (author, source URL, document ID) to support filtered retrieval and provenance tracking.
- Embedding model selection — The choice of embedding model affects both the quality of retrieval and the dimensionality of stored vectors. Larger models produce better embeddings but require more storage and compute.
- Batch processing — Embedding is typically the most compute-intensive step. Batch inference reduces overhead by processing multiple chunks in a single forward pass through the model.
- Determinism — Both chunking and embedding should be deterministic: the same input always produces the same output. This enables reproducible pipeline runs and simplifies debugging.
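The chunk-overlap idea from the first consideration can be made concrete with a sliding-window sketch over pre-tokenised input (illustrative, with `size` and `overlap` as the tunable hyperparameters mentioned above). Note that it is also deterministic, satisfying the last consideration:

```python
def sliding_window_chunks(tokens: list[str], size: int = 6, overlap: int = 2) -> list[list[str]]:
    """Overlapping windows: consecutive chunks share `overlap` tokens,
    so information at chunk boundaries appears in both neighbours."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already reaches the end of the input
    return chunks
```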
Usage
Use the Chunking and Embedding pattern when:
- Transforming cleaned documents into vector-searchable chunks for RAG retrieval
- Building a semantic search index over a document corpus
- Preparing training data that requires fixed-size text segments with vector representations
- Implementing a feature engineering pipeline that bridges raw text and vector storage
Example
```python
from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)

# Chunk a cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)

# Each embedded chunk now has an 'embedding' field
print(f"Embedding dimension: {len(embedded_chunks[0].embedding)}")
```
See Also
- Implementation:PacktPublishing_LLM_Engineers_Handbook_ChunkingDispatcher_And_EmbeddingDispatcher
- Principle:PacktPublishing_LLM_Engineers_Handbook_Document_Cleaning
- Principle:PacktPublishing_LLM_Engineers_Handbook_Vector_Storage
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Chunking_Strategy_By_Content_Type