Principle:PacktPublishing LLM Engineers Handbook Chunking And Embedding
| Concept | Document chunking and vector embedding generation |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Feature Transformation |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_ChunkingDispatcher_And_EmbeddingDispatcher |
Overview
Chunking and Embedding is a two-stage transformation process that converts cleaned documents into vector-searchable representations for RAG (Retrieval-Augmented Generation) systems. The first stage — chunking — splits documents into smaller, semantically coherent segments. The second stage — embedding — converts each chunk into a dense vector representation using a neural embedding model.
Theory
Text Chunking
Text chunking is the process of splitting a document into smaller segments (chunks) that are suitable for retrieval. The goal is to produce chunks that are:
- Semantically coherent — Each chunk should contain a self-contained piece of information rather than cutting across topic boundaries
- Appropriately sized — Chunks must be small enough to fit within embedding model token limits and retrieval context windows, but large enough to carry meaningful information
- Consistently structured — Chunks from the same document type should follow similar size and formatting conventions
Chunking strategies vary by content type:
- Articles — Typically chunked by paragraph or section boundaries, respecting the document's natural structure
- Social media posts — May be kept as single chunks if short, or split by sentence for longer posts
- Code repositories — Chunked by logical units such as functions, classes, or file boundaries
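The article strategy above can be sketched in plain Python. This is an illustrative helper, not the handbook's actual chunker: it splits on paragraph boundaries, merges short paragraphs, and falls back to sentence splitting for oversized paragraphs so every chunk respects a size cap.

```python
import re

def chunk_article(text: str, max_chars: int = 1000) -> list[str]:
    """Split an article on paragraph boundaries, merging short paragraphs
    and splitting oversized ones so every chunk stays under max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}" if current else para
            continue
        if current:
            chunks.append(current)
        if len(para) <= max_chars:
            current = para
            continue
        # Oversized paragraph: fall back to sentence boundaries
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if len(current) + len(sentence) + 1 <= max_chars:
                current = f"{current} {sentence}".strip()
            else:
                if current:
                    chunks.append(current)
                current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A real pipeline would typically measure size in tokens rather than characters, using the embedding model's own tokenizer.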
Dense Vector Embedding
Dense vector embedding maps text into a high-dimensional continuous vector space where semantic similarity corresponds to geometric proximity. Given a text chunk t, an embedding model f produces a vector:
v = f(t) ∈ ℝ^d
where d is the embedding dimensionality (commonly 384, 768, or 1536 depending on the model).
The key property of dense embeddings is that semantically similar texts produce vectors that are close together:
cosine_similarity(v₁, v₂) ≈ semantic_similarity(t₁, t₂)
This property enables dense retrieval — given a query, we embed it into the same vector space and find the chunks whose vectors are nearest to the query vector.
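Dense retrieval can be demonstrated with a few lines of plain Python (illustrative helper names; a production system delegates this nearest-neighbour search to a vector database such as Qdrant):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec: list[float], chunk_vecs: list[list[float]], top_k: int = 2) -> list[int]:
    """Return the indices of the top_k chunks nearest to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

This brute-force scan is O(n·d) per query; approximate nearest-neighbour indexes trade a little recall for sublinear lookup.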
Sentence-Transformers Models
The embedding models used in this pipeline are from the sentence-transformers family, which are neural models fine-tuned on sentence-pair similarity tasks. These models:
- Accept variable-length text input (up to a maximum token limit)
- Produce fixed-dimensional dense vectors
- Are trained to maximize similarity between semantically related texts and minimize it between unrelated texts
- Support efficient batch inference for processing many chunks at once
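The interface shape these models expose — variable-length text in, fixed-dimensional unit vector out, with batch support — can be mimicked with a deterministic toy. To be clear, this hash-based stand-in carries no semantics at all (unlike a real sentence-transformers model, nearby vectors do not mean similar texts); it only illustrates the contract:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy 'embedding': hash-derived, L2-normalised vector.
    A stand-in for a real embedding model; it preserves no semantics."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i % len(digest)] - 128 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

def embed_batch(texts: list[str], dim: int = 8) -> list[list[float]]:
    # A real model would run one batched forward pass here; we just loop.
    return [toy_embed(t, dim) for t in texts]
```

With an actual sentence-transformers model, the equivalent call is `model.encode(list_of_texts)`, which returns one fixed-width vector per input.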
Dispatcher Pattern for Both Stages
Both chunking and embedding use the Dispatcher (Factory) pattern, routing documents to category-specific handlers. This ensures that:
- Articles are chunked differently from code repositories
- Embedding preprocessing can be customized per document type
- New document types can be added by implementing a handler and registering it in the dispatcher
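A minimal sketch of this registration mechanism follows. The handler class names here are illustrative, not the handbook's actual classes (those live in the linked Implementation page), and documents are simplified to dicts:

```python
class ChunkingDispatcher:
    """Routes a document to the chunker registered for its category."""
    _handlers: dict[str, type] = {}

    @classmethod
    def register(cls, category: str):
        def decorator(handler_cls):
            cls._handlers[category] = handler_cls
            return handler_cls
        return decorator

    @classmethod
    def dispatch(cls, document: dict) -> list[str]:
        handler_cls = cls._handlers.get(document["category"])
        if handler_cls is None:
            raise ValueError(f"No chunker registered for {document['category']!r}")
        return handler_cls().chunk(document["content"])

@ChunkingDispatcher.register("article")
class ArticleChunker:
    def chunk(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p]  # paragraph boundaries

@ChunkingDispatcher.register("repository")
class RepositoryChunker:
    def chunk(self, text: str) -> list[str]:
        return [text]  # e.g. one chunk per file or logical unit
```

Adding a new document type requires only a new handler class plus one `register` call; no dispatch code changes.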
How It Fits in Feature Engineering
Chunking and Embedding occupy the central transformation stages in the feature engineering pipeline:
- Query — Raw documents loaded from MongoDB
- Clean — Documents normalized and sanitized
- Chunk (this principle) — Cleaned documents split into segments
- Embed (this principle) — Chunks converted to vector representations
- Store — Embedded chunks persisted to Qdrant
These two stages are where raw text is transformed into the mathematical representations that power similarity search in the RAG system.
Design Considerations
- Chunk overlap — Some chunking strategies use overlapping windows to ensure that information at chunk boundaries is not lost. The overlap size is a tunable hyperparameter.
- Chunk metadata — Each chunk retains metadata from its parent document (author, source URL, document ID) to support filtered retrieval and provenance tracking.
- Embedding model selection — The choice of embedding model affects both the quality of retrieval and the dimensionality of stored vectors. Larger models produce better embeddings but require more storage and compute.
- Batch processing — Embedding is typically the most compute-intensive step. Batch inference reduces overhead by processing multiple chunks in a single forward pass through the model.
- Determinism — Both chunking and embedding should be deterministic: the same input always produces the same output. This enables reproducible pipeline runs and simplifies debugging.
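The chunk-overlap idea from the first consideration can be made concrete with a sliding-window sketch over pre-tokenised input (illustrative, with `size` and `overlap` as the tunable hyperparameters mentioned above). Note that it is also deterministic, satisfying the last consideration:

```python
def sliding_window_chunks(tokens: list[str], size: int = 6, overlap: int = 2) -> list[list[str]]:
    """Overlapping windows: consecutive chunks share `overlap` tokens,
    so information at chunk boundaries appears in both neighbours."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already reaches the end of the input
    return chunks
```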
Usage
Use the Chunking and Embedding pattern when:
- Transforming cleaned documents into vector-searchable chunks for RAG retrieval
- Building a semantic search index over a document corpus
- Preparing training data that requires fixed-size text segments with vector representations
- Implementing a feature engineering pipeline that bridges raw text and vector storage
Example
```python
from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)

# Chunk a cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)

# Each embedded chunk now has an 'embedding' field
print(f"Embedding dimension: {len(embedded_chunks[0].embedding)}")
```
See Also
- Implementation:PacktPublishing_LLM_Engineers_Handbook_ChunkingDispatcher_And_EmbeddingDispatcher
- Principle:PacktPublishing_LLM_Engineers_Handbook_Document_Cleaning
- Principle:PacktPublishing_LLM_Engineers_Handbook_Vector_Storage
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Chunking_Strategy_By_Content_Type