
Principle:PacktPublishing LLM Engineers Handbook Chunking And Embedding

From Leeroopedia


Concept Document chunking and vector embedding generation
Workflow Feature_Engineering
Pipeline Stage Feature Transformation
Repository PacktPublishing/LLM-Engineers-Handbook
Implemented By Implementation:PacktPublishing_LLM_Engineers_Handbook_ChunkingDispatcher_And_EmbeddingDispatcher

Overview

Chunking and Embedding is a two-stage transformation process that converts cleaned documents into vector-searchable representations for RAG (Retrieval-Augmented Generation) systems. The first stage — chunking — splits documents into smaller, semantically coherent segments. The second stage — embedding — converts each chunk into a dense vector representation using a neural embedding model.

Theory

Text Chunking

Text chunking is the process of splitting a document into smaller segments (chunks) that are suitable for retrieval. The goal is to produce chunks that are:

  • Semantically coherent — Each chunk should contain a self-contained piece of information rather than cutting across topic boundaries
  • Appropriately sized — Chunks must be small enough to fit within embedding model token limits and retrieval context windows, but large enough to carry meaningful information
  • Consistently structured — Chunks from the same document type should follow similar size and formatting conventions

Chunking strategies vary by content type:

  • Articles — Typically chunked by paragraph or section boundaries, respecting the document's natural structure
  • Social media posts — May be kept as single chunks if short, or split by sentence for longer posts
  • Code repositories — Chunked by logical units such as functions, classes, or file boundaries
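The article-chunking strategy above can be sketched as follows. This is a minimal illustration of paragraph-boundary chunking with a size limit, not the handbook's actual implementation; the function name and the 500-character limit are illustrative choices.

```python
def chunk_article(text: str, max_chars: int = 500) -> list[str]:
    """Split on paragraph boundaries, merging short paragraphs and
    hard-splitting any paragraph that exceeds the size limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chars:
            # Merge into the current chunk while it stays under the limit
            current = f"{current}\n\n{para}" if current else para
        else:
            if current:
                chunks.append(current)
            # Hard-split an oversized paragraph at the limit
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 1200
chunks = chunk_article(doc)
```

A real handler would also respect token limits of the downstream embedding model rather than raw character counts.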

Dense Vector Embedding

Dense vector embedding maps text into a high-dimensional continuous vector space where semantic similarity corresponds to geometric proximity. Given a text chunk t, an embedding model f produces a vector:

v = f(t) ∈ ℝ^d

where d is the embedding dimensionality (commonly 384, 768, or 1536 depending on the model).

The key property of dense embeddings is that semantically similar texts produce vectors that are close together:

cosine_similarity(v₁, v₂) ≈ semantic_similarity(t₁, t₂)

This property enables dense retrieval — given a query, we embed it into the same vector space and find the chunks whose vectors are nearest to the query vector.
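The proximity property can be demonstrated with plain Python. The three-dimensional vectors below are toy stand-ins for model-produced embeddings; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

query = [0.9, 0.1, 0.0]
relevant_chunk = [0.8, 0.2, 0.1]   # points in a similar direction
unrelated_chunk = [0.0, 0.1, 0.9]  # points elsewhere

# Dense retrieval ranks chunks by similarity to the query vector
assert cosine_similarity(query, relevant_chunk) > cosine_similarity(query, unrelated_chunk)
```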

Sentence-Transformers Models

The embedding models used in this pipeline are from the sentence-transformers family, which are neural models fine-tuned on sentence-pair similarity tasks. These models:

  • Accept variable-length text input (up to a maximum token limit)
  • Produce fixed-dimensional dense vectors
  • Are trained to maximize similarity between semantically related texts and minimize it between unrelated texts
  • Support efficient batch inference for processing many chunks at once
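The interface properties listed above can be illustrated with a toy stand-in for an embedding model. The hash-based "embedding" below carries no semantics whatsoever; it only demonstrates variable-length input, fixed-dimensional deterministic output, and batched processing. A real pipeline would load a sentence-transformers model here instead.

```python
import hashlib

EMBEDDING_DIM = 8  # real models use e.g. 384, 768, or 1536

def embed(text: str) -> list[float]:
    """Toy embedding: map the first EMBEDDING_DIM digest bytes to [0, 1)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 256.0 for b in digest[:EMBEDDING_DIM]]

def embed_batch(texts: list[str]) -> list[list[float]]:
    # A real model would run one batched forward pass here
    return [embed(t) for t in texts]

vectors = embed_batch(["short chunk", "a much longer chunk of text"])
assert all(len(v) == EMBEDDING_DIM for v in vectors)  # fixed dimensionality
assert embed("same input") == embed("same input")     # deterministic
```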

Dispatcher Pattern for Both Stages

Both chunking and embedding use the Dispatcher (Factory) pattern, routing documents to category-specific handlers. This ensures that:

  • Articles are chunked differently from code repositories
  • Embedding preprocessing can be customized per document type
  • New document types can be added by implementing a handler and registering it in the dispatcher
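A minimal sketch of the dispatcher pattern described above: handlers are registered per document category, and dispatch routes each input to its handler. Class, category, and handler names are illustrative, not the repository's actual API.

```python
class ChunkingDispatcher:
    _handlers = {}  # category -> handler function

    @classmethod
    def register(cls, category: str):
        def decorator(handler):
            cls._handlers[category] = handler
            return handler
        return decorator

    @classmethod
    def dispatch(cls, category: str, text: str) -> list[str]:
        if category not in cls._handlers:
            raise ValueError(f"No chunking handler for category: {category}")
        return cls._handlers[category](text)

@ChunkingDispatcher.register("article")
def chunk_by_paragraph(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p]

@ChunkingDispatcher.register("post")
def keep_whole(text: str) -> list[str]:
    return [text]  # short posts stay as a single chunk

chunks = ChunkingDispatcher.dispatch("article", "Intro.\n\nBody.")
```

Adding a new document type is then a matter of writing one handler and registering it, with no changes to the dispatcher itself.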

How It Fits in Feature Engineering

Chunking and Embedding occupy the central transformation stages in the feature engineering pipeline:

  1. Query — Raw documents loaded from MongoDB
  2. Clean — Documents normalized and sanitized
  3. Chunk (this principle) — Cleaned documents split into segments
  4. Embed (this principle) — Chunks converted to vector representations
  5. Store — Embedded chunks persisted to Qdrant

These two stages are where raw text is transformed into the mathematical representations that power similarity search in the RAG system.

Design Considerations

  • Chunk overlap — Some chunking strategies use overlapping windows to ensure that information at chunk boundaries is not lost. The overlap size is a tunable hyperparameter.
  • Chunk metadata — Each chunk retains metadata from its parent document (author, source URL, document ID) to support filtered retrieval and provenance tracking.
  • Embedding model selection — The choice of embedding model affects both the quality of retrieval and the dimensionality of stored vectors. Larger models produce better embeddings but require more storage and compute.
  • Batch processing — Embedding is typically the most compute-intensive step. Batch inference reduces overhead by processing multiple chunks in a single forward pass through the model.
  • Determinism — Both chunking and embedding should be deterministic: the same input always produces the same output. This enables reproducible pipeline runs and simplifies debugging.
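The chunk-overlap consideration above can be made concrete with a sliding-window sketch. Window and overlap sizes are illustrative hyperparameters, not the handbook's defaults.

```python
def sliding_window_chunks(text: str, window: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of `window` characters, each sharing
    `overlap` characters with its predecessor."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = [text[i:i + window] for i in range(0, len(text), step)]
    # Drop a trailing fragment already fully covered by the previous chunk
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

chunks = sliding_window_chunks("abcdefghij" * 30, window=100, overlap=20)
# Consecutive chunks share `overlap` characters, so information at a
# chunk boundary appears intact in at least one chunk
assert chunks[0][-20:] == chunks[1][:20]
```

The same deterministic slicing on the same input always yields the same chunks, which supports the reproducibility requirement above.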

Usage

Use the Chunking and Embedding pattern when:

  • Transforming cleaned documents into vector-searchable chunks for RAG retrieval
  • Building a semantic search index over a document corpus
  • Preparing training data that requires fixed-size text segments with vector representations
  • Implementing a feature engineering pipeline that bridges raw text and vector storage

Example

from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)

# Chunk a cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)

# Each embedded chunk now has an 'embedding' field
print(f"Embedding dimension: {len(embedded_chunks[0].embedding)}")
