Implementation:PacktPublishing LLM Engineers Handbook ChunkingDispatcher And EmbeddingDispatcher

From Leeroopedia


Type: API Doc
API: ChunkingDispatcher.dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument] and EmbeddingDispatcher.dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument
Source: llm_engineering/application/preprocessing/dispatchers.py:L51-134
Repository: PacktPublishing/LLM-Engineers-Handbook
Implements: Principle:PacktPublishing_LLM_Engineers_Handbook_Chunking_And_Embedding

Overview

The ChunkingDispatcher and EmbeddingDispatcher are two complementary dispatcher classes that implement the chunking and embedding stages of the feature engineering pipeline. Both follow the Dispatcher (Factory) pattern, routing documents to category-specific handlers based on the document's DataCategory. Together, they transform cleaned documents into embedded chunks suitable for vector storage and similarity search.

API Signatures

ChunkingDispatcher

class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]:
        ...  # body shown under Source Code

EmbeddingDispatcher

class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument:
        ...  # body shown under Source Code

Parameters

ChunkingDispatcher.dispatch

Parameter | Type | Description
data_model | VectorBaseDocument | A cleaned document to be chunked. Concrete types include CleanedPostDocument, CleanedArticleDocument, and CleanedRepositoryDocument.

EmbeddingDispatcher.dispatch

Parameter | Type | Description
data_model | VectorBaseDocument | A chunk document to be embedded. Concrete types include PostChunk, ArticleChunk, and RepositoryChunk.

Return Values

Dispatcher | Return Type | Description
ChunkingDispatcher | list[VectorBaseDocument] | A list of chunk documents produced from the input cleaned document. Each chunk contains a segment of the original text along with metadata from the parent document.
EmbeddingDispatcher | VectorBaseDocument | An embedded chunk document (EmbeddedPostChunk, EmbeddedArticleChunk, or EmbeddedRepositoryChunk, per the handler summary) that corresponds to the input chunk and has its embedding field populated with a list[float] dense vector generated by the embedding model.

Source Code

ChunkingDispatcher

class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]:
        # Determine which kind of document this is.
        data_category = data_model.get_category()
        # Map each supported category to its chunking handler class.
        chunking_factory = {
            DataCategory.POSTS: PostChunkingHandler,
            DataCategory.ARTICLES: ArticleChunkingHandler,
            DataCategory.REPOSITORIES: RepositoryChunkingHandler,
        }
        chunking_handler = chunking_factory.get(data_category)
        if chunking_handler is None:
            raise ValueError(...)
        # Instantiate the handler before calling chunk(); this also works
        # if chunk() is declared as a static or class method.
        chunks = chunking_handler().chunk(data_model)
        return chunks

EmbeddingDispatcher

class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument:
        # Determine which kind of chunk this is.
        data_category = data_model.get_category()
        # Map each supported category to its embedding handler class.
        embedding_factory = {
            DataCategory.POSTS: PostEmbeddingHandler,
            DataCategory.ARTICLES: ArticleEmbeddingHandler,
            DataCategory.REPOSITORIES: RepositoryEmbeddingHandler,
        }
        embedding_handler = embedding_factory.get(data_category)
        if embedding_handler is None:
            raise ValueError(...)
        # Instantiate the handler before calling embed(); this also works
        # if embed() is declared as a static or class method.
        embedded_chunk = embedding_handler().embed(data_model)
        return embedded_chunk

Import

from llm_engineering.application.preprocessing.dispatchers import ChunkingDispatcher, EmbeddingDispatcher

How It Works

Chunking Flow

  1. Category detection: data_model.get_category() returns the document's DataCategory.
  2. Handler lookup: The chunking_factory dictionary maps categories to chunking handler classes:
    • DataCategory.POSTS maps to PostChunkingHandler
    • DataCategory.ARTICLES maps to ArticleChunkingHandler
    • DataCategory.REPOSITORIES maps to RepositoryChunkingHandler
  3. Validation: If no handler is found, a ValueError is raised.
  4. Chunk execution: The handler's chunk() method splits the cleaned document into a list of chunk documents, each containing a text segment and parent metadata.
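
The four chunking steps can be sketched in isolation. This is a minimal, self-contained illustration and not the repository's code: CleanedDoc, PostChunker, ArticleChunker, dispatch_chunking, the enum values, and the error message are all hypothetical stand-ins, and the splitting rules are deliberately trivial.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(str, Enum):
    # Stand-in for the project's DataCategory enum (values are assumptions).
    POSTS = "posts"
    ARTICLES = "articles"
    REPOSITORIES = "repositories"


@dataclass
class CleanedDoc:
    # Hypothetical minimal cleaned document; the real classes carry more fields.
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


class PostChunker:
    # Hypothetical handler: one chunk per non-empty line.
    def chunk(self, doc: CleanedDoc) -> list[str]:
        return [line.strip() for line in doc.content.splitlines() if line.strip()]


class ArticleChunker:
    # Hypothetical handler: one chunk per blank-line-separated paragraph.
    def chunk(self, doc: CleanedDoc) -> list[str]:
        return [part.strip() for part in doc.content.split("\n\n") if part.strip()]


def dispatch_chunking(doc: CleanedDoc) -> list[str]:
    # Steps 1-2: detect the category and look up the handler class.
    factory = {
        DataCategory.POSTS: PostChunker,
        DataCategory.ARTICLES: ArticleChunker,
    }
    category = doc.get_category()
    handler_cls = factory.get(category)
    # Step 3: fail loudly on a category with no registered handler.
    if handler_cls is None:
        raise ValueError(f"No chunking handler registered for {category!r}")
    # Step 4: instantiate the handler and split the document.
    return handler_cls().chunk(doc)


article = CleanedDoc("First paragraph.\n\nSecond paragraph.", DataCategory.ARTICLES)
chunks = dispatch_chunking(article)  # one document in, a list of chunks out
```

REPOSITORIES is left unregistered here on purpose, so passing a repository document exercises the ValueError branch.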

Embedding Flow

  1. Category detection: As in the chunking flow, data_model.get_category() returns the chunk's DataCategory.
  2. Handler lookup: The embedding_factory dictionary maps categories to embedding handler classes:
    • DataCategory.POSTS maps to PostEmbeddingHandler
    • DataCategory.ARTICLES maps to ArticleEmbeddingHandler
    • DataCategory.REPOSITORIES maps to RepositoryEmbeddingHandler
  3. Validation: If no handler is found, a ValueError is raised.
  4. Embed execution: The handler's embed() method runs the chunk's text through a sentence-transformers model and returns an embedded chunk whose embedding field holds the resulting dense vector.
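
The embedding steps follow the same shape, but with one chunk in and one embedded chunk out. The sketch below is an assumption-laden illustration: Chunk, EmbeddedChunk, ToyEmbeddingHandler, dispatch_embedding, and toy_embed are hypothetical names, and toy_embed is a deterministic placeholder that stands in for a real sentence-transformers model call so the example runs without any model download.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(str, Enum):
    # Stand-in for the project's DataCategory enum (values are assumptions).
    POSTS = "posts"
    ARTICLES = "articles"
    REPOSITORIES = "repositories"


@dataclass
class Chunk:
    # Hypothetical minimal chunk document.
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


@dataclass
class EmbeddedChunk:
    # Hypothetical embedded counterpart of Chunk.
    content: str
    category: DataCategory
    embedding: list[float]


def toy_embed(text: str, dim: int = 4) -> list[float]:
    # Placeholder for a model call: sum character codes into dim buckets,
    # then scale by the text length. Not a meaningful semantic embedding.
    vector = [0.0] * dim
    for index, char in enumerate(text):
        vector[index % dim] += ord(char)
    scale = max(len(text), 1)
    return [value / scale for value in vector]


class ToyEmbeddingHandler:
    # Hypothetical handler shared by two categories to keep the sketch short.
    def embed(self, chunk: Chunk) -> EmbeddedChunk:
        return EmbeddedChunk(chunk.content, chunk.category, toy_embed(chunk.content))


def dispatch_embedding(chunk: Chunk) -> EmbeddedChunk:
    # Steps 1-2: detect the category and look up the handler class.
    factory = {
        DataCategory.POSTS: ToyEmbeddingHandler,
        DataCategory.ARTICLES: ToyEmbeddingHandler,
    }
    category = chunk.get_category()
    handler_cls = factory.get(category)
    # Step 3: fail loudly on a category with no registered handler.
    if handler_cls is None:
        raise ValueError(f"No embedding handler registered for {category!r}")
    # Step 4: one chunk in, one embedded chunk out.
    return handler_cls().embed(chunk)


chunk = Chunk("hello world", DataCategory.POSTS)
embedded = dispatch_embedding(chunk)
```

Swapping toy_embed for a real model would only change the handler body; the dispatch logic stays the same.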

Handler Summary

Chunking Handlers

Handler | Input Type | Output Type | Strategy
PostChunkingHandler | CleanedPostDocument | list[PostChunk] | Sentence-based splitting for short-form content
ArticleChunkingHandler | CleanedArticleDocument | list[ArticleChunk] | Paragraph/section-based splitting for long-form content
RepositoryChunkingHandler | CleanedRepositoryDocument | list[RepositoryChunk] | Logical unit splitting (functions, classes) for code content
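
To make the difference between the first two strategies concrete, here is a rough sketch of sentence-based and paragraph-based splitting using only the standard library. The function names, the max_chars parameter, and the regular expressions are assumptions chosen for illustration; the actual handlers may split differently.

```python
import re


def split_sentences(text: str, max_chars: int = 80) -> list[str]:
    # Split after sentence-ending punctuation, then pack consecutive
    # sentences into chunks of at most max_chars characters. A single
    # sentence longer than max_chars is kept whole as its own chunk.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_paragraphs(text: str) -> list[str]:
    # Treat one or more blank (or whitespace-only) lines as a paragraph break.
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


post_chunks = split_sentences("One. Two! Three?", max_chars=9)
article_chunks = split_paragraphs("Intro text.\n\nBody text.\n \nClosing text.")
```

Sentence packing suits short posts, where a paragraph may be the entire document; paragraph splitting preserves the author's own structure in longer articles.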

Embedding Handlers

Handler | Input Type | Output Type | Model
PostEmbeddingHandler | PostChunk | EmbeddedPostChunk | sentence-transformers
ArticleEmbeddingHandler | ArticleChunk | EmbeddedArticleChunk | sentence-transformers
RepositoryEmbeddingHandler | RepositoryChunk | EmbeddedRepositoryChunk | sentence-transformers

Usage Example

from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)
from llm_engineering.domain.base.vector import VectorBaseDocument

# Start with a cleaned document
cleaned_article = ...  # CleanedArticleDocument instance

# Stage 1: Chunk the cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks from article")

# Stage 2: Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)
    print(f"  Chunk embedded: dim={len(embedded.embedding)}")

# Stage 3: Persist to vector database
VectorBaseDocument.bulk_insert(embedded_chunks)

External Dependencies

Dependency | Purpose
sentence_transformers | Neural embedding model library; used by embedding handlers to convert text chunks into dense vectors
loguru | Structured logging for chunking and embedding progress

Design Notes

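  • Both dispatchers are stateless: dispatch is a static method and keeps no data between calls, so the dispatch step itself can be called from parallel or distributed workers. Whether the handlers it invokes are safe to share across workers (for example, a loaded embedding model) depends on the handlers and is not covered here.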
  • The one-to-many relationship in chunking (one cleaned document produces many chunks) contrasts with the one-to-one relationship in embedding (one chunk produces one embedded chunk). This asymmetry is reflected in the return types: list[VectorBaseDocument] vs. VectorBaseDocument.
  • The factory dictionary pattern used in both dispatchers gives average-case O(1) handler lookup and keeps the set of supported categories explicitly visible in one place in the code.
  • Separation of chunking and embedding into distinct dispatchers follows the Single Responsibility Principle and allows each stage to be independently tested, configured, and potentially parallelized.
  • Adding a new document category requires implementing both a chunking handler (with a chunk() method) and an embedding handler (with an embed() method), then registering both in their respective factory dictionaries.
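
The last note can be illustrated with a small sketch of registering a new category in both factories. Everything here is hypothetical: the TRANSCRIPTS category, both handler classes, the module-level dictionaries, and the missing_registrations helper are invented for the example and do not exist in the source.

```python
from enum import Enum


class DataCategory(str, Enum):
    # Stand-in enum; TRANSCRIPTS is a hypothetical new category.
    ARTICLES = "articles"
    TRANSCRIPTS = "transcripts"


class TranscriptChunkingHandler:
    # Hypothetical handler: one chunk per speaker turn (one turn per line).
    def chunk(self, text: str) -> list[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]


class TranscriptEmbeddingHandler:
    # Hypothetical handler: a length-based placeholder instead of a real model.
    def embed(self, chunk: str) -> list[float]:
        return [float(len(chunk))]


# Stand-ins for the two factory dictionaries; only the new entries are shown.
chunking_factory = {DataCategory.TRANSCRIPTS: TranscriptChunkingHandler}
embedding_factory = {DataCategory.TRANSCRIPTS: TranscriptEmbeddingHandler}


def missing_registrations(category: DataCategory) -> list[str]:
    # Report which stage, if any, has no handler for this category.
    missing = []
    if category not in chunking_factory:
        missing.append("chunking")
    if category not in embedding_factory:
        missing.append("embedding")
    return missing


turns = chunking_factory[DataCategory.TRANSCRIPTS]().chunk("Hi there.\nHello!")
vectors = [embedding_factory[DataCategory.TRANSCRIPTS]().embed(turn) for turn in turns]
```

A check like missing_registrations can catch the case where a category is added to one factory but not the other, which would otherwise surface only as a ValueError at the second stage.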
