Implementation:PacktPublishing LLM Engineers Handbook ChunkingDispatcher And EmbeddingDispatcher

Type	API Doc
API	`ChunkingDispatcher.dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]` and `EmbeddingDispatcher.dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument`
Source	llm_engineering/application/preprocessing/dispatchers.py:L51-134
Repository	PacktPublishing/LLM-Engineers-Handbook
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_Chunking_And_Embedding

Overview

The ChunkingDispatcher and EmbeddingDispatcher are two complementary dispatcher classes that implement the chunking and embedding stages of the feature engineering pipeline. Both follow the Dispatcher (Factory) pattern, routing documents to category-specific handlers based on the document's DataCategory. Together, they transform cleaned documents into embedded chunks suitable for vector storage and similarity search.

API Signatures

ChunkingDispatcher

class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]: ...

EmbeddingDispatcher

class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument: ...

Parameters

ChunkingDispatcher.dispatch

Parameter	Type	Description
`data_model`	`VectorBaseDocument`	A cleaned document to be chunked. Concrete types include `CleanedPostDocument`, `CleanedArticleDocument`, and `CleanedRepositoryDocument`.

EmbeddingDispatcher.dispatch

Parameter	Type	Description
`data_model`	`VectorBaseDocument`	A chunk document to be embedded. Concrete types include `PostChunk`, `ArticleChunk`, and `RepositoryChunk`.

Return Values

Dispatcher	Return Type	Description
`ChunkingDispatcher`	`list[VectorBaseDocument]`	A list of chunk documents produced from the input cleaned document. Each chunk contains a segment of the original text along with metadata from the parent document.
`EmbeddingDispatcher`	`VectorBaseDocument`	The embedded counterpart of the input chunk (for example, an `EmbeddedArticleChunk` for an `ArticleChunk`), with its `embedding` field populated by a `list[float]` dense vector generated by the embedding model.

Source Code

ChunkingDispatcher

class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]:
        data_category = data_model.get_category()
        chunking_factory = {
            DataCategory.POSTS: PostChunkingHandler,
            DataCategory.ARTICLES: ArticleChunkingHandler,
            DataCategory.REPOSITORIES: RepositoryChunkingHandler,
        }
        chunking_handler = chunking_factory.get(data_category)
        if chunking_handler is None:
            raise ValueError(...)
        chunks = chunking_handler.chunk(data_model)
        return chunks

EmbeddingDispatcher

class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument:
        data_category = data_model.get_category()
        embedding_factory = {
            DataCategory.POSTS: PostEmbeddingHandler,
            DataCategory.ARTICLES: ArticleEmbeddingHandler,
            DataCategory.REPOSITORIES: RepositoryEmbeddingHandler,
        }
        embedding_handler = embedding_factory.get(data_category)
        if embedding_handler is None:
            raise ValueError(...)
        embedded_chunk = embedding_handler.embed(data_model)
        return embedded_chunk

Import

from llm_engineering.application.preprocessing.dispatchers import ChunkingDispatcher, EmbeddingDispatcher

How It Works

Chunking Flow

1. Category detection: data_model.get_category() returns the document's DataCategory.
2. Handler lookup: the chunking_factory dictionary maps categories to chunking handler classes:
   - DataCategory.POSTS maps to PostChunkingHandler
   - DataCategory.ARTICLES maps to ArticleChunkingHandler
   - DataCategory.REPOSITORIES maps to RepositoryChunkingHandler
3. Validation: if no handler is found, a ValueError is raised.
4. Chunk execution: the handler's chunk() method splits the cleaned document into a list of chunk documents, each containing a text segment and parent metadata.
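The four steps of the chunking flow can be sketched in a self-contained form. This is an illustrative sketch, not the repository's code: CleanedDoc, Chunk, FixedWidthChunkingHandler, and dispatch_chunking are hypothetical stand-ins, the DataCategory enum is reduced to two members, and the fixed-width split is a toy strategy chosen only to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(str, Enum):
    # Stand-in for the project's DataCategory, reduced to two members.
    POSTS = "posts"
    ARTICLES = "articles"


@dataclass
class CleanedDoc:
    # Hypothetical stand-in for a cleaned document.
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


@dataclass
class Chunk:
    # Hypothetical stand-in for a chunk document.
    content: str
    category: DataCategory


class FixedWidthChunkingHandler:
    """Toy handler: splits text into fixed-width character windows."""

    width = 10  # assumed value, chosen only for the demo

    @classmethod
    def chunk(cls, doc: CleanedDoc) -> list[Chunk]:
        text = doc.content
        return [
            Chunk(content=text[i : i + cls.width], category=doc.category)
            for i in range(0, len(text), cls.width)
        ]


def dispatch_chunking(doc: CleanedDoc) -> list[Chunk]:
    # 1. Category detection
    category = doc.get_category()
    # 2. Handler lookup (only POSTS is registered in this sketch)
    factory = {DataCategory.POSTS: FixedWidthChunkingHandler}
    handler = factory.get(category)
    # 3. Validation
    if handler is None:
        raise ValueError(f"No chunking handler registered for {category!r}")
    # 4. Chunk execution
    return handler.chunk(doc)


post = CleanedDoc(content="abcdefghijklmnopqrstuvwxy", category=DataCategory.POSTS)
chunks = dispatch_chunking(post)
print([c.content for c in chunks])  # ['abcdefghij', 'klmnopqrst', 'uvwxy']
```

Because ARTICLES has no entry in this sketch's dictionary, dispatching an ARTICLES document takes the validation branch and raises ValueError.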

Embedding Flow

1. Category detection: as in chunking, data_model.get_category() returns the document's DataCategory.
2. Handler lookup: the embedding_factory dictionary maps categories to embedding handler classes:
   - DataCategory.POSTS maps to PostEmbeddingHandler
   - DataCategory.ARTICLES maps to ArticleEmbeddingHandler
   - DataCategory.REPOSITORIES maps to RepositoryEmbeddingHandler
3. Validation: if no handler is found, a ValueError is raised.
4. Embed execution: the handler's embed() method runs the chunk's text through a sentence-transformers model and populates the embedding field with the resulting dense vector.
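The embedding flow has the same shape, with a one-to-one result instead of a list. The sketch below is a minimal illustration under stated assumptions: Chunk, EmbeddedChunk, ToyEmbeddingHandler, dispatch_embedding, and toy_embed are hypothetical names, and toy_embed is a deterministic placeholder that stands in for a sentence-transformers model so the example runs without any third-party dependency.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(str, Enum):
    # Stand-in for the project's DataCategory, reduced to two members.
    ARTICLES = "articles"
    REPOSITORIES = "repositories"


@dataclass
class Chunk:
    # Hypothetical stand-in for a chunk document.
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


@dataclass
class EmbeddedChunk:
    # Hypothetical stand-in for an embedded chunk document.
    content: str
    category: DataCategory
    embedding: list[float]


def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Placeholder for a real model: buckets character codes into `dim`
    slots and normalizes them so the components sum to 1."""
    vector = [0.0] * dim
    for i, ch in enumerate(text):
        vector[i % dim] += ord(ch)
    total = sum(vector) or 1.0  # avoid division by zero on empty text
    return [v / total for v in vector]


class ToyEmbeddingHandler:
    """Toy handler: a real one would call an embedding model here."""

    @classmethod
    def embed(cls, chunk: Chunk) -> EmbeddedChunk:
        return EmbeddedChunk(
            content=chunk.content,
            category=chunk.category,
            embedding=toy_embed(chunk.content),
        )


def dispatch_embedding(chunk: Chunk) -> EmbeddedChunk:
    # 1. Category detection
    category = chunk.get_category()
    # 2. Handler lookup (only ARTICLES is registered in this sketch)
    factory = {DataCategory.ARTICLES: ToyEmbeddingHandler}
    handler = factory.get(category)
    # 3. Validation
    if handler is None:
        raise ValueError(f"No embedding handler registered for {category!r}")
    # 4. Embed execution: exactly one embedded chunk per input chunk
    return handler.embed(chunk)


embedded = dispatch_embedding(Chunk(content="ab", category=DataCategory.ARTICLES))
print(len(embedded.embedding))  # 4
```

The placeholder vector carries no semantic meaning. It only demonstrates where the model call sits in the flow and that the result keeps the chunk's content alongside the new embedding field.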

Handler Summary

Chunking Handlers

Handler	Input Type	Output Type	Strategy
`PostChunkingHandler`	`CleanedPostDocument`	`list[PostChunk]`	Sentence-based splitting for short-form content
`ArticleChunkingHandler`	`CleanedArticleDocument`	`list[ArticleChunk]`	Paragraph/section-based splitting for long-form content
`RepositoryChunkingHandler`	`CleanedRepositoryDocument`	`list[RepositoryChunk]`	Logical unit splitting (functions, classes) for code content

Embedding Handlers

Handler	Input Type	Output Type	Model
`PostEmbeddingHandler`	`PostChunk`	`EmbeddedPostChunk`	sentence-transformers
`ArticleEmbeddingHandler`	`ArticleChunk`	`EmbeddedArticleChunk`	sentence-transformers
`RepositoryEmbeddingHandler`	`RepositoryChunk`	`EmbeddedRepositoryChunk`	sentence-transformers

Usage Example

from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)
from llm_engineering.domain.base.vector import VectorBaseDocument

# Start with a cleaned document
cleaned_article = ...  # CleanedArticleDocument instance

# Stage 1: Chunk the cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks from article")

# Stage 2: Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)
    print(f"  Chunk embedded: dim={len(embedded.embedding)}")

# Stage 3: Persist to vector database
VectorBaseDocument.bulk_insert(embedded_chunks)

External Dependencies

Dependency	Purpose
sentence_transformers	Neural embedding model library; used by embedding handlers to convert text chunks into dense vectors
loguru	Structured logging for chunking and embedding progress

Design Notes

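Both dispatchers expose only static methods and keep no state of their own, so the dispatch step itself can be called from parallel or distributed pipeline workers. Whether a full run is safe to parallelize also depends on the handlers they delegate to.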
The one-to-many relationship in chunking (one cleaned document produces many chunks) contrasts with the one-to-one relationship in embedding (one chunk produces one embedded chunk). This asymmetry is reflected in the return types: list[VectorBaseDocument] vs. VectorBaseDocument.
The factory dictionary pattern used in both dispatchers gives constant-time (average-case O(1)) handler lookup and keeps the set of supported categories explicitly visible in one place in the code.
Separation of chunking and embedding into distinct dispatchers follows the Single Responsibility Principle and allows each stage to be independently tested, configured, and potentially parallelized.
Adding a new document category requires implementing both a chunking handler (with a chunk() method) and an embedding handler (with an embed() method), then registering both in their respective factory dictionaries.
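The last note, on adding a category, can be illustrated with a small sketch. Everything here is hypothetical: DataCategory.PAPERS, PaperChunkingHandler, PaperEmbeddingHandler, and the unpaired_categories helper do not exist in the repository, and plain strings stand in for document objects. The helper shows one possible way to catch a category that was registered in only one of the two factory dictionaries.

```python
from enum import Enum


class DataCategory(str, Enum):
    POSTS = "posts"
    PAPERS = "papers"  # hypothetical new category


class PaperChunkingHandler:
    """Hypothetical handler: one chunk per blank-line-separated paragraph."""

    @staticmethod
    def chunk(text: str) -> list[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class PaperEmbeddingHandler:
    """Hypothetical handler: a placeholder one-dimensional 'embedding'."""

    @staticmethod
    def embed(chunk: str) -> list[float]:
        return [float(len(chunk))]


# Register the new category in BOTH factory dictionaries.
chunking_factory = {DataCategory.PAPERS: PaperChunkingHandler}
embedding_factory = {DataCategory.PAPERS: PaperEmbeddingHandler}


def unpaired_categories(chunking: dict, embedding: dict) -> set:
    """Return categories present in one factory but missing from the other."""
    return set(chunking) ^ set(embedding)


# Both stages now resolve a handler for the new category.
paper = "First paragraph.\n\nSecond paragraph."
chunks = chunking_factory[DataCategory.PAPERS].chunk(paper)
vectors = [embedding_factory[DataCategory.PAPERS].embed(c) for c in chunks]

print(len(chunks), len(vectors))  # 2 2
print(unpaired_categories(chunking_factory, embedding_factory))  # set()
```

A check like unpaired_categories could run in a unit test, so that a category registered for chunking but not for embedding is reported before the pipeline reaches the embedding stage and raises ValueError.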
