Implementation:PacktPublishing LLM Engineers Handbook ChunkingDispatcher And EmbeddingDispatcher
| Type | API Doc |
|---|---|
| API | ChunkingDispatcher.dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument] and EmbeddingDispatcher.dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument |
| Source | llm_engineering/application/preprocessing/dispatchers.py:L51-134 |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Chunking_And_Embedding |
Overview
The ChunkingDispatcher and EmbeddingDispatcher are two complementary dispatcher classes that implement the chunking and embedding stages of the feature engineering pipeline. Both follow the Dispatcher (Factory) pattern, routing documents to category-specific handlers based on the document's DataCategory. Together, they transform cleaned documents into embedded chunks suitable for vector storage and similarity search.
API Signatures
ChunkingDispatcher
class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]: ...
EmbeddingDispatcher
class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument: ...
Parameters
ChunkingDispatcher.dispatch
| Parameter | Type | Description |
|---|---|---|
| data_model | VectorBaseDocument | A cleaned document to be chunked. Concrete types include CleanedPostDocument, CleanedArticleDocument, and CleanedRepositoryDocument. |
EmbeddingDispatcher.dispatch
| Parameter | Type | Description |
|---|---|---|
| data_model | VectorBaseDocument | A chunk document to be embedded. Concrete types include PostChunk, ArticleChunk, and RepositoryChunk. |
Return Values
| Dispatcher | Return Type | Description |
|---|---|---|
| ChunkingDispatcher | list[VectorBaseDocument] | A list of chunk documents produced from the input cleaned document. Each chunk contains a segment of the original text along with metadata from the parent document. |
| EmbeddingDispatcher | VectorBaseDocument | The embedded counterpart of the input chunk (EmbeddedPostChunk, EmbeddedArticleChunk, or EmbeddedRepositoryChunk), with its embedding field populated by a list[float] dense vector from the embedding model. |
Source Code
ChunkingDispatcher
class ChunkingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> list[VectorBaseDocument]:
        data_category = data_model.get_category()
        chunking_factory = {
            DataCategory.POSTS: PostChunkingHandler,
            DataCategory.ARTICLES: ArticleChunkingHandler,
            DataCategory.REPOSITORIES: RepositoryChunkingHandler,
        }
        chunking_handler = chunking_factory.get(data_category)
        if chunking_handler is None:
            raise ValueError(...)
        chunks = chunking_handler.chunk(data_model)
        return chunks
EmbeddingDispatcher
class EmbeddingDispatcher:
    @staticmethod
    def dispatch(data_model: VectorBaseDocument) -> VectorBaseDocument:
        data_category = data_model.get_category()
        embedding_factory = {
            DataCategory.POSTS: PostEmbeddingHandler,
            DataCategory.ARTICLES: ArticleEmbeddingHandler,
            DataCategory.REPOSITORIES: RepositoryEmbeddingHandler,
        }
        embedding_handler = embedding_factory.get(data_category)
        if embedding_handler is None:
            raise ValueError(...)
        embedded_chunk = embedding_handler.embed(data_model)
        return embedded_chunk
Import
from llm_engineering.application.preprocessing.dispatchers import ChunkingDispatcher, EmbeddingDispatcher
How It Works
Chunking Flow
- Category detection: data_model.get_category() returns the document's DataCategory.
- Handler lookup: The chunking_factory dictionary maps categories to chunking handler classes:
  - DataCategory.POSTS maps to PostChunkingHandler
  - DataCategory.ARTICLES maps to ArticleChunkingHandler
  - DataCategory.REPOSITORIES maps to RepositoryChunkingHandler
- Validation: If no handler is found, a ValueError is raised.
- Chunk execution: The handler's chunk() method splits the cleaned document into a list of chunk documents, each containing a text segment and parent metadata.
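The four steps above can be sketched without the project installed. This is a minimal illustration, not the repository's code: DataCategory, CleanedDoc, FixedWidthChunker, and dispatch_chunking are hypothetical stand-ins, and the error message is invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(str, Enum):  # stand-in for the project's enum
    POSTS = "posts"
    ARTICLES = "articles"


@dataclass
class CleanedDoc:  # hypothetical stand-in for a cleaned document
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


class FixedWidthChunker:  # hypothetical handler, splits by character count
    @staticmethod
    def chunk(doc: CleanedDoc, width: int = 20) -> list[str]:
        return [doc.content[i:i + width] for i in range(0, len(doc.content), width)]


def dispatch_chunking(doc: CleanedDoc) -> list[str]:
    # 1. Category detection
    category = doc.get_category()
    # 2. Handler lookup (only POSTS is registered in this sketch)
    factory = {DataCategory.POSTS: FixedWidthChunker}
    handler = factory.get(category)
    # 3. Validation
    if handler is None:
        raise ValueError(f"No chunking handler registered for {category!r}")
    # 4. Chunk execution: one document in, many chunks out
    return handler.chunk(doc)
```

Calling dispatch_chunking on a POSTS document returns a list of text segments, while an ARTICLES document raises ValueError because no handler is registered for it in this sketch.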
Embedding Flow
- Category detection: As in chunking, data_model.get_category() determines the document category.
- Handler lookup: The embedding_factory dictionary maps categories to embedding handler classes:
  - DataCategory.POSTS maps to PostEmbeddingHandler
  - DataCategory.ARTICLES maps to ArticleEmbeddingHandler
  - DataCategory.REPOSITORIES maps to RepositoryEmbeddingHandler
- Validation: If no handler is found, a ValueError is raised.
- Embed execution: The handler's embed() method runs the chunk's text through a sentence-transformers model and populates the embedding field with the resulting dense vector.
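The embed step can be illustrated with a toy handler. This sketch assumes nothing about the real handlers beyond the one-chunk-in, one-embedded-chunk-out shape: Chunk, EmbeddedChunk, ToyEmbeddingHandler, and toy_embed are hypothetical names, and toy_embed is a hash-based placeholder used only so the example runs without downloading a model. It carries no semantic meaning.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Chunk:  # hypothetical stand-in for a chunk document
    content: str


@dataclass
class EmbeddedChunk:  # hypothetical stand-in for an embedded chunk document
    content: str
    embedding: list[float] = field(default_factory=list)


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for a real embedding model: deterministic, not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    return [byte / 255.0 for byte in digest[:dim]]


class ToyEmbeddingHandler:
    @staticmethod
    def embed(chunk: Chunk) -> EmbeddedChunk:
        # One chunk in, one embedded chunk out, with the embedding field filled
        return EmbeddedChunk(content=chunk.content, embedding=toy_embed(chunk.content))
```

In a real handler, the call to toy_embed would be replaced by the embedding model's encoding step.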
Handler Summary
Chunking Handlers
| Handler | Input Type | Output Type | Strategy |
|---|---|---|---|
| PostChunkingHandler | CleanedPostDocument | list[PostChunk] | Sentence-based splitting for short-form content |
| ArticleChunkingHandler | CleanedArticleDocument | list[ArticleChunk] | Paragraph/section-based splitting for long-form content |
| RepositoryChunkingHandler | CleanedRepositoryDocument | list[RepositoryChunk] | Logical unit splitting (functions, classes) for code content |
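To make the first strategy in the table concrete, here is one possible sentence-based splitter. It is an illustrative sketch only: the function name split_sentences, the max_chars parameter, and the regex-based sentence boundary are assumptions, not the handler's actual implementation.

```python
import re


def split_sentences(text: str, max_chars: int = 80) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split after ., ! or ? followed by whitespace; drop empty pieces
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps each chunk readable on its own, which tends to matter more for short-form content than hitting an exact length.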
Embedding Handlers
| Handler | Input Type | Output Type | Model |
|---|---|---|---|
| PostEmbeddingHandler | PostChunk | EmbeddedPostChunk | sentence-transformers |
| ArticleEmbeddingHandler | ArticleChunk | EmbeddedArticleChunk | sentence-transformers |
| RepositoryEmbeddingHandler | RepositoryChunk | EmbeddedRepositoryChunk | sentence-transformers |
Usage Example
from llm_engineering.application.preprocessing.dispatchers import (
    ChunkingDispatcher,
    EmbeddingDispatcher,
)
from llm_engineering.domain.base.vector import VectorBaseDocument

# Start with a cleaned document
cleaned_article = ...  # CleanedArticleDocument instance

# Stage 1: Chunk the cleaned document
chunks = ChunkingDispatcher.dispatch(cleaned_article)
print(f"Produced {len(chunks)} chunks from article")

# Stage 2: Embed each chunk
embedded_chunks = []
for chunk in chunks:
    embedded = EmbeddingDispatcher.dispatch(chunk)
    embedded_chunks.append(embedded)
    print(f"  Chunk embedded: dim={len(embedded.embedding)}")

# Stage 3: Persist to vector database
VectorBaseDocument.bulk_insert(embedded_chunks)
External Dependencies
| Dependency | Purpose |
|---|---|
| sentence_transformers | Neural embedding model library; used by embedding handlers to convert text chunks into dense vectors |
| loguru | Structured logging for chunking and embedding progress |
Design Notes
- Both dispatchers are stateless (static methods), making them safe for use in parallel or distributed pipeline execution.
- The one-to-many relationship in chunking (one cleaned document produces many chunks) contrasts with the one-to-one relationship in embedding (one chunk produces one embedded chunk). This asymmetry is reflected in the return types: list[VectorBaseDocument] vs. VectorBaseDocument.
- The factory dictionary pattern used in both dispatchers ensures O(1) handler lookup and makes the set of supported categories explicitly visible in the code.
- Separation of chunking and embedding into distinct dispatchers follows the Single Responsibility Principle and allows each stage to be independently tested, configured, and potentially parallelized.
- Adding a new document category requires implementing both a chunking handler (with a chunk() method) and an embedding handler (with an embed() method), then registering both in their respective factory dictionaries.
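The last note can be sketched as follows for the chunking side; an embedding handler would be registered in its own factory dictionary in the same way. Everything here is hypothetical: the TRANSCRIPTS category, both handler classes, CHUNKING_FACTORY, and the dispatch function are stand-ins that operate on plain strings rather than the project's document models.

```python
from enum import Enum


class DataCategory(str, Enum):  # stand-in for the project's enum
    POSTS = "posts"
    TRANSCRIPTS = "transcripts"  # hypothetical new category


class ParagraphChunkingHandler:  # hypothetical existing handler
    @staticmethod
    def chunk(text: str) -> list[str]:
        return text.split("\n\n")


class TranscriptChunkingHandler:  # hypothetical new handler
    @staticmethod
    def chunk(text: str) -> list[str]:
        # One chunk per non-empty line (for example, one speaker turn)
        return [line for line in text.splitlines() if line.strip()]


CHUNKING_FACTORY = {
    DataCategory.POSTS: ParagraphChunkingHandler,
    # Registering the new handler is the only change the dispatcher needs
    DataCategory.TRANSCRIPTS: TranscriptChunkingHandler,
}


def dispatch(category: DataCategory, text: str) -> list[str]:
    handler = CHUNKING_FACTORY.get(category)
    if handler is None:
        raise ValueError(f"Unsupported category: {category!r}")
    return handler.chunk(text)
```

Because the dispatch logic never names a specific handler, existing categories are unaffected when a new entry is added to the dictionary.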