Implementation:PacktPublishing LLM Engineers Handbook CleaningDispatcher Dispatch
| Type | API Doc |
|---|---|
| API | CleaningDispatcher.dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument |
| Source | llm_engineering/application/preprocessing/dispatchers.py:L18-48 |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Document_Cleaning |
Overview
The CleaningDispatcher.dispatch static method routes a raw document to the appropriate category-specific cleaning handler, returning a cleaned document ready for chunking and embedding. It implements the Dispatcher (Factory) pattern, using the document's category to look up the correct handler from an internal registry.
API Signature
@staticmethod
def dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument:
Parameters
| Parameter | Type | Description |
|---|---|---|
| data_model | NoSQLBaseDocument | A raw document retrieved from MongoDB. Concrete types include ArticleDocument, PostDocument, and RepositoryDocument. The document must implement get_category() to identify its type. |
Return Value
| Type | Description |
|---|---|
| VectorBaseDocument | A cleaned document suitable for storage in the vector database. Concrete types include CleanedPostDocument, CleanedArticleDocument, and CleanedRepositoryDocument. The cleaned document preserves the semantic content while removing noise and artifacts. |
Source Code
class CleaningDispatcher:
    @staticmethod
    def dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument:
        data_category = data_model.get_category()
        cleaning_factory = {
            DataCategory.POSTS: PostCleaningHandler,
            DataCategory.ARTICLES: ArticleCleaningHandler,
            DataCategory.REPOSITORIES: RepositoryCleaningHandler,
        }
        cleaning_handler = cleaning_factory.get(data_category)
        if cleaning_handler is None:
            raise ValueError(f"No cleaning handler found for category: {data_category}")
        clean_model = cleaning_handler.clean(data_model)
        return clean_model
Import
from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
How It Works
- Category detection: The method calls data_model.get_category() to determine the document's DataCategory (one of POSTS, ARTICLES, or REPOSITORIES).
- Handler lookup: A factory dictionary maps each DataCategory to its corresponding cleaning handler class:
  - DataCategory.POSTS maps to PostCleaningHandler
  - DataCategory.ARTICLES maps to ArticleCleaningHandler
  - DataCategory.REPOSITORIES maps to RepositoryCleaningHandler
- Validation: If no handler is found for the category, a ValueError is raised with a message that names the unsupported category.
- Cleaning execution: The resolved handler's clean() method is invoked with the raw document, returning a cleaned VectorBaseDocument.
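The four steps above can be traced in a minimal, self-contained sketch. The classes below (RawPost, RawQuery, and the reduced DataCategory enum with a hypothetical QUERIES member) are stand-ins invented for illustration, not the project's real types, and the whitespace-collapsing clean() is a placeholder rather than the repository's cleaning logic.

```python
from enum import Enum


class DataCategory(Enum):
    # Stand-in for the project's DataCategory; the values are assumptions.
    POSTS = "posts"
    QUERIES = "queries"  # hypothetical category with no handler registered


class RawPost:
    """Stand-in for a raw post document."""

    def __init__(self, content: str) -> None:
        self.content = content

    def get_category(self) -> DataCategory:
        return DataCategory.POSTS


class RawQuery:
    """Hypothetical document type that has no cleaning handler."""

    def get_category(self) -> DataCategory:
        return DataCategory.QUERIES


class PostCleaningHandler:
    @staticmethod
    def clean(data_model: RawPost) -> dict:
        # Placeholder cleaning: collapse runs of whitespace.
        return {"content": " ".join(data_model.content.split())}


def dispatch(data_model):
    # 1. Category detection
    data_category = data_model.get_category()
    # 2. Handler lookup
    cleaning_factory = {DataCategory.POSTS: PostCleaningHandler}
    cleaning_handler = cleaning_factory.get(data_category)
    # 3. Validation
    if cleaning_handler is None:
        raise ValueError(f"No cleaning handler found for category: {data_category}")
    # 4. Cleaning execution
    return cleaning_handler.clean(data_model)


cleaned = dispatch(RawPost("hello   world\n"))

try:
    dispatch(RawQuery())
except ValueError as exc:
    error_message = str(exc)
```

Because the lookup uses dict.get(), an unregistered category surfaces as an explicit ValueError rather than a KeyError, which lets callers catch one well-defined failure when they process mixed batches of documents.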
Cleaning Handlers
Each handler implements a clean() static method with category-specific logic:
| Handler | Input Type | Output Type | Cleaning Focus |
|---|---|---|---|
| PostCleaningHandler | PostDocument | CleanedPostDocument | Social media formatting, hashtags, mentions, platform artifacts |
| ArticleCleaningHandler | ArticleDocument | CleanedArticleDocument | HTML stripping, navigation removal, whitespace normalization |
| RepositoryCleaningHandler | RepositoryDocument | CleanedRepositoryDocument | Build artifacts, binary references, non-code content |
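This page does not show the handlers' bodies, so the following is only a sketch of what a clean() static method with an article-style focus might look like. The dataclasses and their fields (content, link) are assumptions made for the example, and the regex-based tag stripping is a simplification, not the repository's implementation.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class ArticleDocument:
    # Stand-in for the raw document; the field names are assumptions.
    content: str
    link: str


@dataclass
class CleanedArticleDocument:
    # Stand-in for the cleaned document; the field names are assumptions.
    content: str
    link: str


_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


class ArticleCleaningHandler:
    @staticmethod
    def clean(data_model: ArticleDocument) -> CleanedArticleDocument:
        text = _TAG_RE.sub(" ", data_model.content)   # drop HTML tags
        text = html.unescape(text)                     # decode entities such as &amp;
        text = _WHITESPACE_RE.sub(" ", text).strip()   # normalize whitespace
        # Metadata is carried over unchanged; only the text is cleaned.
        return CleanedArticleDocument(content=text, link=data_model.link)


raw = ArticleDocument(
    content="<p>Vector DBs  &amp; <b>RAG</b></p>\n",
    link="https://example.com/article",
)
cleaned_article = ArticleCleaningHandler.clean(raw)
```

Keeping clean() static means the dispatcher can call it on the handler class directly, without instantiating anything.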
Usage Example
from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
from llm_engineering.domain.documents import ArticleDocument, PostDocument

# author_uuid is assumed to hold the ID of an author with stored documents;
# indexing with [0] raises IndexError if the query returns no documents.

# Clean an article
raw_article = ArticleDocument.bulk_find(author_id=author_uuid)[0]
cleaned_article = CleaningDispatcher.dispatch(raw_article)
print(type(cleaned_article).__name__)  # CleanedArticleDocument

# Clean a post
raw_post = PostDocument.bulk_find(author_id=author_uuid)[0]
cleaned_post = CleaningDispatcher.dispatch(raw_post)
print(type(cleaned_post).__name__)  # CleanedPostDocument
External Dependencies
| Dependency | Purpose |
|---|---|
| loguru | Structured logging for the cleaning process |
Design Notes
- The method is a staticmethod since it does not depend on instance or class state; it operates purely on the input document.
- The factory dictionary is constructed inline within the method. This keeps the mapping co-located with the dispatch logic, making it easy to see all supported categories at a glance.
- The type transformation from NoSQLBaseDocument to VectorBaseDocument marks a domain boundary crossing: the document transitions from the MongoDB storage domain to the Qdrant vector store domain.
- Adding support for a new document category requires only creating a new cleaning handler class and adding an entry to the cleaning_factory dictionary.
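The extension path in the last note can be illustrated with a hypothetical category. Everything named below (DataCategory.VIDEOS, VideoDocument, VideoCleaningHandler, and the bracketed-cue cleaning rule) is invented for the example and does not exist in the repository; the dispatch function mirrors the structure of the method shown above.

```python
import re
from enum import Enum


class DataCategory(Enum):
    # Reduced stand-in enum; VIDEOS is a hypothetical new category.
    POSTS = "posts"
    VIDEOS = "videos"


class VideoDocument:
    """Hypothetical raw document holding a video transcript."""

    def __init__(self, transcript: str) -> None:
        self.transcript = transcript

    def get_category(self) -> DataCategory:
        return DataCategory.VIDEOS


class VideoCleaningHandler:
    """Hypothetical handler: the one new class the extension needs."""

    @staticmethod
    def clean(data_model: VideoDocument) -> str:
        # Remove bracketed cues such as [music], then collapse whitespace.
        text = re.sub(r"\[[^\]]*\]", " ", data_model.transcript)
        return " ".join(text.split())


def dispatch(data_model):
    cleaning_factory = {
        # Existing entries would stay as they are.
        DataCategory.VIDEOS: VideoCleaningHandler,  # the one new entry
    }
    cleaning_handler = cleaning_factory.get(data_model.get_category())
    if cleaning_handler is None:
        raise ValueError(
            f"No cleaning handler found for category: {data_model.get_category()}"
        )
    return cleaning_handler.clean(data_model)


cleaned_transcript = dispatch(VideoDocument("[music] Welcome  back [applause]"))
```

The dispatch logic itself is unchanged. A trade-off of the inline dictionary is that each new category means editing the dispatcher module, which is usually acceptable for a small, fixed set of categories.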