Implementation:PacktPublishing LLM Engineers Handbook CleaningDispatcher Dispatch

From Leeroopedia


Type: API Doc
API: CleaningDispatcher.dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument
Source: llm_engineering/application/preprocessing/dispatchers.py:L18-48
Repository: PacktPublishing/LLM-Engineers-Handbook
Implements: Principle:PacktPublishing_LLM_Engineers_Handbook_Document_Cleaning

Overview

The CleaningDispatcher.dispatch static method routes a raw document to its category-specific cleaning handler and returns a cleaned document ready for chunking and embedding. It implements the Dispatcher (Factory) pattern: the document's category is the key used to look up the handler class in a dictionary defined inside the method.

API Signature

@staticmethod
def dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument:

Parameters

Parameter | Type | Description
data_model | NoSQLBaseDocument | A raw document retrieved from MongoDB. Concrete types include ArticleDocument, PostDocument, and RepositoryDocument. The document must implement get_category() to identify its type.

Return Value

Type | Description
VectorBaseDocument | A cleaned document suitable for storage in the vector database. Concrete types include CleanedPostDocument, CleanedArticleDocument, and CleanedRepositoryDocument. The cleaned document preserves the semantic content while removing noise and artifacts.

Source Code

class CleaningDispatcher:
    @staticmethod
    def dispatch(data_model: NoSQLBaseDocument) -> VectorBaseDocument:
        data_category = data_model.get_category()
        cleaning_factory = {
            DataCategory.POSTS: PostCleaningHandler,
            DataCategory.ARTICLES: ArticleCleaningHandler,
            DataCategory.REPOSITORIES: RepositoryCleaningHandler,
        }
        cleaning_handler = cleaning_factory.get(data_category)
        if cleaning_handler is None:
            raise ValueError(f"No cleaning handler found for category: {data_category}")
        clean_model = cleaning_handler.clean(data_model)
        return clean_model

Import

from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher

How It Works

  1. Category detection — The method calls data_model.get_category() to determine the document's DataCategory (one of POSTS, ARTICLES, or REPOSITORIES).
  2. Handler lookup — A factory dictionary maps each DataCategory to its corresponding cleaning handler class:
    • DataCategory.POSTS maps to PostCleaningHandler
    • DataCategory.ARTICLES maps to ArticleCleaningHandler
    • DataCategory.REPOSITORIES maps to RepositoryCleaningHandler
  3. Validation — If no handler is found for the category, a ValueError is raised with a descriptive error message.
  4. Cleaning execution — The resolved handler's clean() method is invoked with the raw document, returning a cleaned VectorBaseDocument.
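
The four steps above can be sketched in a self-contained form. This is a minimal illustration, not the project's code: RawDocument, CleanedDocument, the enum values, and the UNKNOWN category are stand-ins invented here so the example runs without MongoDB or the llm_engineering package.

```python
from dataclasses import dataclass
from enum import Enum


class DataCategory(Enum):
    # Stand-in for the project's DataCategory; the values are assumptions.
    POSTS = "posts"
    UNKNOWN = "unknown"  # hypothetical category with no registered handler


@dataclass
class RawDocument:
    # Stand-in for a NoSQLBaseDocument subclass.
    content: str
    category: DataCategory

    def get_category(self) -> DataCategory:
        return self.category


@dataclass
class CleanedDocument:
    # Stand-in for a VectorBaseDocument subclass.
    content: str
    category: DataCategory


class PostCleaningHandler:
    @staticmethod
    def clean(data_model: RawDocument) -> CleanedDocument:
        # Placeholder cleaning logic: collapse runs of whitespace.
        text = " ".join(data_model.content.split())
        return CleanedDocument(text, data_model.category)


def dispatch(data_model: RawDocument) -> CleanedDocument:
    data_category = data_model.get_category()  # 1. category detection
    cleaning_factory = {DataCategory.POSTS: PostCleaningHandler}
    cleaning_handler = cleaning_factory.get(data_category)  # 2. handler lookup
    if cleaning_handler is None:  # 3. validation
        raise ValueError(f"No cleaning handler found for category: {data_category}")
    return cleaning_handler.clean(data_model)  # 4. cleaning execution


cleaned = dispatch(RawDocument("  hello   world \n", DataCategory.POSTS))
print(cleaned.content)  # hello world

try:
    dispatch(RawDocument("no handler for this", DataCategory.UNKNOWN))
except ValueError as exc:
    print(exc)
```

Because dict.get returns None for a missing key, the explicit check turns an unsupported category into a ValueError that names the category, rather than an AttributeError on None.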

Cleaning Handlers

Each handler implements a clean() static method with category-specific logic:

Handler | Input Type | Output Type | Cleaning Focus
PostCleaningHandler | PostDocument | CleanedPostDocument | Social media formatting, hashtags, mentions, platform artifacts
ArticleCleaningHandler | ArticleDocument | CleanedArticleDocument | HTML stripping, navigation removal, whitespace normalization
RepositoryCleaningHandler | RepositoryDocument | CleanedRepositoryDocument | Build artifacts, binary references, non-code content
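
To make the "Cleaning Focus" column concrete, here is a rough sketch of what HTML stripping and whitespace normalization might look like for article text. The helper strip_html_and_normalize and the class ArticleCleaningSketch are hypothetical names introduced for illustration; the project's actual ArticleCleaningHandler may clean text differently, and this sketch works on plain strings rather than document models. It does not attempt navigation removal.

```python
import html
import re


def strip_html_and_normalize(text: str) -> str:
    """Hypothetical helper: drop tags, unescape entities, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    unescaped = html.unescape(without_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


class ArticleCleaningSketch:
    """Illustrative stand-in, not the project's ArticleCleaningHandler."""

    @staticmethod
    def clean(content: str) -> str:
        return strip_html_and_normalize(content)


raw = "<p>Chunking &amp; embedding\n\n come   next.</p>"
print(ArticleCleaningSketch.clean(raw))  # Chunking & embedding come next.
```

A regular expression is enough for this toy input; real HTML with scripts, comments, or malformed tags would usually call for a proper parser.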

Usage Example

from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
from llm_engineering.domain.documents import ArticleDocument, PostDocument

# author_uuid is assumed to be defined already: the ID of an author whose
# documents are stored in MongoDB. Indexing with [0] raises IndexError when
# the query returns no documents of the given type.

# Clean an article
raw_article = ArticleDocument.bulk_find(author_id=author_uuid)[0]
cleaned_article = CleaningDispatcher.dispatch(raw_article)
print(type(cleaned_article))  # CleanedArticleDocument

# Clean a post
raw_post = PostDocument.bulk_find(author_id=author_uuid)[0]
cleaned_post = CleaningDispatcher.dispatch(raw_post)
print(type(cleaned_post))  # CleanedPostDocument
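
When cleaning many documents at once, one option is to collect unsupported documents instead of letting the first ValueError abort the batch. The sketch below assumes nothing about the project beyond the fact that dispatch raises ValueError for an unknown category. clean_all and fake_dispatch are hypothetical names; fake_dispatch stands in for CleaningDispatcher.dispatch so the example runs on its own.

```python
from typing import Any, Callable, Iterable


def clean_all(
    documents: Iterable[Any],
    dispatch: Callable[[Any], Any],
) -> tuple[list[Any], list[tuple[Any, str]]]:
    """Clean every document, collecting failures instead of aborting the batch."""
    cleaned: list[Any] = []
    failed: list[tuple[Any, str]] = []
    for document in documents:
        try:
            cleaned.append(dispatch(document))
        except ValueError as exc:
            # Keep the document and the reason so it can be inspected later.
            failed.append((document, str(exc)))
    return cleaned, failed


def fake_dispatch(document: str) -> str:
    # Stand-in for CleaningDispatcher.dispatch: only "post:" inputs are supported.
    prefix = "post:"
    if not document.startswith(prefix):
        category = document.split(":")[0]
        raise ValueError(f"No cleaning handler found for category: {category}")
    return document[len(prefix):].strip()


cleaned_docs, failed_docs = clean_all(["post:  hi ", "video: clip"], fake_dispatch)
print(cleaned_docs)  # ['hi']
print(failed_docs)   # [('video: clip', 'No cleaning handler found for category: video')]
```

In real use, CleaningDispatcher.dispatch would be passed as the dispatch argument in place of fake_dispatch.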

External Dependencies

Dependency | Purpose
loguru | Structured logging for the cleaning process

Design Notes

  • The method is a staticmethod since it does not depend on instance or class state; it operates purely on the input document.
  • The factory dictionary is constructed inline within the method. This keeps the mapping co-located with the dispatch logic, making it easy to see all supported categories at a glance.
  • The type transformation from NoSQLBaseDocument to VectorBaseDocument marks a domain boundary crossing — the document transitions from the MongoDB storage domain to the Qdrant vector store domain.
  • Adding support for a new document category requires only creating a new cleaning handler class and adding an entry to the cleaning_factory dictionary.
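
The last note can be illustrated with a toy version of the dispatcher. Everything here is an assumption made for the example: the TRANSCRIPTS category, the TranscriptCleaningHandler class, and a dispatch function that takes a category and a plain string rather than a document model. The point is only that supporting the new category takes one handler class and one dictionary entry.

```python
import re
from enum import Enum


class DataCategory(Enum):
    # Stand-in enum; TRANSCRIPTS is a hypothetical new category.
    POSTS = "posts"
    TRANSCRIPTS = "transcripts"


class PostCleaningHandler:
    @staticmethod
    def clean(text: str) -> str:
        return " ".join(text.split())


class TranscriptCleaningHandler:
    """Hypothetical handler added for the new category."""

    @staticmethod
    def clean(text: str) -> str:
        # Drop "[00:12]"-style timestamps, then collapse whitespace.
        without_timestamps = re.sub(r"\[\d{2}:\d{2}\]", " ", text)
        return " ".join(without_timestamps.split())


def dispatch(category: DataCategory, text: str) -> str:
    cleaning_factory = {
        DataCategory.POSTS: PostCleaningHandler,
        DataCategory.TRANSCRIPTS: TranscriptCleaningHandler,  # the one new entry
    }
    cleaning_handler = cleaning_factory.get(category)
    if cleaning_handler is None:
        raise ValueError(f"No cleaning handler found for category: {category}")
    return cleaning_handler.clean(text)


print(dispatch(DataCategory.TRANSCRIPTS, "[00:01] hello  [00:05] world"))  # hello world
```

The dispatch logic itself is unchanged by the addition, which is the main benefit of keeping the mapping in a single dictionary.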
