Principle:PacktPublishing LLM Engineers Handbook Crawler Dispatch
| Aspect | Detail |
|---|---|
| Concept | URL-based dispatch / routing pattern for selecting the appropriate crawler |
| Workflow | Digital_Data_ETL |
| Pipeline Role | Crawler selection and routing (between user resolution and content crawling) |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build |
Overview
Crawler Dispatch is the principle of dynamically routing a given URL to the correct specialized content extractor based on URL domain patterns. In a multi-source data collection pipeline, different web platforms (LinkedIn, Medium, GitHub) require fundamentally different extraction strategies due to their distinct DOM structures, authentication requirements, and content formats. The Crawler Dispatch pattern provides a clean abstraction for this routing logic.
Theoretical Foundation
Strategy Pattern
The Crawler Dispatch pattern is a direct application of the Strategy design pattern (GoF). Each crawler class encapsulates a specific extraction algorithm, and the dispatcher selects the appropriate strategy at runtime based on the input URL:
Context (CrawlerDispatcher)
  |
  +-- Strategy Interface (BaseCrawler)
        |
        +-- ConcreteStrategyA (LinkedInCrawler)
        +-- ConcreteStrategyB (MediumCrawler)
        +-- ConcreteStrategyC (GithubCrawler)
        +-- DefaultStrategy (CustomArticleCrawler)
The key benefit is that adding a new platform requires only:
- Creating a new crawler class implementing BaseCrawler
- Registering it in the dispatcher with a URL pattern
No existing crawler needs to be modified, which is in line with the Open/Closed Principle.
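A minimal sketch of the strategy relationship described above. The class names follow the diagram, but the `extract(link)` signature, the placeholder method bodies, and the `run` helper are assumptions made for illustration, not the project's actual code.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Strategy interface: every crawler exposes the same extract() entry point."""

    @abstractmethod
    def extract(self, link: str) -> str:
        """Return the extracted content for the given link."""


class MediumCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder body: a real crawler would fetch and parse the page.
        return f"medium article from {link}"


class GithubCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder body: a real crawler would clone or read the repository.
        return f"github repository from {link}"


def run(crawler: BaseCrawler, link: str) -> str:
    # Hypothetical caller: it depends only on the interface,
    # so any concrete crawler can be swapped in at runtime.
    return crawler.extract(link)
```

Because `run` only knows about `BaseCrawler`, supporting another platform in this sketch means adding one more subclass and leaving the caller untouched.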
Registry Pattern
The dispatcher maintains an internal registry (dictionary) mapping URL domain patterns to crawler classes. This is a form of the Registry pattern where:
- Keys are URL domain substrings (e.g., "linkedin.com", "medium.com")
- Values are crawler class references (not instances -- classes are instantiated on demand)
The registry is populated through explicit registration methods using a fluent interface (method chaining), making the configuration both readable and composable.
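The registry and its fluent registration methods might look roughly like the sketch below. The `_crawlers` attribute name, the generic `register` helper, and the empty crawler classes are assumptions; only the idea of a pattern-to-class dictionary filled through chained `register_*` calls comes from the description above.

```python
class BaseCrawler:
    """Stand-in base class; extraction logic is omitted in this sketch."""


class LinkedInCrawler(BaseCrawler):
    pass


class MediumCrawler(BaseCrawler):
    pass


class CrawlerDispatcher:
    def __init__(self) -> None:
        # Registry: URL domain substring -> crawler class (not an instance).
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        return cls()

    def register(self, pattern: str, crawler_cls: type[BaseCrawler]) -> "CrawlerDispatcher":
        # Hypothetical generic helper shared by the register_* methods.
        self._crawlers[pattern] = crawler_cls
        return self  # returning self is what enables method chaining

    def register_linkedin(self) -> "CrawlerDispatcher":
        return self.register("linkedin.com", LinkedInCrawler)

    def register_medium(self) -> "CrawlerDispatcher":
        return self.register("medium.com", MediumCrawler)


dispatcher = CrawlerDispatcher.build().register_linkedin().register_medium()
```

Storing classes rather than instances keeps the registry cheap to build; nothing is constructed until a URL actually needs a crawler.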
Command Pattern Variant
The dispatch mechanism also resembles the Command pattern, where:
- The URL serves as the discriminator (determining which command to execute)
- Each crawler class is a command object that encapsulates the extraction logic
- The dispatcher acts as the invoker that selects and executes the appropriate command
Default/Fallback Strategy
A critical design aspect is the inclusion of a fallback strategy (CustomArticleCrawler). When no registered pattern matches the input URL, the dispatcher does not fail but instead returns a general-purpose crawler. This is similar in spirit to the Null Object pattern: callers always receive a usable crawler and never need to handle a missing one. Unlike a true Null Object, the default here performs a reasonable best-effort extraction rather than doing nothing.
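The fallback behaviour can be sketched as a lookup that always returns a crawler. The standalone `get_crawler` function and the module-level `REGISTRY` dictionary are simplifications assumed for this example; the substring check mirrors the matching rule described for the dispatcher.

```python
class BaseCrawler:
    """Stand-in base class; extraction logic is omitted in this sketch."""


class GithubCrawler(BaseCrawler):
    pass


class CustomArticleCrawler(BaseCrawler):
    """General-purpose crawler used when no pattern matches."""


# Hypothetical registry with a single entry, for illustration only.
REGISTRY: dict[str, type[BaseCrawler]] = {"github.com": GithubCrawler}


def get_crawler(link: str, registry: dict[str, type[BaseCrawler]]) -> BaseCrawler:
    for pattern, crawler_cls in registry.items():
        if pattern in link:
            # A fresh instance is created on every call.
            return crawler_cls()
    # No pattern matched: fall back instead of raising an error.
    return CustomArticleCrawler()
```

With this shape, calling code never has to branch on "no crawler found"; an unknown URL simply gets the general-purpose extractor.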
Usage
Crawler Dispatch is applied when building multi-source data collection systems that need to route different URLs to specialized extractors. The typical usage pattern is:
- Build the dispatcher instance using the class method
- Register all supported platform crawlers via the fluent interface
- For each URL to process, call get_crawler to obtain the appropriate crawler instance
- Invoke the crawler's extract method with the URL
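The four steps above can be sketched end to end as follows. This is an illustrative stand-in, not the project's code: the `extract` bodies just return labelled strings, and the `crawl_all` helper and the exact method signatures are assumptions.

```python
class BaseCrawler:
    def extract(self, link: str) -> str:
        raise NotImplementedError


class LinkedInCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        return f"linkedin:{link}"  # placeholder for real extraction


class MediumCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        return f"medium:{link}"


class GithubCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        return f"github:{link}"


class CustomArticleCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        return f"article:{link}"


class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        return cls()

    def _register(self, pattern: str, crawler_cls: type[BaseCrawler]) -> "CrawlerDispatcher":
        self._crawlers[pattern] = crawler_cls
        return self

    def register_linkedin(self) -> "CrawlerDispatcher":
        return self._register("linkedin.com", LinkedInCrawler)

    def register_medium(self) -> "CrawlerDispatcher":
        return self._register("medium.com", MediumCrawler)

    def register_github(self) -> "CrawlerDispatcher":
        return self._register("github.com", GithubCrawler)

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()
        return CustomArticleCrawler()


def crawl_all(links: list[str]) -> dict[str, str]:
    # Steps 1 and 2: build the dispatcher and register the platform crawlers.
    dispatcher = (
        CrawlerDispatcher.build()
        .register_linkedin()
        .register_medium()
        .register_github()
    )
    results = {}
    for link in links:
        crawler = dispatcher.get_crawler(link)  # step 3: pick a crawler
        results[link] = crawler.extract(link)   # step 4: run the extraction
    return results
```

The loop in `crawl_all` never names a concrete crawler, so registering a further platform would not change it.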
This pattern scales well because:
- New platforms are added by implementing a new crawler and registering it
- The pipeline orchestration code does not change when new platforms are added
- URL-to-crawler mapping is centralized in one place, making it easy to audit and modify
Design Considerations
- Pattern Matching Strategy: The current implementation uses simple substring matching (if pattern in link). This is efficient and sufficient for domain-level routing but does not support more complex matching (e.g., path-based routing within a domain).
- Registration Order: Since matching iterates over the registry dictionary, the first matching pattern wins. In practice, domain patterns are sufficiently distinct that order does not matter, but this is an implicit assumption.
- Instance vs. Class Registration: The registry stores classes, not instances. A new crawler instance is created for each get_crawler call. This avoids shared state between crawling operations but means crawler construction cost is paid per URL.
- Fluent Interface: The register_* methods return self, enabling method chaining: CrawlerDispatcher.build().register_linkedin().register_medium().register_github()
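The matching and ordering considerations above can be illustrated with a small sketch. Both helper functions are hypothetical: `match_substring` mimics the substring rule described for the dispatcher, while `match_hostname` is one possible stricter alternative built on the standard library's `urllib.parse.urlparse`, shown only for comparison and not part of the described implementation.

```python
from typing import Optional
from urllib.parse import urlparse


def match_substring(link: str, patterns: list[str]) -> Optional[str]:
    # First pattern (in registration order) found anywhere in the URL wins.
    for pattern in patterns:
        if pattern in link:
            return pattern
    return None


def match_hostname(link: str, patterns: list[str]) -> Optional[str]:
    # Compare against the hostname only, ignoring path and query string.
    host = urlparse(link).hostname or ""
    for pattern in patterns:
        if host == pattern or host.endswith("." + pattern):
            return pattern
    return None


patterns = ["medium.com", "github.com"]

# A URL on an unrelated domain that mentions both platforms in its query string.
tricky = "https://example.com/share?via=github.com&src=medium.com"

by_substring = match_substring(tricky, patterns)                     # "medium.com"
by_substring_reversed = match_substring(tricky, patterns[::-1])      # "github.com"
by_hostname = match_hostname(tricky, patterns)                       # None
```

For this contrived URL the substring result depends on registration order, whereas the hostname check treats it as unmatched and would route it to the fallback crawler.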
Related Concepts
- Strategy Pattern (GoF) -- encapsulating interchangeable algorithms behind a common interface
- Registry Pattern -- maintaining a central mapping of keys to handlers
- Chain of Responsibility -- an alternative dispatch pattern where handlers are tried in sequence
- URL Routing (web frameworks) -- analogous pattern in Flask/Django for routing HTTP requests to handlers
- Plugin Architecture -- extensible systems where new capabilities are registered dynamically
See Also
- Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build -- the concrete implementation of this principle
- Principle:PacktPublishing_LLM_Engineers_Handbook_Content_Crawling -- the extraction principle that dispatched crawlers implement
- GitHub: PacktPublishing/LLM-Engineers-Handbook