Principle:PacktPublishing LLM Engineers Handbook Crawler Dispatch

Aspect	Detail
Concept	URL-based dispatch / routing pattern for selecting the appropriate crawler
Workflow	Digital_Data_ETL
Pipeline Role	Crawler selection and routing (between user resolution and content crawling)
Implemented By	Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build

Overview

Crawler Dispatch is the principle of dynamically routing a given URL to the correct specialized content extractor based on URL domain patterns. In a multi-source data collection pipeline, different web platforms (LinkedIn, Medium, GitHub) require fundamentally different extraction strategies due to their distinct DOM structures, authentication requirements, and content formats. The Crawler Dispatch pattern provides a clean abstraction for this routing logic.

Theoretical Foundation

Strategy Pattern

The Crawler Dispatch pattern is a direct application of the Strategy design pattern (GoF). Each crawler class encapsulates a specific extraction algorithm, and the dispatcher selects the appropriate strategy at runtime based on the input URL:

Context (CrawlerDispatcher)
  |
  +-- Strategy Interface (BaseCrawler)
        |
        +-- ConcreteStrategyA (LinkedInCrawler)
        +-- ConcreteStrategyB (MediumCrawler)
        +-- ConcreteStrategyC (GithubCrawler)
        +-- DefaultStrategy   (CustomArticleCrawler)

The key benefit is that adding a new platform requires only:

Creating a new crawler class implementing BaseCrawler
Registering it in the dispatcher with a URL pattern

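Existing crawler classes and the dispatch logic do not need to be modified; only the registration call gains an entry. This keeps the design in line with the Open/Closed Principle.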
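A minimal Python sketch of the hierarchy above, assuming BaseCrawler exposes a single abstract extract method. The method signature, the placeholder return values, and the NewsletterCrawler class are hypothetical; they only illustrate how a new strategy can be added without editing the existing ones.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Strategy interface: every crawler exposes the same extract method."""

    @abstractmethod
    def extract(self, link: str) -> str:
        ...


class MediumCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder for Medium-specific extraction logic.
        return f"medium:{link}"


class GithubCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder for GitHub-specific extraction logic.
        return f"github:{link}"


class NewsletterCrawler(BaseCrawler):
    """Hypothetical new platform: added as a new subclass only."""

    def extract(self, link: str) -> str:
        return f"newsletter:{link}"


def run(crawler: BaseCrawler, link: str) -> str:
    # The caller depends only on the interface, never on a concrete class.
    return crawler.extract(link)
```

Because run accepts any BaseCrawler, it works unchanged for NewsletterCrawler, which is the property the Strategy pattern is meant to provide.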

Registry Pattern

The dispatcher maintains an internal registry (dictionary) mapping URL domain patterns to crawler classes. This is a form of the Registry pattern where:

Keys are URL domain substrings (e.g., "linkedin.com", "medium.com")
Values are crawler class references (not instances -- classes are instantiated on demand)

The registry is populated through explicit registration methods using a fluent interface (method chaining), making the configuration both readable and composable.
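The registry and its fluent registration could be sketched as follows. This is an illustration under assumptions, not the handbook's actual code: the generic register helper, the _crawlers attribute name, and the empty crawler classes are hypothetical.

```python
from typing import Dict, Type


class LinkedInCrawler:
    """Stand-in crawler class; the real one would hold extraction logic."""


class MediumCrawler:
    """Stand-in crawler class; the real one would hold extraction logic."""


class CrawlerDispatcher:
    def __init__(self) -> None:
        # Keys: URL domain substrings. Values: crawler classes, not instances.
        self._crawlers: Dict[str, Type] = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        return cls()

    def register(self, domain: str, crawler: Type) -> "CrawlerDispatcher":
        # Hypothetical generic helper; returning self enables chaining.
        self._crawlers[domain] = crawler
        return self

    def register_linkedin(self) -> "CrawlerDispatcher":
        return self.register("linkedin.com", LinkedInCrawler)

    def register_medium(self) -> "CrawlerDispatcher":
        return self.register("medium.com", MediumCrawler)


dispatcher = CrawlerDispatcher.build().register_linkedin().register_medium()

# A class is only instantiated when it is looked up and called.
crawler = dispatcher._crawlers["medium.com"]()
```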

Command Pattern Variant

The dispatch mechanism also loosely resembles the Command pattern, where:

The URL serves as the discriminator (determining which command to execute)
Each crawler class is a command object that encapsulates the extraction logic
The dispatcher acts as the invoker that selects and executes the appropriate command

Default/Fallback Strategy

A critical design aspect is the inclusion of a fallback strategy (CustomArticleCrawler). When no registered pattern matches the input URL, the dispatcher does not fail but instead returns a general-purpose crawler. This is similar in spirit to the Null Object pattern, because callers never have to handle a missing crawler. Unlike a true null object, however, the default performs a reasonable best-effort extraction rather than doing nothing.
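The fallback behaviour can be sketched as a lookup that always returns a crawler. The standalone get_crawler function and the empty crawler classes below are assumptions made for illustration; substring matching is used because that is how this page describes the current implementation.

```python
from typing import Dict, Type


class MediumCrawler:
    """Stand-in for a platform-specific crawler."""


class CustomArticleCrawler:
    """Stand-in for the general-purpose fallback crawler."""


def get_crawler(registry: Dict[str, Type], link: str) -> object:
    # Return an instance of the first registered crawler whose pattern matches.
    for pattern, crawler_cls in registry.items():
        if pattern in link:
            return crawler_cls()
    # No pattern matched: fall back instead of raising an error.
    return CustomArticleCrawler()


registry = {"medium.com": MediumCrawler}
known = get_crawler(registry, "https://medium.com/@author/post")
unknown = get_crawler(registry, "https://example.org/article")
```

Callers can treat known and unknown the same way, since both are crawler instances rather than None or an exception.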

Usage

Crawler Dispatch is applied when building multi-source data collection systems that need to route different URLs to specialized extractors. The typical usage pattern is:

Build the dispatcher instance using the build class method
Register all supported platform crawlers via the fluent interface
For each URL to process, call get_crawler to obtain the appropriate crawler instance
Invoke the crawler's extract method with the URL

This pattern scales well because:

New platforms are added by implementing a new crawler and registering it
The pipeline orchestration code does not change when new platforms are added
URL-to-crawler mapping is centralized in one place, making it easy to audit and modify

Design Considerations

Pattern Matching Strategy: The current implementation uses simple substring matching (if pattern in link). This is efficient and sufficient for domain-level routing but does not support more complex matching (e.g., path-based routing within a domain).
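Registration Order: Matching iterates over the registry dictionary, and Python dictionaries preserve insertion order (guaranteed since Python 3.7), so the first registered pattern that matches wins. In practice, domain patterns are distinct enough that order rarely matters, but this is an implicit assumption rather than an enforced rule.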
Instance vs. Class Registration: The registry stores classes, not instances. A new crawler instance is created for each get_crawler call. This avoids shared state between crawling operations but means crawler construction cost is paid per URL.
Fluent Interface: The register_* methods return self, enabling method chaining: CrawlerDispatcher.build().register_linkedin().register_medium().register_github()
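The interaction between substring matching and registration order can be seen in a small sketch. The first_match function mirrors the substring check described above. The host_match function is a hypothetical alternative that compares only the URL's hostname; it is not part of the described implementation. The example link is invented for illustration.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def first_match(registry: Dict[str, str], link: str) -> Optional[str]:
    # Substring matching: the first registered pattern found anywhere wins.
    for pattern, name in registry.items():
        if pattern in link:
            return name
    return None


def host_match(registry: Dict[str, str], link: str) -> Optional[str]:
    # Hypothetical alternative: match against the hostname only.
    host = urlparse(link).hostname or ""
    for pattern, name in registry.items():
        if host == pattern or host.endswith("." + pattern):
            return name
    return None


# A GitHub URL whose path happens to contain another registered domain.
link = "https://github.com/user/medium.com-scraper"

medium_first = {"medium.com": "medium", "github.com": "github"}
github_first = {"github.com": "github", "medium.com": "medium"}

by_substring_a = first_match(medium_first, link)  # "medium"
by_substring_b = first_match(github_first, link)  # "github"
by_host = host_match(medium_first, link)          # "github"
```

With substring matching the result depends on which pattern was registered first, whereas the hostname comparison gives the same answer for either ordering. Whether that extra strictness is worth it depends on how varied the input URLs are.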

Related Concepts

Strategy Pattern (GoF) -- encapsulating interchangeable algorithms behind a common interface
Registry Pattern -- maintaining a central mapping of keys to handlers
Chain of Responsibility -- an alternative dispatch pattern where handlers are tried in sequence
URL Routing (web frameworks) -- analogous pattern in Flask/Django for routing HTTP requests to handlers
Plugin Architecture -- extensible systems where new capabilities are registered dynamically
