Principle:PacktPublishing LLM Engineers Handbook Crawler Dispatch

From Leeroopedia


Concept: URL-based dispatch / routing pattern for selecting the appropriate crawler
Workflow: Digital_Data_ETL
Pipeline Role: Crawler selection and routing (between user resolution and content crawling)
Implemented By: Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build

Overview

Crawler Dispatch is the principle of dynamically routing a given URL to the correct specialized content extractor based on URL domain patterns. In a multi-source data collection pipeline, different web platforms (LinkedIn, Medium, GitHub) require fundamentally different extraction strategies because their DOM structures, authentication requirements, and content formats differ. The Crawler Dispatch pattern isolates this routing logic in a single component, so the rest of the pipeline never needs to know which platform a URL belongs to.

Theoretical Foundation

Strategy Pattern

The Crawler Dispatch pattern is a direct application of the Strategy design pattern (GoF). Each crawler class encapsulates a specific extraction algorithm, and the dispatcher selects the appropriate strategy at runtime based on the input URL:

Context (CrawlerDispatcher)
  |
  +-- Strategy Interface (BaseCrawler)
        |
        +-- ConcreteStrategyA (LinkedInCrawler)
        +-- ConcreteStrategyB (MediumCrawler)
        +-- ConcreteStrategyC (GithubCrawler)
        +-- DefaultStrategy   (CustomArticleCrawler)

The key benefit is that adding a new platform requires only:

  1. Creating a new crawler class implementing BaseCrawler
  2. Registering it in the dispatcher with a URL pattern

No existing crawler or matching logic needs to be modified -- only a registration is added -- which satisfies the Open/Closed Principle.
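The Strategy structure above can be sketched as follows. This is a minimal illustration, not the project's code: the `extract` signature, the string return values, and the `run` helper are assumptions made for the example, and only two of the concrete crawlers are shown.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Strategy interface: every crawler exposes the same extract method."""

    @abstractmethod
    def extract(self, link: str) -> str:
        """Return extracted content for the given link (assumed signature)."""


class GithubCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder for repository-specific extraction logic.
        return f"github:{link}"


class MediumCrawler(BaseCrawler):
    def extract(self, link: str) -> str:
        # Placeholder for article-specific extraction logic.
        return f"medium:{link}"


def run(crawler: BaseCrawler, link: str) -> str:
    # Hypothetical caller: it depends only on the interface,
    # never on a concrete crawler class.
    return crawler.extract(link)
```

Because `run` accepts any `BaseCrawler`, a new platform only needs a new subclass; the caller stays untouched.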

Registry Pattern

The dispatcher maintains an internal registry (dictionary) mapping URL domain patterns to crawler classes. This is a form of the Registry pattern where:

  • Keys are URL domain substrings (e.g., "linkedin.com", "medium.com")
  • Values are crawler class references (not instances -- classes are instantiated on demand)

The registry is populated through explicit registration methods using a fluent interface (method chaining), making the configuration both readable and composable.
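A registry with fluent registration might look like the sketch below. The method names `build`, `register_linkedin`, and `register_medium` follow the names used on this page; the generic `register` helper, the `_crawlers` attribute, and the empty crawler classes are assumptions for illustration.

```python
class LinkedInCrawler:
    """Stand-in for the real crawler class."""


class MediumCrawler:
    """Stand-in for the real crawler class."""


class CrawlerDispatcher:
    def __init__(self) -> None:
        # Registry: URL domain substring -> crawler class (not an instance).
        self._crawlers: dict = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        return cls()

    def register(self, pattern: str, crawler_cls: type) -> "CrawlerDispatcher":
        # Hypothetical generic helper; returning self enables chaining.
        self._crawlers[pattern] = crawler_cls
        return self

    def register_linkedin(self) -> "CrawlerDispatcher":
        return self.register("linkedin.com", LinkedInCrawler)

    def register_medium(self) -> "CrawlerDispatcher":
        return self.register("medium.com", MediumCrawler)


dispatcher = CrawlerDispatcher.build().register_linkedin().register_medium()
```

Storing classes rather than instances keeps registration cheap: nothing is constructed until a URL actually needs a crawler.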

Command Pattern Variant

The dispatch mechanism also resembles the Command pattern, where:

  • The URL serves as the discriminator (determining which command to execute)
  • Each crawler instance is a command object that encapsulates the extraction logic
  • The dispatcher acts as the invoker that selects and executes the appropriate command

Default/Fallback Strategy

A critical design aspect is the inclusion of a fallback strategy (CustomArticleCrawler). When no registered pattern matches the input URL, the dispatcher does not fail; it returns a general-purpose crawler instead. This is a variant of the Null Object pattern in which the default performs a reasonable best-effort extraction rather than doing nothing or raising an error.
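The fallback behavior can be illustrated with a small lookup function. This is a sketch under assumptions: the free function `get_crawler` taking the registry as an argument is a simplification for the example (on this page it is a dispatcher method), and the crawler classes are empty stand-ins.

```python
class GithubCrawler:
    """Stand-in for a specialized crawler."""


class CustomArticleCrawler:
    """Stand-in for the general-purpose fallback crawler."""


def get_crawler(registry: dict, link: str):
    # Try each registered pattern; instantiate the first class that matches.
    for pattern, crawler_cls in registry.items():
        if pattern in link:
            return crawler_cls()
    # No pattern matched: return the default crawler instead of raising.
    return CustomArticleCrawler()


registry = {"github.com": GithubCrawler}
known = get_crawler(registry, "https://github.com/org/repo")
unknown = get_crawler(registry, "https://example.com/blog/post")
```

Here `known` is a `GithubCrawler` and `unknown` is a `CustomArticleCrawler`, so callers never need a special case for unrecognized domains.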

Usage

Crawler Dispatch is applied when building multi-source data collection systems that need to route different URLs to specialized extractors. The typical usage pattern is:

  1. Build the dispatcher instance using the build class method
  2. Register all supported platform crawlers via the fluent interface
  3. For each URL to process, call get_crawler to obtain the appropriate crawler instance
  4. Invoke the crawler's extract method with the URL
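The four steps above can be sketched end to end as follows. The dispatcher here is a compact stand-in written for this example, and `crawl_all`, the single-argument `extract`, and its string return value are assumptions rather than the project's API.

```python
class BaseCrawler:
    def extract(self, link: str) -> str:
        # Placeholder extraction: report which crawler handled the link.
        return f"{type(self).__name__}:{link}"


class GithubCrawler(BaseCrawler):
    pass


class CustomArticleCrawler(BaseCrawler):
    pass


class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers: dict = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        return cls()

    def register_github(self) -> "CrawlerDispatcher":
        self._crawlers["github.com"] = GithubCrawler
        return self

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()
        return CustomArticleCrawler()


def crawl_all(links: list) -> list:
    # Steps 1 and 2: build the dispatcher and register supported platforms.
    dispatcher = CrawlerDispatcher.build().register_github()
    results = []
    for link in links:
        crawler = dispatcher.get_crawler(link)  # Step 3: select a crawler.
        results.append(crawler.extract(link))   # Step 4: run the extraction.
    return results
```

Note that `crawl_all` contains no platform-specific branching; supporting another platform would change only the registration line.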

This pattern scales well because:

  • New platforms are added by implementing a new crawler and registering it
  • The pipeline orchestration code does not change when new platforms are added
  • URL-to-crawler mapping is centralized in one place, making it easy to audit and modify

Design Considerations

  • Pattern Matching Strategy: The current implementation uses simple substring matching (if pattern in link). This is efficient and sufficient for domain-level routing, but it does not support more complex matching (e.g., path-based routing within a domain), and a pattern also matches when it appears outside the host, for example in a query string.
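  • Registration Order: Matching iterates over the registry dictionary, and Python dictionaries preserve insertion order, so the first registered pattern that matches wins. In practice, domain patterns are sufficiently distinct that order does not matter, but this is an implicit assumption that breaks as soon as one pattern is a substring of another.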
  • Instance vs. Class Registration: The registry stores classes, not instances. A new crawler instance is created for each get_crawler call. This avoids shared state between crawling operations but means crawler construction cost is paid per URL.
  • Fluent Interface: The register_* methods return self, enabling method chaining: CrawlerDispatcher.build().register_linkedin().register_medium().register_github()
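The matching and ordering considerations above can be made concrete with a short comparison. `match_substring` mirrors the substring check described on this page; `match_hostname` is a hypothetical stricter alternative built on the standard library's `urllib.parse`, shown only to illustrate the trade-off and not something the implementation is known to use.

```python
from typing import List, Optional
from urllib.parse import urlparse


def match_substring(patterns: List[str], link: str) -> Optional[str]:
    # First pattern found anywhere in the URL wins.
    for pattern in patterns:
        if pattern in link:
            return pattern
    return None


def match_hostname(patterns: List[str], link: str) -> Optional[str]:
    # Hypothetical alternative: compare against the parsed host only,
    # accepting the domain itself or any of its subdomains.
    host = urlparse(link).hostname or ""
    for pattern in patterns:
        if host == pattern or host.endswith("." + pattern):
            return pattern
    return None


shared = "https://example.com/share?via=medium.com"
false_positive = match_substring(["medium.com"], shared)  # matches the query string
strict_result = match_hostname(["medium.com"], shared)    # host is example.com

# Overlapping patterns: the earlier registration shadows the later one.
shadowed = match_substring(["github.com", "gist.github.com"],
                           "https://gist.github.com/user/abc")
```

With substring matching, `false_positive` is `"medium.com"` even though the host is `example.com`, and `shadowed` is `"github.com"` because it was listed first. Host-based matching avoids the first problem, while the second is solved only by registering more specific patterns before general ones.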

Related Concepts

  • Strategy Pattern (GoF) -- encapsulating interchangeable algorithms behind a common interface
  • Registry Pattern -- maintaining a central mapping of keys to handlers
  • Chain of Responsibility -- an alternative dispatch pattern where handlers are tried in sequence
  • URL Routing (web frameworks) -- analogous pattern in Flask/Django for routing HTTP requests to handlers
  • Plugin Architecture -- extensible systems where new capabilities are registered dynamically
