Implementation:PacktPublishing LLM Engineers Handbook CrawlerDispatcher Build

From Leeroopedia


Aspect Detail
Type API Doc
API CrawlerDispatcher.build() -> CrawlerDispatcher, .register_linkedin(), .register_medium(), .register_github(), .get_crawler(link: str) -> BaseCrawler
Source llm_engineering/application/crawlers/dispatcher.py:L10-51
Import from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch

Overview

The CrawlerDispatcher class implements URL-based routing to select the appropriate content crawler for a given link. It maintains an internal registry mapping URL domain patterns to crawler classes and provides a fluent interface for registration and a lookup method for dispatch.

Full Source

from loguru import logger

# Crawler classes are assumed to live in sibling modules of the same package
from .base import BaseCrawler
from .custom_article import CustomArticleCrawler
from .github import GithubCrawler
from .linkedin import LinkedInCrawler
from .medium import MediumCrawler


class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        dispatcher = cls()
        return dispatcher

    def register_linkedin(self) -> "CrawlerDispatcher":
        self._crawlers["linkedin.com"] = LinkedInCrawler
        return self

    def register_medium(self) -> "CrawlerDispatcher":
        self._crawlers["medium.com"] = MediumCrawler
        return self

    def register_github(self) -> "CrawlerDispatcher":
        self._crawlers["github.com"] = GithubCrawler
        return self

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()
        logger.warning(
            f"No crawler found for {link}. Defaulting to CustomArticleCrawler."
        )
        return CustomArticleCrawler()

Method Reference

build()

Aspect Detail
Signature @classmethod build(cls) -> CrawlerDispatcher
Inputs None
Outputs A new, empty CrawlerDispatcher instance with no registered crawlers
Description Factory method that creates a fresh dispatcher. Typically followed by chained register_* calls.

register_linkedin()

Aspect Detail
Signature register_linkedin(self) -> CrawlerDispatcher
Pattern "linkedin.com"
Crawler LinkedInCrawler
Returns self (for method chaining)

register_medium()

Aspect Detail
Signature register_medium(self) -> CrawlerDispatcher
Pattern "medium.com"
Crawler MediumCrawler
Returns self (for method chaining)

register_github()

Aspect Detail
Signature register_github(self) -> CrawlerDispatcher
Pattern "github.com"
Crawler GithubCrawler
Returns self (for method chaining)

get_crawler(link)

Aspect Detail
Signature get_crawler(self, link: str) -> BaseCrawler
Input link: str -- the URL to find a matching crawler for
Output An instance of the matching BaseCrawler subclass, or CustomArticleCrawler() if no pattern matches
Matching Logic Iterates over registered patterns; returns the first crawler whose pattern is a substring of link
Fallback Logs a warning via loguru and returns CustomArticleCrawler()
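The matching loop can be exercised in isolation. The sketch below reimplements the dispatch logic with hypothetical Stub* classes standing in for the project's real crawlers; everything except the loop itself is a placeholder:

```python
# Standalone sketch of get_crawler's matching logic. The Stub* classes
# are hypothetical placeholders for the project's real crawlers.
class StubLinkedInCrawler: ...
class StubMediumCrawler: ...
class StubGithubCrawler: ...
class StubCustomArticleCrawler: ...

registry = {
    "linkedin.com": StubLinkedInCrawler,
    "medium.com": StubMediumCrawler,
    "github.com": StubGithubCrawler,
}

def get_crawler(link: str):
    # The first registered pattern that appears anywhere in the link wins.
    for pattern, crawler_cls in registry.items():
        if pattern in link:
            return crawler_cls()
    # No pattern matched: fall back to the generic article crawler.
    return StubCustomArticleCrawler()

print(type(get_crawler("https://medium.com/@user/post")).__name__)
# StubMediumCrawler
print(type(get_crawler("https://example.org/blog/post")).__name__)
# StubCustomArticleCrawler
```

Note that matching is plain substring containment, not hostname comparison: a URL that merely mentions a pattern in its path (e.g. one containing "medium.com" after the domain) would also dispatch to the Medium crawler.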

Typical Usage

from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher

# Build and configure the dispatcher with all supported platforms
dispatcher = (
    CrawlerDispatcher.build()
    .register_linkedin()
    .register_medium()
    .register_github()
)

# Dispatch a URL to the appropriate crawler
link = "https://medium.com/@user/some-article-123abc"
crawler = dispatcher.get_crawler(link)  # returns a MediumCrawler instance

# Use the crawler to extract content; user_document is a previously
# fetched user/author record expected by the crawler's extract() method
crawler.extract(link=link, user=user_document)

Inputs

Parameter Type Method Description
(none) -- build() Factory method takes no parameters
link str get_crawler() URL string to match against registered domain patterns

Outputs

Method Return Type Description
build() CrawlerDispatcher New dispatcher instance with empty registry
register_*() CrawlerDispatcher Same instance (self) for method chaining
get_crawler() BaseCrawler Instantiated crawler matching the URL pattern, or CustomArticleCrawler as fallback

Internal State

The dispatcher holds a single internal attribute:

Attribute Type Description
_crawlers dict[str, type[BaseCrawler]] Maps URL domain substrings to crawler classes (not instances)

After full registration, the dictionary contents are:

{
    "linkedin.com": LinkedInCrawler,
    "medium.com": MediumCrawler,
    "github.com": GithubCrawler,
}
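Because the registry stores classes rather than instances, instantiation happens at dispatch time and every lookup yields a fresh crawler object. A minimal sketch of that behavior, using a hypothetical stub class rather than the project's API:

```python
# The registry maps URL patterns to crawler *classes*; instantiation
# happens inside get_crawler, so every lookup returns a new object.
class StubMediumCrawler: ...

registry = {"medium.com": StubMediumCrawler}

def get_crawler(link: str):
    for pattern, crawler_cls in registry.items():
        if pattern in link:
            return crawler_cls()  # fresh instance on every call
    raise LookupError(f"no crawler for {link}")

first = get_crawler("https://medium.com/@user/post")
second = get_crawler("https://medium.com/@user/post")
print(first is second)
# False
```

Storing classes keeps the dispatcher stateless: no crawler state (browser sessions, partial results) leaks between dispatches.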

External Dependencies

Dependency Purpose
loguru Logging warnings when no crawler matches a given URL

All crawler classes (LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler) are internal to the project.
