Implementation:PacktPublishing LLM Engineers Handbook CrawlerDispatcher Build
| Aspect | Detail |
|---|---|
| Type | API Doc |
| API | CrawlerDispatcher.build() -> CrawlerDispatcher, .register_linkedin(), .register_medium(), .register_github(), .get_crawler(link: str) -> BaseCrawler |
| Source | llm_engineering/application/crawlers/dispatcher.py:L10-51 |
| Import | from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch |
Overview
The CrawlerDispatcher class implements URL-based routing to select the appropriate content crawler for a given link. It maintains an internal registry mapping URL domain patterns to crawler classes and provides a fluent interface for registration and a lookup method for dispatch.
Full Source
```python
class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        dispatcher = cls()

        return dispatcher

    def register_linkedin(self) -> "CrawlerDispatcher":
        self._crawlers["linkedin.com"] = LinkedInCrawler

        return self

    def register_medium(self) -> "CrawlerDispatcher":
        self._crawlers["medium.com"] = MediumCrawler

        return self

    def register_github(self) -> "CrawlerDispatcher":
        self._crawlers["github.com"] = GithubCrawler

        return self

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()

        logger.warning(
            f"No crawler found for {link}. Defaulting to CustomArticleCrawler."
        )

        return CustomArticleCrawler()
```
Method Reference
build()
| Aspect | Detail |
|---|---|
| Signature | @classmethod build(cls) -> CrawlerDispatcher |
| Inputs | None |
| Outputs | A new, empty CrawlerDispatcher instance with no registered crawlers |
| Description | Factory method that creates a fresh dispatcher. Typically followed by chained register_* calls. |
register_linkedin()
| Aspect | Detail |
|---|---|
| Signature | register_linkedin(self) -> CrawlerDispatcher |
| Pattern | "linkedin.com" |
| Crawler | LinkedInCrawler |
| Returns | self (for method chaining) |
register_medium()
| Aspect | Detail |
|---|---|
| Signature | register_medium(self) -> CrawlerDispatcher |
| Pattern | "medium.com" |
| Crawler | MediumCrawler |
| Returns | self (for method chaining) |
register_github()
| Aspect | Detail |
|---|---|
| Signature | register_github(self) -> CrawlerDispatcher |
| Pattern | "github.com" |
| Crawler | GithubCrawler |
| Returns | self (for method chaining) |
get_crawler(link)
| Aspect | Detail |
|---|---|
| Signature | get_crawler(self, link: str) -> BaseCrawler |
| Input | link: str -- the URL to find a matching crawler for |
| Output | An instance of the matching BaseCrawler subclass, or CustomArticleCrawler() if no pattern matches |
| Matching Logic | Iterates over registered patterns in insertion order; returns the first crawler whose pattern is a substring of link |
| Fallback | Logs a warning via loguru and returns CustomArticleCrawler() |
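The matching behavior can be illustrated with a self-contained sketch. The stub crawler classes and the simplified `register` method below are hypothetical placeholders standing in for the project's real crawlers; only the substring-matching and fallback logic mirrors `get_crawler`:

```python
# Minimal sketch of the substring dispatch logic (stub classes, not the
# project's real crawlers).

class BaseCrawler:
    pass

class MediumCrawler(BaseCrawler):
    pass

class CustomArticleCrawler(BaseCrawler):
    pass

class Dispatcher:
    def __init__(self):
        self._crawlers = {}

    def register(self, pattern, crawler_cls):
        # Store the class (not an instance), keyed by a domain substring.
        self._crawlers[pattern] = crawler_cls
        return self

    def get_crawler(self, link):
        # First registered pattern that appears anywhere in the URL wins.
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()
        # No match: fall back to the generic article crawler.
        return CustomArticleCrawler()

dispatcher = Dispatcher().register("medium.com", MediumCrawler)
print(type(dispatcher.get_crawler("https://medium.com/@user/post")).__name__)
# MediumCrawler
print(type(dispatcher.get_crawler("https://example.com/post")).__name__)
# CustomArticleCrawler
```

Note that because matching is a plain substring check, any URL containing the pattern matches, e.g. a full path like "https://medium.com/@user/post", not just the bare domain.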
Typical Usage
```python
from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher

# Build and configure the dispatcher with all supported platforms
dispatcher = (
    CrawlerDispatcher.build()
    .register_linkedin()
    .register_medium()
    .register_github()
)

# Dispatch a URL to the appropriate crawler
link = "https://medium.com/@user/some-article-123abc"
crawler = dispatcher.get_crawler(link)  # Returns a MediumCrawler instance

# Use the crawler to extract content
crawler.extract(link=link, user=user_document)
```
Inputs
| Parameter | Type | Method | Description |
|---|---|---|---|
| (none) | -- | build() | Factory method takes no parameters |
| link | str | get_crawler() | URL string to match against registered domain patterns |
Outputs
| Method | Return Type | Description |
|---|---|---|
| build() | CrawlerDispatcher | New dispatcher instance with empty registry |
| register_*() | CrawlerDispatcher | Same instance (self) for method chaining |
| get_crawler() | BaseCrawler | Instantiated crawler matching the URL pattern, or CustomArticleCrawler as fallback |
Internal State
The dispatcher holds a single internal attribute:
| Attribute | Type | Description |
|---|---|---|
| _crawlers | dict[str, type[BaseCrawler]] | Maps URL domain substrings to crawler classes (not instances) |
After full registration, the dictionary contents are:
```python
{
    "linkedin.com": LinkedInCrawler,
    "medium.com": MediumCrawler,
    "github.com": GithubCrawler,
}
```
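Storing classes rather than instances means each dispatch constructs a fresh crawler, so no state carries over between URLs. A small sketch of this property, using a hypothetical stub in place of the project's real GithubCrawler:

```python
# GithubCrawler here is a hypothetical stub, not the project's real class.
class GithubCrawler:
    def __init__(self):
        self.visited = []  # per-instance state

# Registry maps a domain substring to the class itself.
registry = {"github.com": GithubCrawler}

# Each lookup instantiates a new, independent crawler.
first = registry["github.com"]()
second = registry["github.com"]()

first.visited.append("https://github.com/user/repo")
print(first.visited)   # ['https://github.com/user/repo']
print(second.visited)  # [] -- independent instance
```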
External Dependencies
| Dependency | Purpose |
|---|---|
| loguru | Logging warnings when no crawler matches a given URL |
All crawler classes (LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler) are internal to the project.
Source References
- Dispatcher: llm_engineering/application/crawlers/dispatcher.py:L10-51
- Crawlers package: llm_engineering/application/crawlers/
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch -- the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract -- the crawlers that the dispatcher selects
- Environment:PacktPublishing_LLM_Engineers_Handbook_Selenium_Chrome_Crawler_Environment