Implementation:PacktPublishing LLM Engineers Handbook CrawlerDispatcher Build
| Aspect | Detail |
|---|---|
| Type | API Doc |
| API | CrawlerDispatcher.build() -> CrawlerDispatcher, .register_linkedin(), .register_medium(), .register_github(), .get_crawler(link: str) -> BaseCrawler |
| Source | llm_engineering/application/crawlers/dispatcher.py:L10-51 |
| Import | from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch |
Overview
The CrawlerDispatcher class implements URL-based routing to select the appropriate content crawler for a given link. It maintains an internal registry mapping URL domain patterns to crawler classes and provides a fluent interface for registration and a lookup method for dispatch.
Full Source
```python
class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        dispatcher = cls()

        return dispatcher

    def register_linkedin(self) -> "CrawlerDispatcher":
        self._crawlers["linkedin.com"] = LinkedInCrawler

        return self

    def register_medium(self) -> "CrawlerDispatcher":
        self._crawlers["medium.com"] = MediumCrawler

        return self

    def register_github(self) -> "CrawlerDispatcher":
        self._crawlers["github.com"] = GithubCrawler

        return self

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()

        logger.warning(
            f"No crawler found for {link}. Defaulting to CustomArticleCrawler."
        )

        return CustomArticleCrawler()
```
Method Reference
build()
| Aspect | Detail |
|---|---|
| Signature | @classmethod build(cls) -> CrawlerDispatcher |
| Inputs | None |
| Outputs | A new, empty CrawlerDispatcher instance with no registered crawlers |
| Description | Factory method that creates a fresh dispatcher. Typically followed by chained register_* calls. |
register_linkedin()
| Aspect | Detail |
|---|---|
| Signature | register_linkedin(self) -> CrawlerDispatcher |
| Pattern | "linkedin.com" |
| Crawler | LinkedInCrawler |
| Returns | self (for method chaining) |
register_medium()
| Aspect | Detail |
|---|---|
| Signature | register_medium(self) -> CrawlerDispatcher |
| Pattern | "medium.com" |
| Crawler | MediumCrawler |
| Returns | self (for method chaining) |
register_github()
| Aspect | Detail |
|---|---|
| Signature | register_github(self) -> CrawlerDispatcher |
| Pattern | "github.com" |
| Crawler | GithubCrawler |
| Returns | self (for method chaining) |
get_crawler(link)
| Aspect | Detail |
|---|---|
| Signature | get_crawler(self, link: str) -> BaseCrawler |
| Input | link: str -- the URL to find a matching crawler for |
| Output | An instance of the matching BaseCrawler subclass, or CustomArticleCrawler() if no pattern matches |
| Matching Logic | Iterates over registered patterns in insertion order; returns the first crawler whose pattern is a substring of link |
| Fallback | Logs a warning via loguru and returns CustomArticleCrawler() |
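The matching behavior can be illustrated with a self-contained sketch. The stub crawler classes and the simplified `register` method below are hypothetical placeholders standing in for the project's real crawlers; only the substring-matching and fallback logic mirrors `get_crawler`:

```python
# Minimal sketch of the substring dispatch logic (stub classes, not the
# project's real crawlers).

class BaseCrawler:
    pass

class MediumCrawler(BaseCrawler):
    pass

class CustomArticleCrawler(BaseCrawler):
    pass

class Dispatcher:
    def __init__(self):
        self._crawlers = {}

    def register(self, pattern, crawler_cls):
        # Store the class (not an instance), keyed by a domain substring.
        self._crawlers[pattern] = crawler_cls
        return self

    def get_crawler(self, link):
        # First registered pattern that appears anywhere in the URL wins.
        for pattern, crawler_cls in self._crawlers.items():
            if pattern in link:
                return crawler_cls()
        # No match: fall back to the generic article crawler.
        return CustomArticleCrawler()

dispatcher = Dispatcher().register("medium.com", MediumCrawler)
print(type(dispatcher.get_crawler("https://medium.com/@user/post")).__name__)
# MediumCrawler
print(type(dispatcher.get_crawler("https://example.com/post")).__name__)
# CustomArticleCrawler
```

Note that because matching is a plain substring check, any URL containing the pattern matches, e.g. a full path like "https://medium.com/@user/post", not just the bare domain.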
Typical Usage
```python
from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher

# Build and configure the dispatcher with all supported platforms
dispatcher = (
    CrawlerDispatcher.build()
    .register_linkedin()
    .register_medium()
    .register_github()
)

# Dispatch a URL to the appropriate crawler
link = "https://medium.com/@user/some-article-123abc"
crawler = dispatcher.get_crawler(link)  # Returns a MediumCrawler instance

# Use the crawler to extract content
crawler.extract(link=link, user=user_document)
```
Inputs
| Parameter | Type | Method | Description |
|---|---|---|---|
| (none) | -- | build() | Factory method takes no parameters |
| link | str | get_crawler() | URL string to match against registered domain patterns |
Outputs
| Method | Return Type | Description |
|---|---|---|
| build() | CrawlerDispatcher | New dispatcher instance with empty registry |
| register_*() | CrawlerDispatcher | Same instance (self) for method chaining |
| get_crawler() | BaseCrawler | Instantiated crawler matching the URL pattern, or CustomArticleCrawler as fallback |
Internal State
The dispatcher holds a single internal attribute:
| Attribute | Type | Description |
|---|---|---|
| _crawlers | dict[str, type[BaseCrawler]] | Maps URL domain substrings to crawler classes (not instances) |
After full registration, the dictionary contents are:
```python
{
    "linkedin.com": LinkedInCrawler,
    "medium.com": MediumCrawler,
    "github.com": GithubCrawler,
}
```
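Storing classes rather than instances means each dispatch constructs a fresh crawler, so no state carries over between URLs. A small sketch of this property, using a hypothetical stub in place of the project's real GithubCrawler:

```python
# GithubCrawler here is a hypothetical stub, not the project's real class.
class GithubCrawler:
    def __init__(self):
        self.visited = []  # per-instance state

# Registry maps a domain substring to the class itself.
registry = {"github.com": GithubCrawler}

# Each lookup instantiates a new, independent crawler.
first = registry["github.com"]()
second = registry["github.com"]()

first.visited.append("https://github.com/user/repo")
print(first.visited)   # ['https://github.com/user/repo']
print(second.visited)  # [] -- independent instance
```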
External Dependencies
| Dependency | Purpose |
|---|---|
| loguru | Logging warnings when no crawler matches a given URL |
All crawler classes (LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler) are internal to the project.
Source References
- Dispatcher: llm_engineering/application/crawlers/dispatcher.py:L10-51
- Crawlers package: llm_engineering/application/crawlers/
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch -- the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract -- the crawlers that the dispatcher selects
- Environment:PacktPublishing_LLM_Engineers_Handbook_Selenium_Chrome_Crawler_Environment