| Aspect | Detail |
| --- | --- |
| Type | API Doc |
| API | `BaseCrawler.extract(link: str, **kwargs) -> None` (abstract), with concrete implementations in `LinkedInCrawler`, `MediumCrawler`, `GithubCrawler`, `CustomArticleCrawler` |
| Source | `llm_engineering/application/crawlers/base.py:L11-66` (base), `linkedin.py:L14-177`, `medium.py:L9-50`, `github.py:L13-69`, `custom_article.py:L12-55` |
| Import | `from llm_engineering.application.crawlers import BaseCrawler, LinkedInCrawler, MediumCrawler, GithubCrawler` |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Content_Crawling |
Overview
The BaseCrawler abstract class and its concrete subclasses implement the content extraction logic for the Digital Data ETL pipeline. Each crawler specializes in extracting content from a specific web platform, transforming it into a typed domain document, and persisting it to MongoDB. The base class hierarchy provides shared infrastructure (browser automation, scrolling, driver management) while allowing platform-specific customization through the Template Method pattern.
Base Class Hierarchy
BaseCrawler (Abstract)
```python
class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass
```

The minimal abstract interface. All crawlers must:
- Declare a `model` class attribute specifying the document type they produce
- Implement `extract()` to perform the full extraction lifecycle
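A hypothetical subclass illustrates this contract. The stub classes below (`NoSQLBaseDocument`, `ArticleDocument`, `FeedCrawler`) stand in for the real MongoDB-backed domain documents and are not part of the library:

```python
from abc import ABC, abstractmethod


class NoSQLBaseDocument:
    """Stand-in for the real MongoDB ODM base class."""

    def save(self) -> None:
        ...  # the real implementation writes to MongoDB


class ArticleDocument(NoSQLBaseDocument):
    def __init__(self, link: str, content: str) -> None:
        self.link = link
        self.content = content


class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass


class FeedCrawler(BaseCrawler):
    # Hypothetical crawler: declares its document model and implements extract()
    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        document = self.model(link=link, content="...")
        document.save()
```

Because `extract()` is abstract, instantiating `BaseCrawler` directly raises `TypeError`; only subclasses that implement it can be constructed.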
BaseSeleniumCrawler (Abstract)
```python
class BaseSeleniumCrawler(BaseCrawler, ABC):
    def set_extra_driver_options(self, options) -> None:
        pass

    def login(self, driver) -> None:
        pass

    def extract(self, link: str, **kwargs) -> None:
        logger.info(f"Starting to crawl: {link}")

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        self.set_extra_driver_options(options)

        driver = webdriver.Chrome(options=options)
        try:
            driver.get(link)
            self.login(driver)
            self.scroll_page(driver)
            page_source = driver.page_source
            self.parse_page(page_source, **kwargs)
        finally:
            driver.quit()

    def scroll_page(self, driver) -> None:
        # Scrolls the page to trigger lazy loading of content
        ...

    @abstractmethod
    def parse_page(self, page_source: str, **kwargs) -> None:
        pass
```
Provides the template method for Selenium-based crawlers with hooks for:
- `set_extra_driver_options()` -- add platform-specific Chrome options
- `login()` -- perform authentication if required
- `parse_page()` -- extract content from rendered HTML (abstract, must override)
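The ordering guarantee of the template method (options, then login, then parse) can be sketched without a real browser. `FakeDriver` and `DemoCrawler` below are illustrative stand-ins, not part of the library, and the simplified `extract()` omits the real class's scrolling step:

```python
class FakeDriver:
    """Stands in for Selenium's Chrome driver in this sketch."""

    def __init__(self, link: str) -> None:
        self.page_source = f"<html>{link}</html>"

    def quit(self) -> None:
        pass


class BaseSeleniumCrawler:
    """Simplified template method; the real class drives headless Chrome."""

    def set_extra_driver_options(self, options: list[str]) -> None:
        pass  # hook: subclasses append platform-specific Chrome flags

    def login(self, driver) -> None:
        pass  # hook: subclasses authenticate if the platform requires it

    def parse_page(self, page_source: str, **kwargs) -> None:
        raise NotImplementedError  # abstract: subclasses must extract content

    def extract(self, link: str, **kwargs) -> None:
        options: list[str] = ["--headless"]
        self.set_extra_driver_options(options)
        driver = FakeDriver(link)  # the real code builds webdriver.Chrome
        try:
            self.login(driver)
            self.parse_page(driver.page_source, **kwargs)
        finally:
            driver.quit()


class DemoCrawler(BaseSeleniumCrawler):
    """Hypothetical subclass exercising all three hooks, recording call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def set_extra_driver_options(self, options: list[str]) -> None:
        options.append("--window-size=1920,1080")
        self.calls.append("options")

    def login(self, driver) -> None:
        self.calls.append("login")

    def parse_page(self, page_source: str, **kwargs) -> None:
        self.calls.append("parse")
```

Running `DemoCrawler().extract(link)` invokes the hooks in the fixed order `options`, `login`, `parse`, which is the point of the Template Method pattern here: subclasses customize steps, the base class owns the lifecycle.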
Concrete Implementations
LinkedInCrawler
| Aspect | Detail |
| --- | --- |
| Source | `llm_engineering/application/crawlers/linkedin.py:L14-177` |
| Model | `PostDocument` |
| Base | `BaseSeleniumCrawler` |
| Auth | Overrides `login()` to authenticate with LinkedIn credentials |
| Parsing | Extracts post content from LinkedIn's DOM using BeautifulSoup; handles multiple post formats |
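The "multiple post formats" idea can be sketched without Selenium or BeautifulSoup. The container class names and the regex-based matching below are illustrative stand-ins for the crawler's real BeautifulSoup selectors:

```python
import re

# Hypothetical container markup for different LinkedIn post layouts;
# the real crawler resolves these with BeautifulSoup selectors instead.
POST_PATTERNS = [
    r'<div class="feed-shared-update-v2">(.*?)</div>',
    r'<span class="break-words">(.*?)</span>',
]


def extract_posts(page_source: str) -> list[str]:
    """Try each known post format in turn and return the first layout
    that matches, mirroring the fallback idea in LinkedInCrawler's parsing."""
    for pattern in POST_PATTERNS:
        matches = re.findall(pattern, page_source, flags=re.DOTALL)
        if matches:
            return [m.strip() for m in matches]
    return []
```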
MediumCrawler
| Aspect | Detail |
| --- | --- |
| Source | `llm_engineering/application/crawlers/medium.py:L9-50` |
| Model | `ArticleDocument` |
| Base | `BaseSeleniumCrawler` |
GithubCrawler
| Aspect | Detail |
| --- | --- |
| Source | `llm_engineering/application/crawlers/github.py:L13-69` |
| Model | `RepositoryDocument` |
| Base | `BaseCrawler` (directly; not Selenium-based) |
| Auth | Uses a GitHub API token |
| Extraction | Uses LangChain `GithubRepositoryReader` to clone and parse repository contents; filters by file extension |
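The extension-filtering step can be illustrated independently of LangChain. The allow-list below is hypothetical; the real crawler configures its reader with a comparable filter:

```python
from pathlib import Path

# Hypothetical allow-list; the real crawler passes a similar filter
# to its repository reader.
ALLOWED_EXTENSIONS = {".py", ".md", ".toml", ".yaml"}


def collect_repository_files(repo_root: str) -> dict[str, str]:
    """Walk a cloned repository tree and keep only files whose extension
    is in the allow-list, mirroring GithubCrawler's filtering step."""
    contents: dict[str, str] = {}
    root = Path(repo_root)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in ALLOWED_EXTENSIONS:
            contents[str(path.relative_to(root))] = path.read_text(encoding="utf-8")
    return contents
```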
CustomArticleCrawler
| Aspect | Detail |
| --- | --- |
| Source | `llm_engineering/application/crawlers/custom_article.py:L12-55` |
| Model | `ArticleDocument` |
Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| `link` | `str` | The URL to crawl and extract content from |
| `user` | `UserDocument` | Passed via `**kwargs`; the resolved user entity to associate with extracted content |
Outputs
The `extract()` method returns `None`. Instead, it persists domain documents directly to MongoDB as a side effect:
| Crawler | Document Type | MongoDB Collection |
| --- | --- | --- |
| `LinkedInCrawler` | `PostDocument` | `posts` |
| `MediumCrawler` | `ArticleDocument` | `articles` |
| `GithubCrawler` | `RepositoryDocument` | `repositories` |
| `CustomArticleCrawler` | `ArticleDocument` | `articles` |
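This persist-as-side-effect contract can be demonstrated with an in-memory stand-in for a MongoDB collection. `DemoArticleCrawler` and `FAKE_ARTICLES_COLLECTION` are illustrative; the real `save()` writes through pymongo:

```python
# In-memory stand-in for the MongoDB 'articles' collection.
FAKE_ARTICLES_COLLECTION: list[dict] = []


class ArticleDocument:
    def __init__(self, link: str, content: str) -> None:
        self.link = link
        self.content = content

    def save(self) -> None:
        # The real NoSQLBaseDocument.save() writes to MongoDB via pymongo.
        FAKE_ARTICLES_COLLECTION.append({"link": self.link, "content": self.content})


class DemoArticleCrawler:
    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        document = self.model(link=link, content="fetched text")
        document.save()  # persistence is the side effect; nothing is returned
```

Callers therefore never inspect a return value; downstream pipeline steps read the freshly written documents back from the corresponding collection.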
External Dependencies
| Dependency | Purpose |
| --- | --- |
| `selenium` | Browser automation for JavaScript-rendered pages (Chrome WebDriver) |
| `bs4` (BeautifulSoup) | HTML parsing and DOM traversal for content extraction |
| `langchain_community` | `GithubRepositoryReader` for GitHub repository content access |
| `loguru` | Structured logging throughout the crawling process |
| `pymongo` | MongoDB persistence (via domain document `.save()` and `.bulk_insert()` methods) |
Usage Example
```python
from llm_engineering.application.crawlers import MediumCrawler
from llm_engineering.domain.documents import UserDocument

# Resolve user (typically done in a prior pipeline step)
user = UserDocument.get_or_create(first_name="Paul", last_name="Iusztin")

# Crawl a Medium article
crawler = MediumCrawler()
crawler.extract(
    link="https://medium.com/@pauliusztin/example-article-abc123",
    user=user,
)
# ArticleDocument is now persisted to the MongoDB 'articles' collection
```