
Implementation:PacktPublishing LLM Engineers Handbook BaseCrawler Extract

From Leeroopedia


Aspect Detail
Type API Doc
API BaseCrawler.extract(link: str, **kwargs) -> None (abstract), with concrete implementations in LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler
Source llm_engineering/application/crawlers/base.py:L11-66 (base), linkedin.py:L14-177, medium.py:L9-50, github.py:L13-69, custom_article.py:L12-55
Import from llm_engineering.application.crawlers import BaseCrawler, LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Content_Crawling

Overview

The BaseCrawler abstract class and its concrete subclasses implement the content extraction logic for the Digital Data ETL pipeline. Each crawler specializes in extracting content from a specific web platform, transforming it into a typed domain document, and persisting it to MongoDB. The base class hierarchy provides shared infrastructure (browser automation, scrolling, driver management) while allowing platform-specific customization through the Template Method pattern.

Base Class Hierarchy

BaseCrawler (Abstract)

from abc import ABC, abstractmethod

from llm_engineering.domain.base import NoSQLBaseDocument


class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass

The minimal abstract interface. All crawlers must:

  • Declare a model class attribute specifying the document type they produce
  • Implement extract() to perform the full extraction lifecycle
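
The contract above can be exercised with a minimal stand-in. This is a hedged sketch, not repo code: NoSQLBaseDocument is replaced by an in-memory stub, and StaticPageCrawler is a hypothetical subclass showing how a concrete crawler declares model and persists via save():

```python
from abc import ABC, abstractmethod


class NoSQLBaseDocument:
    """In-memory stand-in for the MongoDB-backed base document."""

    _store = []  # shared list standing in for a MongoDB collection

    def __init__(self, **fields) -> None:
        self.fields = fields

    def save(self) -> "NoSQLBaseDocument":
        type(self)._store.append(self)
        return self


class ArticleDocument(NoSQLBaseDocument):
    pass


class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass


class StaticPageCrawler(BaseCrawler):
    """Hypothetical crawler: no network access, it only shapes a document."""

    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        # A real crawler would fetch and parse `link` here before saving.
        self.model(link=link, **kwargs).save()


StaticPageCrawler().extract("https://example.com/post", platform="custom")
```

Because extract() persists through self.model, swapping the model attribute is all a subclass needs to change its output collection.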

BaseSeleniumCrawler (Abstract)

from abc import ABC, abstractmethod

from loguru import logger
from selenium import webdriver


class BaseSeleniumCrawler(BaseCrawler, ABC):
    def set_extra_driver_options(self, options) -> None:
        pass

    def login(self, driver) -> None:
        pass

    def extract(self, link: str, **kwargs) -> None:
        logger.info(f"Starting to crawl: {link}")

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        self.set_extra_driver_options(options)

        driver = webdriver.Chrome(options=options)
        try:
            driver.get(link)
            self.login(driver)
            self.scroll_page(driver)
            page_source = driver.page_source
            self.parse_page(page_source, **kwargs)
        finally:
            driver.quit()

    def scroll_page(self, driver) -> None:
        # Scrolls page to trigger lazy loading of content
        ...

    @abstractmethod
    def parse_page(self, page_source: str, **kwargs) -> None:
        pass

Provides the template method for Selenium-based crawlers with hooks for:

  • set_extra_driver_options() -- add platform-specific Chrome options
  • login() -- perform authentication if required
  • parse_page() -- extract content from rendered HTML (abstract, must override)
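
The hook pattern can be exercised without a real browser. The sketch below is illustrative, not repo code: FakeDriver stands in for Selenium's WebDriver, the base class is condensed, and TitleCrawler uses the stdlib html.parser in place of BeautifulSoup to show a parse_page() override:

```python
from html.parser import HTMLParser


class FakeDriver:
    """Stand-in for selenium's WebDriver so the sketch runs without a browser."""

    def __init__(self, page_source: str) -> None:
        self.page_source = page_source

    def get(self, link: str) -> None:
        pass

    def quit(self) -> None:
        pass


class BaseSeleniumCrawler:
    # Default no-op hooks; the real base also builds ChromeOptions and
    # calls set_extra_driver_options before creating the driver.
    def login(self, driver) -> None:
        pass

    def scroll_page(self, driver) -> None:
        pass

    def extract(self, link: str, driver=None, **kwargs) -> None:
        driver = driver or FakeDriver("<html></html>")
        try:
            driver.get(link)
            self.login(driver)
            self.scroll_page(driver)
            self.parse_page(driver.page_source, **kwargs)
        finally:
            driver.quit()

    def parse_page(self, page_source: str, **kwargs) -> None:
        raise NotImplementedError


class _TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class TitleCrawler(BaseSeleniumCrawler):
    """Overrides only parse_page, inheriting the default login/scroll hooks."""

    def __init__(self) -> None:
        self.titles: list[str] = []

    def parse_page(self, page_source: str, **kwargs) -> None:
        parser = _TitleParser()
        parser.feed(page_source)
        self.titles.append(parser.title)


crawler = TitleCrawler()
crawler.extract(
    "https://example.com",
    driver=FakeDriver("<html><head><title>Hello</title></head></html>"),
)
```

Because extract() is a template method, the subclass never touches driver lifecycle code; it only fills in the parsing step.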

Concrete Implementations

LinkedInCrawler

Aspect Detail
Source llm_engineering/application/crawlers/linkedin.py:L14-177
Model PostDocument
Base BaseSeleniumCrawler
Auth Overrides login() to authenticate with LinkedIn credentials
Parsing Extracts post content from LinkedIn's DOM using BeautifulSoup, handles multiple post formats

MediumCrawler

Aspect Detail
Source llm_engineering/application/crawlers/medium.py:L9-50
Model ArticleDocument
Base BaseSeleniumCrawler
Auth None (public articles)
Parsing Extracts article content using BeautifulSoup, captures title and body text

GithubCrawler

Aspect Detail
Source llm_engineering/application/crawlers/github.py:L13-69
Model RepositoryDocument
Base BaseCrawler (directly, not Selenium-based)
Auth Uses GitHub API token
Extraction Clones the repository into a temporary directory, walks its file tree building a path-to-content map, and filters out ignored file extensions
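
One way to realize repository extraction is to clone into a temporary directory and then walk the tree. The walk-and-filter step can be sketched on a local directory, no git required; the helper name and ignore list here are illustrative:

```python
import os
import tempfile

IGNORE_SUFFIXES = (".lock", ".png")  # illustrative ignore list


def walk_repository(repo_path: str) -> dict[str, str]:
    """Build {relative_path: file_content}, skipping ignored extensions."""
    tree = {}
    for root, _, files in os.walk(repo_path):
        for name in files:
            if name.endswith(IGNORE_SUFFIXES):
                continue
            path = os.path.join(root, name)
            rel = os.path.relpath(path, repo_path)
            with open(path, encoding="utf-8", errors="ignore") as f:
                tree[rel] = f.read()
    return tree


# Simulate a freshly cloned repository with one source file and one ignored file.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "main.py"), "w") as f:
    f.write("print('hi')\n")
with open(os.path.join(repo, "poetry.lock"), "w") as f:
    f.write("ignored\n")

tree = walk_repository(repo)
```

The resulting dict maps file paths to contents, which is the natural shape to store on a RepositoryDocument.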

CustomArticleCrawler

Aspect Detail
Source llm_engineering/application/crawlers/custom_article.py:L12-55
Model ArticleDocument
Base BaseCrawler (directly, not Selenium-based)
Auth None
Parsing Loads pages and converts them to text via langchain_community's AsyncHtmlLoader and Html2TextTransformer; serves as fallback for unrecognized URLs
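
Since CustomArticleCrawler acts as a fallback, some component must route each URL to the right crawler. A minimal domain-based dispatch sketch (the registry and stub classes are illustrative stand-ins, not the repo's dispatcher API):

```python
import re


# Stubs standing in for the real crawler classes imported from
# llm_engineering.application.crawlers.
class LinkedInCrawler: ...
class MediumCrawler: ...
class GithubCrawler: ...
class CustomArticleCrawler: ...


# Hypothetical registry mapping URL patterns to crawler classes.
REGISTRY = {
    r"https://(www\.)?linkedin\.com": LinkedInCrawler,
    r"https://medium\.com": MediumCrawler,
    r"https://github\.com": GithubCrawler,
}


def select_crawler(link: str):
    for pattern, crawler_cls in REGISTRY.items():
        if re.match(pattern, link):
            return crawler_cls()
    return CustomArticleCrawler()  # fallback for unrecognized URLs


crawler = select_crawler("https://medium.com/@author/post")
```

Keeping the mapping in a registry means adding a new platform is one entry plus one crawler class, with the fallback untouched.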

Inputs

Parameter Type Description
link str The URL to crawl and extract content from
user UserDocument Passed as **kwargs; the resolved user entity to associate with extracted content

Outputs

The extract() method returns None. Instead, it persists domain documents directly to MongoDB as a side effect:

Crawler Document Type MongoDB Collection
LinkedInCrawler PostDocument posts
MediumCrawler ArticleDocument articles
GithubCrawler RepositoryDocument repositories
CustomArticleCrawler ArticleDocument articles

External Dependencies

Dependency Purpose
selenium Browser automation for JavaScript-rendered pages (Chrome WebDriver)
bs4 (BeautifulSoup) HTML parsing and DOM traversal for content extraction
langchain_community AsyncHtmlLoader and Html2TextTransformer for CustomArticleCrawler's HTML-to-text conversion
loguru Structured logging throughout the crawling process
pymongo MongoDB persistence (via domain document .save() and .bulk_insert() methods)

Usage Example

from llm_engineering.application.crawlers import MediumCrawler
from llm_engineering.domain.documents import UserDocument

# Resolve user (typically done in a prior pipeline step)
user = UserDocument.get_or_create(first_name="Paul", last_name="Iusztin")

# Crawl a Medium article
crawler = MediumCrawler()
crawler.extract(
    link="https://medium.com/@pauliusztin/example-article-abc123",
    user=user,
)
# ArticleDocument is now persisted to MongoDB 'articles' collection
