
Implementation:PacktPublishing LLM Engineers Handbook BaseCrawler Extract

From Leeroopedia


Aspect Detail
Type API Doc
API BaseCrawler.extract(link: str, **kwargs) -> None (abstract), with concrete implementations in LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler
Source llm_engineering/application/crawlers/base.py:L11-66 (base), linkedin.py:L14-177, medium.py:L9-50, github.py:L13-69, custom_article.py:L12-55
Import from llm_engineering.application.crawlers import BaseCrawler, LinkedInCrawler, MediumCrawler, GithubCrawler, CustomArticleCrawler
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Content_Crawling

Overview

The BaseCrawler abstract class and its concrete subclasses implement the content extraction logic for the Digital Data ETL pipeline. Each crawler specializes in extracting content from a specific web platform, transforming it into a typed domain document, and persisting it to MongoDB. The base class hierarchy provides shared infrastructure (browser automation, scrolling, driver management) while allowing platform-specific customization through the Template Method pattern.

Base Class Hierarchy

BaseCrawler (Abstract)

from abc import ABC, abstractmethod

from llm_engineering.domain.base import NoSQLBaseDocument


class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass

The minimal abstract interface. All crawlers must:

  • Declare a model class attribute specifying the document type they produce
  • Implement extract() to perform the full extraction lifecycle
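
The contract above can be exercised with a minimal stand-in. This is a hedged sketch, not repo code: NoSQLBaseDocument is replaced by an in-memory stub, and StaticPageCrawler is a hypothetical subclass showing how a concrete crawler declares model and persists via save():

```python
from abc import ABC, abstractmethod


class NoSQLBaseDocument:
    """In-memory stand-in for the MongoDB-backed base document."""

    _store = []  # shared list standing in for a MongoDB collection

    def __init__(self, **fields) -> None:
        self.fields = fields

    def save(self) -> "NoSQLBaseDocument":
        type(self)._store.append(self)
        return self


class ArticleDocument(NoSQLBaseDocument):
    pass


class BaseCrawler(ABC):
    model: type[NoSQLBaseDocument]

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        pass


class StaticPageCrawler(BaseCrawler):
    """Hypothetical crawler: no network access, it only shapes a document."""

    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        # A real crawler would fetch and parse `link` here before saving.
        self.model(link=link, **kwargs).save()


StaticPageCrawler().extract("https://example.com/post", platform="custom")
```

Because extract() persists through self.model, swapping the model attribute is all a subclass needs to change its output collection.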

BaseSeleniumCrawler (Abstract)

from abc import ABC, abstractmethod

from loguru import logger
from selenium import webdriver


class BaseSeleniumCrawler(BaseCrawler, ABC):
    def set_extra_driver_options(self, options) -> None:
        pass

    def login(self, driver) -> None:
        pass

    def extract(self, link: str, **kwargs) -> None:
        logger.info(f"Starting to crawl: {link}")

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        self.set_extra_driver_options(options)

        driver = webdriver.Chrome(options=options)
        try:
            driver.get(link)
            self.login(driver)
            self.scroll_page(driver)
            page_source = driver.page_source
            self.parse_page(page_source, **kwargs)
        finally:
            driver.quit()

    def scroll_page(self, driver) -> None:
        # Scrolls page to trigger lazy loading of content
        ...

    @abstractmethod
    def parse_page(self, page_source: str, **kwargs) -> None:
        pass

Provides the template method for Selenium-based crawlers with hooks for:

  • set_extra_driver_options() -- add platform-specific Chrome options
  • login() -- perform authentication if required
  • parse_page() -- extract content from rendered HTML (abstract, must override)
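
The hook pattern can be exercised without a real browser. The sketch below is illustrative, not repo code: FakeDriver stands in for Selenium's WebDriver, the base class is condensed, and TitleCrawler uses the stdlib html.parser in place of BeautifulSoup to show a parse_page() override:

```python
from html.parser import HTMLParser


class FakeDriver:
    """Stand-in for selenium's WebDriver so the sketch runs without a browser."""

    def __init__(self, page_source: str) -> None:
        self.page_source = page_source

    def get(self, link: str) -> None:
        pass

    def quit(self) -> None:
        pass


class BaseSeleniumCrawler:
    # Default no-op hooks; the real base also builds ChromeOptions and
    # calls set_extra_driver_options before creating the driver.
    def login(self, driver) -> None:
        pass

    def scroll_page(self, driver) -> None:
        pass

    def extract(self, link: str, driver=None, **kwargs) -> None:
        driver = driver or FakeDriver("<html></html>")
        try:
            driver.get(link)
            self.login(driver)
            self.scroll_page(driver)
            self.parse_page(driver.page_source, **kwargs)
        finally:
            driver.quit()

    def parse_page(self, page_source: str, **kwargs) -> None:
        raise NotImplementedError


class _TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class TitleCrawler(BaseSeleniumCrawler):
    """Overrides only parse_page, inheriting the default login/scroll hooks."""

    def __init__(self) -> None:
        self.titles: list[str] = []

    def parse_page(self, page_source: str, **kwargs) -> None:
        parser = _TitleParser()
        parser.feed(page_source)
        self.titles.append(parser.title)


crawler = TitleCrawler()
crawler.extract(
    "https://example.com",
    driver=FakeDriver("<html><head><title>Hello</title></head></html>"),
)
```

Because extract() is a template method, the subclass never touches driver lifecycle code; it only fills in the parsing step.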

Concrete Implementations

LinkedInCrawler

Aspect Detail
Source llm_engineering/application/crawlers/linkedin.py:L14-177
Model PostDocument
Base BaseSeleniumCrawler
Auth Overrides login() to authenticate with LinkedIn credentials
Parsing Extracts post content from LinkedIn's DOM using BeautifulSoup, handles multiple post formats

MediumCrawler

Aspect Detail
Source llm_engineering/application/crawlers/medium.py:L9-50
Model ArticleDocument
Base BaseSeleniumCrawler
Auth None (public articles)
Parsing Extracts article content using BeautifulSoup, captures title and body text

GithubCrawler

Aspect Detail
Source llm_engineering/application/crawlers/github.py:L13-69
Model RepositoryDocument
Base BaseCrawler (directly, not Selenium-based)
Auth Uses GitHub API token
Extraction Clones the repository into a temporary directory, walks its file tree building a path-to-content map, and filters out ignored file extensions
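
One way to realize repository extraction is to clone into a temporary directory and then walk the tree. The walk-and-filter step can be sketched on a local directory, no git required; the helper name and ignore list here are illustrative:

```python
import os
import tempfile

IGNORE_SUFFIXES = (".lock", ".png")  # illustrative ignore list


def walk_repository(repo_path: str) -> dict[str, str]:
    """Build {relative_path: file_content}, skipping ignored extensions."""
    tree = {}
    for root, _, files in os.walk(repo_path):
        for name in files:
            if name.endswith(IGNORE_SUFFIXES):
                continue
            path = os.path.join(root, name)
            rel = os.path.relpath(path, repo_path)
            with open(path, encoding="utf-8", errors="ignore") as f:
                tree[rel] = f.read()
    return tree


# Simulate a freshly cloned repository with one source file and one ignored file.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "main.py"), "w") as f:
    f.write("print('hi')\n")
with open(os.path.join(repo, "poetry.lock"), "w") as f:
    f.write("ignored\n")

tree = walk_repository(repo)
```

The resulting dict maps file paths to contents, which is the natural shape to store on a RepositoryDocument.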

CustomArticleCrawler

Aspect Detail
Source llm_engineering/application/crawlers/custom_article.py:L12-55
Model ArticleDocument
Base BaseCrawler (directly, not Selenium-based)
Auth None
Parsing Loads pages and converts them to text via langchain_community's AsyncHtmlLoader and Html2TextTransformer; serves as fallback for unrecognized URLs
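
Since CustomArticleCrawler acts as a fallback, some component must route each URL to the right crawler. A minimal domain-based dispatch sketch (the registry and stub classes are illustrative stand-ins, not the repo's dispatcher API):

```python
import re


# Stubs standing in for the real crawler classes imported from
# llm_engineering.application.crawlers.
class LinkedInCrawler: ...
class MediumCrawler: ...
class GithubCrawler: ...
class CustomArticleCrawler: ...


# Hypothetical registry mapping URL patterns to crawler classes.
REGISTRY = {
    r"https://(www\.)?linkedin\.com": LinkedInCrawler,
    r"https://medium\.com": MediumCrawler,
    r"https://github\.com": GithubCrawler,
}


def select_crawler(link: str):
    for pattern, crawler_cls in REGISTRY.items():
        if re.match(pattern, link):
            return crawler_cls()
    return CustomArticleCrawler()  # fallback for unrecognized URLs


crawler = select_crawler("https://medium.com/@author/post")
```

Keeping the mapping in a registry means adding a new platform is one entry plus one crawler class, with the fallback untouched.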

Inputs

Parameter Type Description
link str The URL to crawl and extract content from
user UserDocument Passed as **kwargs; the resolved user entity to associate with extracted content

Outputs

The extract() method returns None. Instead, it persists domain documents directly to MongoDB as a side effect:

Crawler Document Type MongoDB Collection
LinkedInCrawler PostDocument posts
MediumCrawler ArticleDocument articles
GithubCrawler RepositoryDocument repositories
CustomArticleCrawler ArticleDocument articles

External Dependencies

Dependency Purpose
selenium Browser automation for JavaScript-rendered pages (Chrome WebDriver)
bs4 (BeautifulSoup) HTML parsing and DOM traversal for content extraction
langchain_community AsyncHtmlLoader and Html2TextTransformer for CustomArticleCrawler's HTML-to-text conversion
loguru Structured logging throughout the crawling process
pymongo MongoDB persistence (via domain document .save() and .bulk_insert() methods)

Usage Example

from llm_engineering.application.crawlers import MediumCrawler
from llm_engineering.domain.documents import UserDocument

# Resolve user (typically done in a prior pipeline step)
user = UserDocument.get_or_create(first_name="Paul", last_name="Iusztin")

# Crawl a Medium article
crawler = MediumCrawler()
crawler.extract(
    link="https://medium.com/@pauliusztin/example-article-abc123",
    user=user,
)
# ArticleDocument is now persisted to MongoDB 'articles' collection
