Principle:PacktPublishing_LLM_Engineers_Handbook_Content_Crawling
| Aspect | Detail |
|---|---|
| Concept | Automated web content extraction (web scraping) |
| Workflow | Digital_Data_ETL |
| Pipeline Role | Data collection (core extraction step) |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract |
Overview
Content Crawling is the principle of programmatic retrieval of structured data from web pages. In the Digital Data ETL pipeline, this is the core data collection step where raw content is extracted from multiple web platforms (LinkedIn, Medium, GitHub) and transformed into structured domain documents for downstream processing. Each platform requires a specialized extraction strategy due to differences in rendering technology, DOM structure, authentication requirements, and content format.
Theoretical Foundation
Web Scraping Paradigms
Two fundamental paradigms exist for web content extraction:
- Static HTML Parsing: Fetching the raw HTML response and parsing it with libraries like BeautifulSoup. This works well for server-rendered content where the HTML contains the full page content.
- Browser Automation: Using a headless browser (Selenium, Playwright) to render JavaScript-heavy pages, then extracting content from the rendered DOM. This is necessary for single-page applications (SPAs) and platforms that load content dynamically.
The Digital Data ETL pipeline uses both paradigms depending on the target platform:
- LinkedIn and Medium use Selenium-based browser automation (via BaseSeleniumCrawler) because these platforms rely heavily on client-side rendering
- GitHub uses the LangChain GithubRepositoryReader API for structured repository access
- Custom articles use static HTML parsing with BeautifulSoup
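The static-parsing paradigm can be sketched in a few lines. This is illustrative only: the HTML here is an inline snippet with hypothetical tag names, whereas the pipeline would parse the body of an HTTP response (e.g. requests.get(url).text).

```python
from bs4 import BeautifulSoup

# Hypothetical server-rendered page: the full content is already in the HTML,
# so no browser automation is needed.
html = """
<html><body>
  <article>
    <h1>Why RAG Matters</h1>
    <div class="content"><p>Retrieval keeps answers grounded.</p></div>
  </article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
body = soup.find("div", class_="content").get_text(strip=True)
```

For a JavaScript-rendered page, the same parsing step would run on driver.page_source after a headless browser has rendered the DOM.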
Template Method Pattern
The crawling architecture follows the Template Method design pattern (GoF). The base class defines the skeleton of the crawling algorithm, while subclasses override specific steps:
BaseCrawler (abstract)
|
+-- extract(link, **kwargs) [abstract - must override]
|
+-- BaseSeleniumCrawler
|
+-- extract(link, **kwargs) [template method - defined]
| 1. Configure Chrome options
| 2. Create WebDriver
| 3. Navigate to URL
| 4. Call self.login(driver) [hook - optional override]
| 5. Call self.scroll_page(driver) [defined - scrolls to load content]
| 6. Get page source
| 7. Call self.parse_page(...) [abstract - must override]
|
+-- set_extra_driver_options(options) [hook - optional override]
+-- login(driver) [hook - optional override]
+-- scroll_page(driver) [defined - scroll behavior]
+-- parse_page(...) [abstract - must override]
This structure provides:
- Code Reuse: Common browser setup, navigation, and scrolling logic is written once
- Customization Points: Subclasses customize only what differs (authentication, parsing, driver options)
- Consistent Interface: All crawlers expose the same extract(link, **kwargs) interface
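The skeleton above can be condensed into a runnable sketch. The class names mirror the book's hierarchy, but the driver here is a stub (a real implementation would build a headless Chrome WebDriver), and TitleCrawler is a hypothetical subclass added to show the hook points.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    @abstractmethod
    def extract(self, link: str, **kwargs): ...


class BaseSeleniumCrawler(BaseCrawler):
    """Template method: extract() fixes the algorithm; subclasses fill hooks."""

    def extract(self, link: str, **kwargs):
        driver = self._create_driver()   # steps 1-2: options + WebDriver (stubbed)
        driver.get(link)                 # step 3: navigate
        self.login(driver)               # step 4: hook, default no-op
        self.scroll_page(driver)         # step 5: defined in base
        page_source = driver.page_source # step 6: rendered HTML
        return self.parse_page(page_source, **kwargs)  # step 7: abstract

    def login(self, driver):             # optional override
        pass

    def scroll_page(self, driver):       # stand-in for a real scroll loop
        driver.scrolled = True

    @abstractmethod
    def parse_page(self, page_source: str, **kwargs): ...

    def _create_driver(self):
        return _FakeDriver()             # real code: headless Chrome WebDriver


class _FakeDriver:
    page_source = "<html><h1>Hello</h1></html>"

    def get(self, url):
        self.url = url


class TitleCrawler(BaseSeleniumCrawler):
    def parse_page(self, page_source, **kwargs):
        # Toy parse: slice out the <h1> text.
        return page_source.split("<h1>")[1].split("</h1>")[0]
```

Note that TitleCrawler overrides only parse_page; navigation, login, and scrolling come from the base class unchanged.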
Platform-Specific Extraction Challenges
Each platform presents unique extraction challenges:
| Platform | Challenges | Strategy |
|---|---|---|
| LinkedIn | Authentication required, dynamic content loading, anti-scraping measures, multiple post types | Selenium with login, page scrolling, BeautifulSoup parsing of rendered HTML |
| Medium | Dynamic content loading, varied article structures, paywall considerations | Selenium with scrolling, BeautifulSoup parsing |
| GitHub | Repository structure (files, directories, branches), large codebases, API rate limits | LangChain GithubRepositoryReader with file filtering |
| Custom Articles | Unknown DOM structure, varied HTML quality, diverse content formats | Generic HTML parsing with BeautifulSoup, best-effort text extraction |
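The custom-article strategy in the last row can be sketched as a best-effort extractor. This is a hypothetical helper, not the book's exact code: it strips non-content tags and flattens whatever visible text remains.

```python
from bs4 import BeautifulSoup


def extract_article_text(html: str) -> str:
    """Best-effort extraction for pages with unknown DOM structure:
    drop non-content tags, then flatten the remaining visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove the tag and its contents
    return " ".join(soup.get_text(separator=" ").split())


sample = (
    "<html><body><nav>Menu</nav>"
    "<p>First.</p><script>x=1</script><p>Second.</p>"
    "</body></html>"
)
```

Because the DOM is unknown, this trades precision for robustness: it keeps all visible text rather than targeting specific selectors.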
Usage
Content Crawling is applied when collecting training data from web sources for ML pipelines. The typical workflow is:
- A Crawler Dispatch step selects the appropriate crawler for a given URL
- The crawler's extract() method is called with the URL and a UserDocument context
- The crawler retrieves the page content (via browser automation or HTTP request)
- Platform-specific parsing extracts the relevant content fields
- The extracted content is saved as a typed domain document (ArticleDocument, PostDocument, or RepositoryDocument) to MongoDB
Each crawler is responsible for the full lifecycle of extraction: retrieval, parsing, transformation, and persistence. This follows the Active Record pattern where domain objects manage their own persistence.
Design Considerations
- Headless Browser Management: Selenium-based crawlers must manage browser lifecycle (creation, navigation, cleanup). The base class handles this to prevent resource leaks.
- Scroll-to-Load: Many platforms use infinite scroll or lazy loading. The base class implements configurable scroll behavior to ensure all content is loaded before parsing.
- Error Handling: Individual URL failures should not halt the entire pipeline. Each crawler handles exceptions locally, logging errors and continuing.
- Rate Limiting: Production deployments should implement rate limiting between requests to avoid IP bans or API throttling. Sleep intervals between page loads help manage this.
- Content Deduplication: The pipeline should handle duplicate URLs gracefully, either by checking for existing documents before crawling or by using upsert operations during persistence.
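The rate-limiting consideration can be sketched as a small throttle between page loads. This is an illustrative pattern, not the book's code; production crawlers might add jitter or a token bucket instead of a fixed interval.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        # Sleep only for the remainder of the interval since the last request.
        sleep_for = self.min_interval - (time.monotonic() - self._last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

A crawler loop would call limiter.wait() before each driver.get(url) or HTTP request.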
Related Concepts
- Web Scraping -- the general practice of extracting data from web pages
- Template Method Pattern (GoF) -- defining algorithm skeleton in base class, deferring steps to subclasses
- Browser Automation -- using headless browsers for JavaScript-rendered content extraction
- DOM Parsing -- traversing HTML document structure to extract specific elements
- Active Record Pattern -- domain objects that encapsulate both data and persistence logic
See Also
- Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract -- the concrete implementation of this principle
- Principle:PacktPublishing_LLM_Engineers_Handbook_Crawler_Dispatch -- the dispatch mechanism that selects crawlers
- Principle:PacktPublishing_LLM_Engineers_Handbook_Document_Persistence -- how extracted content is persisted
- GitHub: PacktPublishing/LLM-Engineers-Handbook