Workflow:PacktPublishing LLM Engineers Handbook Digital Data ETL
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Web_Scraping, ETL |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for collecting digital content (articles, posts, repositories) from multiple online platforms and persisting it as structured documents in a MongoDB data warehouse.
Description
This workflow implements the data collection layer of the LLM Twin system. It crawls an author's digital footprint across Medium, LinkedIn, and GitHub using platform-specific crawlers orchestrated by a dispatcher pattern. Each crawler extracts content from the source platform and transforms it into domain-specific document models (ArticleDocument, PostDocument, RepositoryDocument) that are persisted to MongoDB. The pipeline is driven by ZenML and configured through YAML files that specify the target author and their content URLs.
Usage
Run this workflow when you need to collect training data from an author's online presence. It assumes you have a list of URLs pointing to Medium articles, LinkedIn profiles/posts, or GitHub repositories, and that you want to populate the data warehouse with structured documents for downstream pipelines (feature engineering, dataset generation) to consume.
Execution Steps
Step 1: User Resolution
Resolve the target author's full name into a UserDocument in MongoDB. If the user already exists, retrieve their record; otherwise, create a new UserDocument entry. This establishes the author identity that links all crawled content together.
Key considerations:
- The user's full name is split into first and last name components
- User records are deduplicated by checking existing MongoDB entries
- The resulting UserDocument is passed to all subsequent crawling operations
Step 2: Crawler Initialization
Build and configure the CrawlerDispatcher by registering all platform-specific crawlers (LinkedIn, Medium, GitHub). The dispatcher uses a registry pattern where each crawler declares URL patterns it can handle via regex matching.
Key considerations:
- Each crawler is registered with URL patterns (e.g., linkedin.com, medium.com, github.com)
- A generic CustomArticleCrawler serves as fallback for unrecognized URLs
- The dispatcher routes each URL to the appropriate crawler at crawl time
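A registry-based dispatcher of this kind might look like the sketch below. The crawler classes here are empty placeholders, and the method names `register` and `get_crawler` as well as the exact regex are assumptions chosen for illustration; the real classes may differ.

```python
import re


class BaseCrawler:
    """Hypothetical minimal crawler interface used only for this sketch."""


class LinkedInCrawler(BaseCrawler):
    pass


class MediumCrawler(BaseCrawler):
    pass


class GithubCrawler(BaseCrawler):
    pass


class CustomArticleCrawler(BaseCrawler):
    """Fallback for URLs that match no registered pattern."""


class CrawlerDispatcher:
    def __init__(self) -> None:
        # Maps a URL regex to the crawler class that handles it.
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    def register(self, domain: str, crawler_cls: type[BaseCrawler]) -> "CrawlerDispatcher":
        # Match http(s), an optional "www.", then the exact domain
        # followed by "/" or the end of the URL.
        pattern = r"https?://(www\.)?{}(/|$)".format(re.escape(domain))
        self._crawlers[pattern] = crawler_cls
        return self  # allows chained registration

    def get_crawler(self, url: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if re.match(pattern, url):
                return crawler_cls()
        return CustomArticleCrawler()


dispatcher = (
    CrawlerDispatcher()
    .register("linkedin.com", LinkedInCrawler)
    .register("medium.com", MediumCrawler)
    .register("github.com", GithubCrawler)
)
```

Keeping the mapping in one registry means that supporting a new platform only requires a new crawler class and one `register` call; the crawl loop itself does not change.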
Step 3: Content Crawling
Iterate through all configured URLs and dispatch each to the matching platform crawler. Each crawler extracts content using platform-appropriate techniques: Selenium browser automation for LinkedIn and Medium, Git clone operations for GitHub repositories, and HTTP requests with BeautifulSoup parsing for generic articles.
Key considerations:
- LinkedIn crawling requires Selenium with Chrome/Chromium for dynamic page rendering
- Medium articles are extracted via Selenium to handle JavaScript-rendered content
- GitHub repositories are cloned locally and source files are read into a tree structure
- Failed crawls are logged but do not halt the pipeline; processing continues with remaining URLs
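The "log and continue" behaviour can be illustrated with a small loop. In this sketch, `crawl_links` and `fake_crawl` are hypothetical names; `fake_crawl` merely simulates a crawler that fails on some URLs, where a real pipeline would call the crawler selected by the dispatcher.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def crawl_links(
    links: Iterable[str], crawl_one: Callable[[str], None]
) -> tuple[list[str], list[str]]:
    """Crawl every link, logging failures instead of aborting the run."""
    succeeded: list[str] = []
    failed: list[str] = []
    for link in links:
        try:
            crawl_one(link)
        except Exception:
            # Record the failure with its traceback and move on to the next URL.
            logger.exception("Failed to crawl %s", link)
            failed.append(link)
        else:
            succeeded.append(link)
    return succeeded, failed


def fake_crawl(link: str) -> None:
    """Simulated crawler: raises for any URL containing 'broken'."""
    if "broken" in link:
        raise RuntimeError("simulated crawl failure")


links = [
    "https://medium.com/@someone/post",
    "https://example.com/broken",
    "https://github.com/org/repo",
]
succeeded, failed = crawl_links(links, fake_crawl)
```

Catching the exception per URL, rather than around the whole loop, is what keeps one unreachable page from discarding the results of all the others.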
Step 4: Document Persistence
Transform extracted content into typed domain documents (ArticleDocument, PostDocument, RepositoryDocument) and persist them to MongoDB. Each document type extends NoSQLBaseDocument, which provides the CRUD interface for the MongoDB data warehouse.
Key considerations:
- Documents are stored in MongoDB collections corresponding to their type
- Each document is linked to the author via the UserDocument reference
- Metadata (crawl timestamps, source URLs) is preserved for traceability
Step 5: Pipeline Reporting
Aggregate crawl results and attach metadata to the ZenML step context. This includes per-domain success/failure counts, enabling monitoring and debugging of crawl reliability through the ZenML dashboard.
Key considerations:
- Success and failure counts are tracked per domain (e.g., linkedin.com, medium.com)
- Metadata is visible in the ZenML pipeline dashboard for observability
- The list of crawled links is returned as a ZenML artifact for downstream dependency tracking
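The per-domain aggregation might be computed along these lines. The helper names `domain_of` and `summarize_crawl` and the shape of the resulting dictionary are assumptions; attaching the result to the ZenML step context is deliberately left out, so the sketch runs with the standard library alone.

```python
from urllib.parse import urlparse


def domain_of(link: str) -> str:
    """Return the URL's host without a leading 'www.'."""
    netloc = urlparse(link).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def summarize_crawl(results: list[tuple[str, bool]]) -> dict[str, dict[str, int]]:
    """Count successful and failed crawls per domain.

    `results` is a list of (link, success) pairs.
    """
    metadata: dict[str, dict[str, int]] = {}
    for link, success in results:
        counts = metadata.setdefault(domain_of(link), {"successful": 0, "failed": 0})
        counts["successful" if success else "failed"] += 1
    return metadata


results = [
    ("https://www.linkedin.com/in/someone", True),
    ("https://medium.com/@someone/post-a", True),
    ("https://medium.com/@someone/post-b", False),
]
metadata = summarize_crawl(results)
```

A dictionary keyed by domain makes it easy to spot a platform whose crawler has started failing, for example after a layout change on that site.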