Workflow:PacktPublishing LLM Engineers Handbook Digital Data ETL

Knowledge Sources	LLM Engineers Handbook, ZenML Docs, Selenium Docs
Domains	Data_Engineering, Web_Scraping, ETL
Last Updated	2026-02-08 07:45 GMT

Overview

End-to-end process for collecting digital content (articles, posts, repositories) from multiple online platforms and persisting it as structured documents in a MongoDB data warehouse.

Description

This workflow implements the data collection layer of the LLM Twin system. It crawls an author's digital footprint across Medium, LinkedIn, and GitHub using platform-specific crawlers orchestrated by a dispatcher pattern. Each crawler extracts content from the source platform and transforms it into domain-specific document models (ArticleDocument, PostDocument, RepositoryDocument) that are persisted to MongoDB. The pipeline is driven by ZenML and configured through YAML files that specify the target author and their content URLs.
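
As a rough illustration of the configuration described above, the sketch below validates an already-parsed YAML mapping before a run. The key names (`parameters`, `user_full_name`, `links`), the `EtlRunConfig` class, and the sample values are assumptions made for this example; the source does not confirm the actual file layout.

```python
from dataclasses import dataclass
from typing import Any, Mapping
from urllib.parse import urlparse


@dataclass(frozen=True)
class EtlRunConfig:
    """Hypothetical in-memory form of one YAML run configuration."""

    user_full_name: str
    links: tuple[str, ...]

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "EtlRunConfig":
        # Assumed layout: everything lives under a top-level "parameters" key.
        params = raw.get("parameters", {})
        name = str(params.get("user_full_name", "")).strip()
        links = tuple(params.get("links", ()))
        if not name:
            raise ValueError("user_full_name is required")
        # Reject anything that is not an absolute http(s) URL.
        bad = [link for link in links if urlparse(link).scheme not in ("http", "https")]
        if bad:
            raise ValueError(f"Not absolute http(s) URLs: {bad}")
        return cls(user_full_name=name, links=links)


# Example mapping, as a YAML loader might return it (values are placeholders).
config = EtlRunConfig.from_mapping(
    {
        "parameters": {
            "user_full_name": "Ada Lovelace",
            "links": ["https://medium.com/@ada/some-article", "https://github.com/ada/some-repo"],
        }
    }
)
```

Validating the configuration up front keeps malformed URLs from surfacing later as crawler failures.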

Usage

Execute this workflow when you need to collect training data from an author's online presence. It assumes you have a list of URLs pointing to Medium articles, LinkedIn profiles/posts, or GitHub repositories, and that you need to populate the data warehouse with structured documents that downstream pipelines (feature engineering, dataset generation) will consume.

Execution Steps

Step 1: User Resolution

Resolve the target author's full name into a UserDocument in MongoDB. If the user already exists, retrieve their record; otherwise, create a new UserDocument entry. This establishes the author identity that links all crawled content together.

Key considerations:

The user's full name is split into first and last name components
User records are deduplicated by checking existing MongoDB entries
The resulting UserDocument is passed to all subsequent crawling operations
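
The get-or-create behaviour of this step can be sketched as follows. This is a minimal sketch, not the project's implementation: a plain dictionary stands in for MongoDB, and the `UserDocument` fields, `InMemoryUserStore`, `split_user_full_name`, and `get_or_create_user` are names assumed for illustration. The rule that everything before the last token counts as the first name is also an assumption.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class UserDocument:
    """Simplified stand-in for the persisted user record."""

    first_name: str
    last_name: str
    id: UUID = field(default_factory=uuid4)


def split_user_full_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first, last); the last token is the last name."""
    tokens = full_name.split()
    if not tokens:
        raise ValueError("User full name is empty")
    if len(tokens) == 1:
        # Assumption: a single-token name is used for both components.
        return tokens[0], tokens[0]
    return " ".join(tokens[:-1]), tokens[-1]


class InMemoryUserStore:
    """Dictionary-backed substitute for the MongoDB user collection."""

    def __init__(self) -> None:
        self._users: dict[tuple[str, str], UserDocument] = {}

    def get_or_create(self, first_name: str, last_name: str) -> UserDocument:
        key = (first_name, last_name)
        if key not in self._users:
            # No existing record: create one so later lookups reuse it.
            self._users[key] = UserDocument(first_name, last_name)
        return self._users[key]


def get_or_create_user(store: InMemoryUserStore, full_name: str) -> UserDocument:
    first_name, last_name = split_user_full_name(full_name)
    return store.get_or_create(first_name, last_name)
```

Calling `get_or_create_user` twice with the same name returns the same record, which is the deduplication property the step relies on.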

Step 2: Crawler Initialization

Build and configure the CrawlerDispatcher by registering all platform-specific crawlers (LinkedIn, Medium, GitHub). The dispatcher uses a registry pattern: each crawler is registered under the URL patterns it handles, and incoming URLs are matched against those patterns with regular expressions.

Key considerations:

Each crawler is registered with URL patterns (e.g., linkedin.com, medium.com, github.com)
A generic CustomArticleCrawler serves as fallback for unrecognized URLs
The dispatcher routes each URL to the appropriate crawler at crawl time
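
The registry pattern described in this step might look roughly like the sketch below. The crawler classes are empty placeholders, and the method names (`register`, `get_crawler`) and the exact regular expression are assumptions for illustration rather than the project's confirmed API.

```python
import re


class BaseCrawler:
    """Common interface; real crawlers would implement extract()."""

    def extract(self, link: str, **kwargs) -> None:
        raise NotImplementedError


class LinkedInCrawler(BaseCrawler):
    pass


class MediumCrawler(BaseCrawler):
    pass


class GithubCrawler(BaseCrawler):
    pass


class CustomArticleCrawler(BaseCrawler):
    """Fallback used when no registered pattern matches."""


class CrawlerDispatcher:
    def __init__(self) -> None:
        # Maps a compiled-on-demand regex string to the crawler class it selects.
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    def register(self, domain: str, crawler_cls: type[BaseCrawler]) -> "CrawlerDispatcher":
        # Match http(s) URLs on the bare domain, with or without "www.".
        pattern = rf"^https?://(www\.)?{re.escape(domain)}(/|$)"
        self._crawlers[pattern] = crawler_cls
        return self

    def get_crawler(self, url: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if re.match(pattern, url):
                return crawler_cls()
        # Unrecognized URL: fall back to the generic article crawler.
        return CustomArticleCrawler()


dispatcher = (
    CrawlerDispatcher()
    .register("linkedin.com", LinkedInCrawler)
    .register("medium.com", MediumCrawler)
    .register("github.com", GithubCrawler)
)
```

Returning `self` from `register` allows the chained setup shown at the end. Note that this particular pattern does not match subdomains other than `www`, so such URLs would be routed to the fallback crawler.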

Step 3: Content Crawling

Iterate through all configured URLs and dispatch each to the matching platform crawler. Each crawler extracts content using platform-appropriate techniques: Selenium browser automation for LinkedIn and Medium, Git clone operations for GitHub repositories, and HTTP requests with BeautifulSoup parsing for generic articles.

Key considerations:

LinkedIn crawling requires Selenium with Chrome/Chromium for dynamic page rendering
Medium articles are extracted via Selenium to handle JavaScript-rendered content
GitHub repositories are cloned locally and source files are read into a tree structure
Failed crawls are logged but do not halt the pipeline; processing continues with remaining URLs
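
The "log and continue" behaviour of the crawl loop can be sketched as below. This is a simplified illustration: `crawl_links` and the `Crawler` protocol are hypothetical names, and the platform-specific extraction (Selenium, Git clone, HTTP requests) is hidden behind the `extract` call rather than implemented.

```python
import logging
from typing import Callable, Iterable, Protocol

logger = logging.getLogger(__name__)


class Crawler(Protocol):
    def extract(self, link: str, **kwargs) -> None: ...


def crawl_links(
    links: Iterable[str],
    get_crawler: Callable[[str], Crawler],
    **kwargs,
) -> tuple[list[str], list[str]]:
    """Crawl every link; return (succeeded, failed) without aborting on errors."""
    succeeded: list[str] = []
    failed: list[str] = []
    for link in links:
        try:
            # get_crawler plays the role of the dispatcher lookup.
            get_crawler(link).extract(link=link, **kwargs)
            succeeded.append(link)
        except Exception:
            # Record the failure with its traceback and move on to the next URL.
            logger.exception("Failed to crawl %s", link)
            failed.append(link)
    return succeeded, failed
```

Catching a broad `Exception` is a deliberate trade-off in this sketch: one unreachable page or changed layout should not discard the results of the remaining URLs.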

Step 4: Document Persistence

Transform extracted content into typed domain documents (ArticleDocument, PostDocument, RepositoryDocument) and persist them to MongoDB. Each document type extends NoSQLBaseDocument, which provides the CRUD interface for the MongoDB data warehouse.

Key considerations:

Documents are stored in MongoDB collections corresponding to their type
Each document is linked to the author via the UserDocument reference
Metadata (crawl timestamps, source URLs) is preserved for traceability

Step 5: Pipeline Reporting

Aggregate crawl results and attach metadata to the ZenML step context. The metadata includes per-domain success/failure counts, which makes it possible to monitor and debug crawl reliability through the ZenML dashboard.

Key considerations:

Success and failure counts are tracked per domain (e.g., linkedin.com, medium.com)
Metadata is visible in the ZenML pipeline dashboard for observability
The list of crawled links is returned as a ZenML artifact for downstream dependency tracking
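
The per-domain aggregation could be computed along the lines of the sketch below. The function name `build_crawl_metadata` and the shape of the resulting dictionary are assumptions; attaching the result to the ZenML step context is left out so the example runs without ZenML installed.

```python
from typing import Iterable
from urllib.parse import urlparse


def build_crawl_metadata(results: Iterable[tuple[str, bool]]) -> dict[str, dict[str, int]]:
    """Aggregate (link, succeeded) pairs into per-domain counters."""
    metadata: dict[str, dict[str, int]] = {}
    for link, succeeded in results:
        domain = urlparse(link).netloc
        if domain.startswith("www."):
            # Group "www.example.com" and "example.com" under one key.
            domain = domain[4:]
        entry = metadata.setdefault(domain, {"successful": 0, "failed": 0, "total": 0})
        entry["successful" if succeeded else "failed"] += 1
        entry["total"] += 1
    return metadata


report = build_crawl_metadata(
    [
        ("https://medium.com/@someone/a", True),
        ("https://www.medium.com/@someone/b", False),
        ("https://github.com/org/repo", True),
    ]
)
# report["medium.com"] == {"successful": 1, "failed": 1, "total": 2}
```

Keeping the aggregation as a pure function makes it easy to unit test separately from the pipeline step that publishes the numbers.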
