Environment:PacktPublishing LLM Engineers Handbook Selenium Chrome Crawler Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Web_Scraping |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Headless Google Chrome environment with Selenium WebDriver for automated web scraping of LinkedIn, Medium, and custom article sites.
Description
This environment provides the browser automation stack required by the Digital Data ETL crawlers. It runs Google Chrome in headless mode, driven by Selenium WebDriver, and uses `chromedriver-autoinstaller` to download a ChromeDriver build that matches the installed Chrome version. The crawlers navigate to a URL, scroll the page to load dynamic content, and extract the HTML, which is then parsed with BeautifulSoup and converted to text with html2text. Chrome is started with flags such as `--no-sandbox` and `--disable-dev-shm-usage` so that it can run inside containers.
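The navigate, scroll, and extract flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's crawler code: the function names `fetch_rendered_html` and `html_to_markdown`, the `scroll_limit` default, and the stop-when-height-stops-growing rule are assumptions made for this example. The `driver` argument is expected to behave like a Selenium WebDriver (`get`, `execute_script`, `page_source`).

```python
import time


def fetch_rendered_html(driver, url, scroll_limit=5, pause_s=0.0):
    """Load `url`, scroll until the page height stops growing, and return the HTML."""
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(scroll_limit):
        # Jump to the bottom so lazily loaded content is requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, stop early
        last_height = new_height
    return driver.page_source


def html_to_markdown(html):
    """Strip scripts/styles with BeautifulSoup, then convert with html2text."""
    # Third-party imports are kept local so the scroll helper above
    # can be used and tested without them installed.
    from bs4 import BeautifulSoup
    import html2text

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    return converter.handle(str(soup))
```

In practice a non-zero `pause_s` is usually needed so that dynamic content has time to load between scrolls.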
Usage
Use this environment for the Digital Data ETL workflow, specifically for crawling LinkedIn profiles/posts, Medium articles, and custom web articles. GitHub repository crawling uses `git clone` instead and does not require this environment. The Dockerfile pre-installs Chrome for containerized deployments.
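As a rough illustration of this split, the sketch below decides from a URL's host whether the browser stack is needed. The function name `needs_selenium` and the host set are assumptions made for this example and are not taken from the repository.

```python
from urllib.parse import urlparse

# Hypothetical host list for illustration: sources crawled without a browser.
NO_BROWSER_HOSTS = {"github.com"}


def needs_selenium(url: str) -> bool:
    """Return True when the URL should go through the Selenium Chrome environment."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.github.com" the same as "github.com".
    if host.startswith("www."):
        host = host[4:]
    return host not in NO_BROWSER_HOSTS
```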
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Browser | Google Chrome (stable) | Or Chromium-based browser |
| Driver | ChromeDriver | Auto-installed by `chromedriver-autoinstaller` |
| OS | Linux / macOS / Windows | Linux preferred for headless mode |
| Network | Internet access | Required for web crawling |
Dependencies
System Packages
- `google-chrome-stable` (or `chromium-browser`)
- `gnupg` (for Chrome APT key)
- `wget`, `curl` (for Chrome installation)
Python Packages
- `selenium` >= 4.21.0
- `webdriver-manager` >= 4.0.1
- `chromedriver-autoinstaller` >= 0.6.4
- `beautifulsoup4` >= 4.12.3
- `html2text` >= 2024.2.26
Credentials
The following credentials are required for specific crawlers:
- `LINKEDIN_USERNAME`: LinkedIn login email (required for LinkedIn crawler)
- `LINKEDIN_PASSWORD`: LinkedIn login password (required for LinkedIn crawler)
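A minimal sketch of reading these two values from environment variables and failing early when one is missing. The helper name `load_linkedin_credentials` is hypothetical, and the project may load its settings through a different mechanism.

```python
import os


def load_linkedin_credentials(env=None):
    """Return (username, password), or raise if either variable is missing or empty."""
    env = os.environ if env is None else env
    required = ("LINKEDIN_USERNAME", "LINKEDIN_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing LinkedIn credentials: " + ", ".join(missing))
    return env["LINKEDIN_USERNAME"], env["LINKEDIN_PASSWORD"]
```

Checking before the browser starts avoids launching Chrome only to fail at the login step.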
Quick Install
# Install Chrome on Ubuntu/Debian (run as root, as in a Docker build)
apt-get update && apt-get install -y gnupg wget curl
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg
echo "deb [signed-by=/usr/share/keyrings/google-linux-signing-key.gpg] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list
apt-get update && apt-get install -y google-chrome-stable
# Python packages are installed via Poetry (included in core dependencies)
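After installation, a quick check that a Chrome or Chromium binary is actually on `PATH` can save a failed crawl later. The sketch below is one way to do that; the candidate binary names are assumptions and may differ by platform or distribution, and `find_chrome_binary` is a hypothetical helper.

```python
import shutil

# Assumed binary names; adjust for your platform or distribution.
CHROME_CANDIDATES = (
    "google-chrome-stable",
    "google-chrome",
    "chromium-browser",
    "chromium",
)


def find_chrome_binary(candidates=CHROME_CANDIDATES, which=shutil.which):
    """Return the path of the first Chrome/Chromium binary found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_chrome_binary()
    print(found or "No Chrome/Chromium binary found on PATH")
```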
Code Evidence
ChromeDriver auto-installation from `llm_engineering/application/crawlers/base.py:5-14`:
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Check if the current version of chromedriver exists
# and if it doesn't exist, download it automatically,
# then add chromedriver to path
chromedriver_autoinstaller.install()
Chrome headless options from `llm_engineering/application/crawlers/base.py:28-40`:
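# Assumes `options` is a Chrome options object created just above and `mkdtemp` is imported from `tempfile`.
options.add_argument("--no-sandbox")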
options.add_argument("--headless=new")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--log-level=3")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-notifications")
options.add_argument("--disable-extensions")
options.add_argument("--disable-background-networking")
options.add_argument("--ignore-certificate-errors")
options.add_argument(f"--user-data-dir={mkdtemp()}")
options.add_argument(f"--data-path={mkdtemp()}")
options.add_argument(f"--disk-cache-dir={mkdtemp()}")
options.add_argument("--remote-debugging-port=9226")
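To show how the two excerpts above fit together, here is a minimal, self-contained sketch that installs a matching ChromeDriver, applies a subset of the flags, and starts the browser. The helper names `build_chrome_arguments` and `create_driver` are assumptions for this example and do not reproduce the structure of `base.py`.

```python
from tempfile import mkdtemp


def build_chrome_arguments(headless: bool = True) -> list[str]:
    """Return a subset of the Chrome flags listed above, as plain strings."""
    args = [
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-extensions",
        "--disable-notifications",
        # A fresh profile directory per run avoids state leaking between crawls.
        f"--user-data-dir={mkdtemp()}",
    ]
    if headless:
        args.insert(1, "--headless=new")
    return args


def create_driver(headless: bool = True):
    """Install a matching ChromeDriver if needed and start Chrome."""
    # Third-party imports are local so the argument builder works without them.
    import chromedriver_autoinstaller
    from selenium import webdriver

    chromedriver_autoinstaller.install()
    options = webdriver.ChromeOptions()
    for arg in build_chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Call `driver.quit()` when the crawl finishes so the Chrome process and its temporary profile are released.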
Chrome installation in `Dockerfile:10-17`:
RUN apt-get update -y && \
    apt-get install -y gnupg wget curl --no-install-recommends && \
    wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | \
    gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg && \
    echo "deb [signed-by=...] https://dl.google.com/linux/chrome/deb/ stable main" \
    > /etc/apt/sources.list.d/google-chrome.list && \
    apt-get update -y && \
    apt-get install -y google-chrome-stable
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `WebDriverException: chrome not reachable` | Chrome is not installed, or it failed to start or crashed | Install Google Chrome or Chromium and confirm it starts |
| `SessionNotCreatedException: version mismatch` | The installed ChromeDriver does not match the installed Chrome version | Delete the cached chromedriver and let the autoinstaller re-download it |
| `DevToolsActivePort file doesn't exist` | Chrome crash in container | Ensure `--no-sandbox` and `--disable-dev-shm-usage` flags are set |
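The table can also be applied programmatically, for example to attach a hint when logging a failed crawl. The sketch below matches on substrings of the exception text. The helper name `suggest_fix` and the exact substrings are assumptions, and real Selenium messages may be worded differently.

```python
# (substrings to look for, suggested fix) pairs paraphrasing the table above.
KNOWN_ERRORS = (
    (("chrome not reachable",), "Install Google Chrome or Chromium"),
    (
        ("sessionnotcreated", "session not created"),
        "Delete the cached chromedriver and let the autoinstaller re-download it",
    ),
    (
        ("devtoolsactiveport",),
        "Ensure --no-sandbox and --disable-dev-shm-usage flags are set",
    ),
)


def suggest_fix(message: str):
    """Return a remediation hint for a known error message, or None."""
    lowered = message.lower()
    for needles, hint in KNOWN_ERRORS:
        if any(needle in lowered for needle in needles):
            return hint
    return None
```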
Compatibility Notes
- Docker: The Dockerfile pre-installs Chrome. No additional setup needed for containerized runs.
- macOS: Chrome must be installed manually via DMG or `brew install --cask google-chrome`.
- Headless Mode: Uses `--headless=new` (Chrome 112+). Older Chrome versions may need `--headless` without `=new`.
- LinkedIn: Requires valid LinkedIn credentials. LinkedIn may block automated access if too many requests are made.
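The headless-mode note can be turned into a small version check. This sketch assumes the version text looks like the output of `google-chrome --version` (for example `Google Chrome 120.0.6099.109`) and takes the 112 threshold from the note above. Both helper names are hypothetical.

```python
import re


def parse_chrome_major(version_output: str):
    """Extract the major version from text such as 'Google Chrome 120.0.6099.109'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def headless_flag(chrome_major, new_headless_min: int = 112) -> str:
    """Pick the headless flag for a Chrome major version (None means unknown)."""
    if chrome_major is not None and chrome_major >= new_headless_min:
        return "--headless=new"
    # Fall back to the plain flag for older or unknown versions.
    return "--headless"
```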