Environment:PacktPublishing LLM Engineers Handbook Selenium Chrome Crawler Environment

Knowledge Sources	LLM Engineers Handbook Selenium
Domains	Infrastructure, Web_Scraping
Last Updated	2026-02-08 08:00 GMT

Overview

Headless Google Chrome environment with Selenium WebDriver for automated web scraping of LinkedIn, Medium, and custom article sites.

Description

This environment provides the browser automation stack required by the Digital Data ETL crawlers. It runs Google Chrome in headless mode, driven by Selenium WebDriver, with `chromedriver-autoinstaller` matching the driver version to the installed browser. The crawlers navigate to URLs, scroll pages to load dynamic content, and extract the resulting HTML, which is then parsed with BeautifulSoup and converted to text with html2text. Chrome is configured with sandbox-disabling flags (`--no-sandbox`, `--disable-dev-shm-usage`) for container compatibility.
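
The scroll-to-load step described above can be sketched as follows. This is a minimal illustration, not the project's crawler code: the function name `scroll_to_bottom` and its parameters are hypothetical, and `driver` is assumed to be any object exposing Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, max_scrolls=5, pause_seconds=1.0):
    """Scroll until the page height stops growing or max_scrolls is reached.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause_seconds)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so nothing new was loaded
        last_height = new_height
    return scrolls
```

Capping the loop with `max_scrolls` keeps the crawler from scrolling indefinitely on feeds that never stop loading; after it returns, `driver.page_source` would hold the HTML to pass on to the parser.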

Usage

Use this environment for the Digital Data ETL workflow, specifically for crawling LinkedIn profiles/posts, Medium articles, and custom web articles. GitHub repository crawling uses `git clone` instead and does not require this environment. The Dockerfile pre-installs Chrome for containerized deployments.
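
To illustrate which links need this environment, here is a small sketch that separates URLs handled by `git clone` from those that need the headless browser. The helper `needs_browser` and the `GIT_CLONE_DOMAINS` tuple are hypothetical names for illustration only; the project's own crawler dispatch may work differently.

```python
from urllib.parse import urlparse

# Assumption: only GitHub links are handled by `git clone`; everything else
# (LinkedIn, Medium, custom article sites) goes through the browser.
GIT_CLONE_DOMAINS = ("github.com",)


def needs_browser(url):
    """Return True when the URL should be crawled with Selenium/Chrome."""
    host = (urlparse(url).hostname or "").lower()
    is_git_host = any(
        host == domain or host.endswith("." + domain)
        for domain in GIT_CLONE_DOMAINS
    )
    return not is_git_host
```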

System Requirements

Category	Requirement	Notes
Browser	Google Chrome (stable)	Or Chromium-based browser
Driver	ChromeDriver	Auto-installed by `chromedriver-autoinstaller`
OS	Linux / macOS / Windows	Linux preferred for headless mode
Network	Internet access	Required for web crawling

Dependencies

System Packages

`google-chrome-stable` (or chromium-browser)
`gnupg` (for Chrome APT key)
`wget`, `curl` (for Chrome installation)

Python Packages

`selenium` >= 4.21.0
`webdriver-manager` >= 4.0.1
`chromedriver-autoinstaller` >= 0.6.4
`beautifulsoup4` >= 4.12.3
`html2text` >= 2024.2.26

Credentials

The following credentials are required for specific crawlers:

`LINKEDIN_USERNAME`: LinkedIn login email (required for LinkedIn crawler)
`LINKEDIN_PASSWORD`: LinkedIn login password (required for LinkedIn crawler)
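
A minimal sketch of failing fast when these credentials are missing, assuming they are supplied as environment variables. The function name `load_linkedin_credentials` is hypothetical, and the project may load these values through its own settings mechanism instead.

```python
import os


def load_linkedin_credentials(env=None):
    """Return (username, password) or raise if either variable is unset or empty."""
    env = os.environ if env is None else env
    required = ("LINKEDIN_USERNAME", "LINKEDIN_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing LinkedIn credentials: " + ", ".join(missing))
    return env["LINKEDIN_USERNAME"], env["LINKEDIN_PASSWORD"]
```

Checking before the browser starts avoids launching Chrome only to fail at the login form.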

Quick Install

# Install Chrome on Ubuntu/Debian (run as root, or prefix the commands with sudo)
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg
echo "deb [signed-by=/usr/share/keyrings/google-linux-signing-key.gpg] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list
apt-get update && apt-get install -y google-chrome-stable

# Python packages are installed via Poetry (included in core dependencies)

Code Evidence

ChromeDriver auto-installation from `llm_engineering/application/crawlers/base.py:5-14`:

import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Check if the current version of chromedriver exists
# and if it doesn't exist, download it automatically,
# then add chromedriver to path
chromedriver_autoinstaller.install()

Chrome headless options from `llm_engineering/application/crawlers/base.py:28-40`:

options.add_argument("--no-sandbox")
options.add_argument("--headless=new")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--log-level=3")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-notifications")
options.add_argument("--disable-extensions")
options.add_argument("--disable-background-networking")
options.add_argument("--ignore-certificate-errors")
options.add_argument(f"--user-data-dir={mkdtemp()}")
options.add_argument(f"--data-path={mkdtemp()}")
options.add_argument(f"--disk-cache-dir={mkdtemp()}")
options.add_argument("--remote-debugging-port=9226")
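
The excerpt above omits the surrounding setup, so here is a minimal, self-contained sketch of how such flags might be assembled and handed to Selenium. The helpers `build_chrome_arguments` and `make_driver` are hypothetical, and only a subset of the flags is shown; this is not the project's `base.py`.

```python
from tempfile import mkdtemp


def build_chrome_arguments(headless_flag="--headless=new"):
    """Return a list of Chrome command-line flags suited to container runs."""
    return [
        "--no-sandbox",             # Chrome's sandbox often fails inside containers
        headless_flag,
        "--disable-dev-shm-usage",  # avoid relying on a small /dev/shm
        f"--user-data-dir={mkdtemp()}",   # fresh profile directory per session
        f"--disk-cache-dir={mkdtemp()}",  # fresh cache directory per session
    ]


def make_driver():
    """Create a Chrome WebDriver using the flags above."""
    # Imported lazily so the argument builder can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in build_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Giving each session its own temporary profile and cache directories keeps state from leaking between crawls.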

Chrome installation in `Dockerfile:10-17`:

RUN apt-get update -y && \
    apt-get install -y gnupg wget curl --no-install-recommends && \
    wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | \
    gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg && \
    echo "deb [signed-by=...] https://dl.google.com/linux/chrome/deb/ stable main" \
    > /etc/apt/sources.list.d/google-chrome.list && \
    apt-get update -y && \
    apt-get install -y google-chrome-stable

Common Errors

Error Message	Cause	Solution
`WebDriverException: chrome not reachable`	Chrome not installed or failed to start	Install Google Chrome or Chromium and confirm the browser launches
`SessionNotCreatedException: version mismatch`	ChromeDriver version mismatch	Delete cached chromedriver, let autoinstaller re-download
`DevToolsActivePort file doesn't exist`	Chrome crash in container	Ensure `--no-sandbox` and `--disable-dev-shm-usage` flags are set
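
Some of these failures can be caught before Selenium starts with a quick preflight check. The sketch below looks for a Chrome executable on `PATH`; the candidate names are assumptions and vary by platform (on macOS, for example, Chrome is usually not on `PATH` at all).

```python
import shutil

# Assumption: common executable names on Linux; adjust for your platform.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_chrome_binary(candidates=CHROME_CANDIDATES):
    """Return the path of the first Chrome-like executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If `find_chrome_binary()` returns `None`, installing Chrome is the likely first step.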

Compatibility Notes

Docker: The Dockerfile pre-installs Chrome. No additional setup needed for containerized runs.
macOS: Chrome must be installed manually via DMG or `brew install --cask google-chrome`.
Headless Mode: Uses `--headless=new` (Chrome 112+). Older Chrome versions may need `--headless` without `=new`.
LinkedIn: Requires valid LinkedIn credentials. LinkedIn may block automated access if too many requests are made.
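
The headless-mode note above can be turned into a small version check. This sketch parses the text printed by `google-chrome --version` and picks a flag. The function names are hypothetical, the threshold of 112 is taken from the note above, and falling back to plain `--headless` when the version cannot be parsed is an arbitrary choice.

```python
import re


def chrome_major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 120.0.6099.109'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def headless_flag(major_version, new_headless_min=112):
    """Pick the headless flag for a given Chrome major version."""
    if major_version is not None and major_version >= new_headless_min:
        return "--headless=new"
    return "--headless"
```

For example, `headless_flag(chrome_major_version("Google Chrome 120.0.6099.109"))` yields `--headless=new`.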
