
Environment:PacktPublishing LLM Engineers Handbook Selenium Chrome Crawler Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Web_Scraping
Last Updated 2026-02-08 08:00 GMT

Overview

Headless Google Chrome environment with Selenium WebDriver for automated web scraping of LinkedIn, Medium, and custom article sites.

Description

This environment provides the browser automation stack required by the Digital Data ETL crawlers. It runs Google Chrome in headless mode, driven by Selenium WebDriver, with `chromedriver-autoinstaller` automatically matching the driver version to the installed browser. The crawlers navigate to URLs, scroll pages to load dynamic content, and extract the rendered HTML, which is then parsed with BeautifulSoup and converted to text with html2text. Chrome is configured with sandbox-disabling flags for container compatibility.
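The scroll-to-load step in the description can be sketched as follows. This is illustrative, not the repository's actual crawler code: `scroll_page` and `page_fully_loaded` are assumed names, and `driver` is assumed to be a configured Selenium Chrome driver.

```python
import time


def page_fully_loaded(previous_height: int, current_height: int) -> bool:
    # Dynamic content has stopped loading once another scroll
    # no longer changes the document height.
    return current_height == previous_height


def scroll_page(driver, max_scrolls: int = 20, pause: float = 1.0) -> str:
    # Scroll to the bottom repeatedly until the page height stabilizes,
    # then return the fully rendered HTML for parsing.
    previous_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to render
        current_height = driver.execute_script("return document.body.scrollHeight")
        if page_fully_loaded(previous_height, current_height):
            break
        previous_height = current_height
    return driver.page_source
```

The returned HTML is what the description's BeautifulSoup/html2text stage would consume.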

Usage

Use this environment for the Digital Data ETL workflow, specifically for crawling LinkedIn profiles/posts, Medium articles, and custom web articles. GitHub repository crawling uses `git clone` instead and does not require this environment. The Dockerfile pre-installs Chrome for containerized deployments.

System Requirements

  • Browser: Google Chrome (stable); a Chromium-based browser also works
  • Driver: ChromeDriver, auto-installed by `chromedriver-autoinstaller`
  • OS: Linux / macOS / Windows; Linux preferred for headless mode
  • Network: Internet access, required for web crawling

Dependencies

System Packages

  • `google-chrome-stable` (or chromium-browser)
  • `gnupg` (for Chrome APT key)
  • `wget`, `curl` (for Chrome installation)

Python Packages

  • `selenium` >= 4.21.0
  • `webdriver-manager` >= 4.0.1
  • `chromedriver-autoinstaller` >= 0.6.4
  • `beautifulsoup4` >= 4.12.3
  • `html2text` >= 2024.2.26

Credentials

The following credentials are required for specific crawlers:

  • `LINKEDIN_USERNAME`: LinkedIn login email (required for LinkedIn crawler)
  • `LINKEDIN_PASSWORD`: LinkedIn login password (required for LinkedIn crawler)
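Since these are environment variables, the LinkedIn crawler can fail fast when they are missing. A minimal sketch, assuming the credentials are read from the process environment (`linkedin_credentials` is an illustrative name, not the repository's actual helper):

```python
import os


def linkedin_credentials() -> tuple[str, str]:
    # Read LinkedIn credentials, failing fast with a clear message if unset.
    username = os.environ.get("LINKEDIN_USERNAME")
    password = os.environ.get("LINKEDIN_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set LINKEDIN_USERNAME and LINKEDIN_PASSWORD "
            "before running the LinkedIn crawler."
        )
    return username, password
```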

Quick Install

# Install Chrome on Ubuntu/Debian
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg
echo "deb [signed-by=/usr/share/keyrings/google-linux-signing-key.gpg] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list
apt-get update && apt-get install -y google-chrome-stable

# Python packages are installed via Poetry (included in core dependencies)

Code Evidence

ChromeDriver auto-installation from `llm_engineering/application/crawlers/base.py:5-14`:

import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Check if the current version of chromedriver exists
# and if it doesn't exist, download it automatically,
# then add chromedriver to path
chromedriver_autoinstaller.install()

Chrome headless options from `llm_engineering/application/crawlers/base.py:28-40`:

options.add_argument("--no-sandbox")
options.add_argument("--headless=new")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--log-level=3")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-notifications")
options.add_argument("--disable-extensions")
options.add_argument("--disable-background-networking")
options.add_argument("--ignore-certificate-errors")
options.add_argument(f"--user-data-dir={mkdtemp()}")
options.add_argument(f"--data-path={mkdtemp()}")
options.add_argument(f"--disk-cache-dir={mkdtemp()}")
options.add_argument("--remote-debugging-port=9226")
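Putting the auto-installer and the options together, a driver can be constructed as below. This is a sketch, not the repository's actual base class: `build_chrome_flags` and `create_driver` are assumed names, the flag list mirrors the snippet above, and the Selenium imports are deferred so the flag helper stays importable without a browser present.

```python
from tempfile import mkdtemp


def build_chrome_flags() -> list[str]:
    # The same container-safe flags shown above, assembled in one place.
    return [
        "--no-sandbox",
        "--headless=new",
        "--disable-dev-shm-usage",
        "--log-level=3",
        "--disable-popup-blocking",
        "--disable-notifications",
        "--disable-extensions",
        "--disable-background-networking",
        "--ignore-certificate-errors",
        f"--user-data-dir={mkdtemp()}",
        f"--data-path={mkdtemp()}",
        f"--disk-cache-dir={mkdtemp()}",
        "--remote-debugging-port=9226",
    ]


def create_driver():
    # Match the driver to the installed Chrome, then start a headless session.
    import chromedriver_autoinstaller
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chromedriver_autoinstaller.install()
    options = Options()
    for flag in build_chrome_flags():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

The per-run `mkdtemp()` directories keep concurrent crawler processes from contending over a shared Chrome profile.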

Chrome installation in `Dockerfile:10-17`:

RUN apt-get update -y && \
    apt-get install -y gnupg wget curl --no-install-recommends && \
    wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | \
    gpg --dearmor -o /usr/share/keyrings/google-linux-signing-key.gpg && \
    echo "deb [signed-by=...] https://dl.google.com/linux/chrome/deb/ stable main" \
    > /etc/apt/sources.list.d/google-chrome.list && \
    apt-get update -y && \
    apt-get install -y google-chrome-stable

Common Errors

  • `WebDriverException: chrome not reachable`: Chrome is not installed. Install Google Chrome or Chromium.
  • `SessionNotCreatedException` (version mismatch): the cached ChromeDriver does not match the installed Chrome. Delete the cached chromedriver and let the autoinstaller re-download it.
  • `DevToolsActivePort file doesn't exist`: Chrome crashed inside the container. Ensure the `--no-sandbox` and `--disable-dev-shm-usage` flags are set.
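A version mismatch can be diagnosed by comparing the major versions reported by the two binaries. A hedged sketch, assuming `google-chrome` and `chromedriver` are on `PATH` (Linux command names; `major_version` and `versions_match` are illustrative helpers):

```python
import re
import subprocess


def major_version(version_output: str) -> int:
    # Pull the major version out of a `--version` banner such as
    # "Google Chrome 120.0.6099.71" or "ChromeDriver 120.0.6099.109 (...)".
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return int(match.group(1))


def versions_match() -> bool:
    # Chrome and ChromeDriver must share a major version.
    chrome = subprocess.run(
        ["google-chrome", "--version"], capture_output=True, text=True
    ).stdout
    driver = subprocess.run(
        ["chromedriver", "--version"], capture_output=True, text=True
    ).stdout
    return major_version(chrome) == major_version(driver)
```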

Compatibility Notes

  • Docker: The Dockerfile pre-installs Chrome. No additional setup needed for containerized runs.
  • macOS: Chrome must be installed manually via DMG or `brew install --cask google-chrome`.
  • Headless Mode: Uses `--headless=new` (Chrome 112+). Older Chrome versions may need `--headless` without `=new`.
  • LinkedIn: Requires valid LinkedIn credentials. LinkedIn may block automated access if too many requests are made.
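The headless-mode note above can be captured in a small helper. The version cutoff follows the note's "Chrome 112+" claim; the function name is illustrative:

```python
def headless_flag(chrome_major_version: int) -> str:
    # Chrome 112+ supports the new headless implementation;
    # older builds fall back to the legacy flag.
    return "--headless=new" if chrome_major_version >= 112 else "--headless"
```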

Related Pages

Page Connections
