Implementation:CrewAIInc CrewAI Scrape Website Tool

From Leeroopedia
Revision as of 11:09, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/CrewAIInc_CrewAI_Scrape_Website_Tool.md)
Knowledge Sources
Domains: Tools, Web_Scraping
Last Updated: 2026-02-11 00:00 GMT

Overview

ScrapeWebsiteTool fetches and extracts clean text content from web pages using HTTP requests and BeautifulSoup.

Description

ScrapeWebsiteTool extends BaseTool and follows the dual-schema pattern: ScrapeWebsiteToolSchema requires a website_url argument, while FixedScrapeWebsiteToolSchema is empty and is used when the URL is pre-configured. The tool sends browser-like default headers to reduce the chance of requests being blocked. On initialization, it checks that BeautifulSoup is available and raises an ImportError with install instructions if it is not. If a website_url is provided at init time, the tool is locked to that URL and switches to the fixed schema. Cookie values are read from environment variables.

The _run() method performs an HTTP GET request with a 15-second timeout, sets the response encoding from apparent_encoding so that characters decode correctly, parses the HTML with BeautifulSoup, and extracts all text with get_text(" "). It then applies two regex substitutions to clean up whitespace: the first collapses runs of spaces and tabs, and the second condenses sections dominated by blank lines.
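The whitespace cleanup described above can be sketched with the standard library alone. The two patterns below are assumptions chosen to match the description (collapse runs of spaces and tabs, then condense whitespace that spans line breaks), and the function name clean_scraped_text is hypothetical; the tool's actual expressions may differ.

```python
import re


def clean_scraped_text(text: str) -> str:
    """Collapse whitespace in text already extracted from HTML.

    Hypothetical helper; the patterns are illustrative assumptions.
    """
    # Step 1: squeeze every run of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Step 2: replace whitespace runs that span a line break (at least one
    # whitespace character on each side of a newline) with a single newline.
    return re.sub(r"\s+\n\s+", "\n", text)


raw = "Hello   \t world\n\n \n  Next  line"
cleaned = clean_scraped_text(raw)
print(repr(cleaned))
```

Running the two substitutions in this order means the second pattern only ever sees single spaces, which keeps it simple.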

Usage

Use this tool as the default lightweight scraper when agents need to read website content and no browser engine is required. Because it issues a plain HTTP request, it does not execute JavaScript, which makes it simpler than the Selenium, Scrapfly, and Scrapegraph alternatives.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py
  • Lines: 1-89

Signature

from typing import Any

from crewai.tools import BaseTool
from pydantic import BaseModel, Field


class FixedScrapeWebsiteToolSchema(BaseModel):
    # Empty schema: used when the URL is fixed at init time.
    pass


class ScrapeWebsiteToolSchema(FixedScrapeWebsiteToolSchema):
    website_url: str = Field(..., description="Mandatory website url to read the file")


class ScrapeWebsiteTool(BaseTool):
    name: str = "Read website content"
    description: str = "A tool that can be used to read a website content."
    args_schema: type[BaseModel] = ScrapeWebsiteToolSchema
    website_url: str | None = None
    cookies: dict | None = None
    headers: dict | None  # default browser-like headers (value elided here)

    # Method bodies are elided; only the signatures are shown.
    def __init__(self, website_url=None, cookies=None, **kwargs): ...
    def _run(self, **kwargs) -> Any: ...
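The cookies argument is described as reading its values from environment variables. One way this could work is sketched below. The spec shape ({"name": ..., "value": ...}, where "value" holds the name of an environment variable), the helper name resolve_cookie, and the variable DEMO_SESSION_TOKEN are all assumptions made for illustration, not confirmed details of the tool.

```python
import os
from typing import Dict, Optional


def resolve_cookie(cookie_spec: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map a cookie name to a value read from an environment variable.

    Assumed (hypothetical) spec shape:
        {"name": <cookie name>, "value": <environment variable name>}
    """
    # os.getenv returns None when the variable is not set.
    return {cookie_spec["name"]: os.getenv(cookie_spec["value"])}


# Set here for demonstration only; normally the secret lives outside the code.
os.environ["DEMO_SESSION_TOKEN"] = "abc123"
cookies = resolve_cookie({"name": "session", "value": "DEMO_SESSION_TOKEN"})
print(cookies)
```

Keeping the secret in the environment means the cookie value never has to appear in tool configuration or agent prompts.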

Import

from crewai_tools import ScrapeWebsiteTool

I/O Contract

Inputs

Name | Type | Required | Description
website_url | str | Yes, unless set at init | URL of the website to scrape

Outputs

Name | Type | Description
_run() return value | str | Cleaned text content of the website, prefixed with "The following text is scraped website content:\n\n"
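Downstream code that only wants the page text may need to drop the fixed prefix. A minimal sketch follows, assuming the prefix is exactly the string given in the table; the helper name strip_scrape_prefix and the sample result are hypothetical.

```python
# Prefix as documented in the Outputs table.
SCRAPE_PREFIX = "The following text is scraped website content:\n\n"


def strip_scrape_prefix(result: str) -> str:
    """Return the scraped text without the leading prefix (hypothetical helper)."""
    # str.removeprefix (Python 3.9+) returns the string unchanged
    # when it does not start with the given prefix.
    return result.removeprefix(SCRAPE_PREFIX)


sample = SCRAPE_PREFIX + "Example Domain\nThis domain is for use in examples."
body = strip_scrape_prefix(sample)
print(body)
```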

Usage Examples

Basic Usage

from crewai_tools import ScrapeWebsiteTool

# Dynamic URL
tool = ScrapeWebsiteTool()
result = tool._run(website_url="https://example.com")

# Pre-configured URL
tool = ScrapeWebsiteTool(website_url="https://example.com")
result = tool._run()
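The two call styles above suggest that a URL can come either from the call or from initialization. The stand-in class below sketches one way such a fallback could be resolved. UrlResolver, its resolve method, the precedence of the per-call value, and the ValueError are all assumptions for illustration; in the tool itself the fixed schema exposes no URL argument to the agent once a URL is set at init.

```python
from typing import Any, Optional


class UrlResolver:
    """Hypothetical stand-in illustrating dynamic vs. pre-configured URLs."""

    def __init__(self, website_url: Optional[str] = None) -> None:
        self.website_url = website_url

    def resolve(self, **kwargs: Any) -> str:
        # Use the per-call URL if one was passed, else the init-time URL.
        url = kwargs.get("website_url", self.website_url)
        if url is None:
            # Assumed behaviour: fail loudly when no URL is available.
            raise ValueError("website_url must be given at init or per call")
        return url


dynamic = UrlResolver().resolve(website_url="https://example.com")
fixed = UrlResolver(website_url="https://example.com").resolve()
print(dynamic, fixed)
```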
