Implementation:CrewAIInc CrewAI Scrape Website Tool
| Knowledge Sources | |
|---|---|
| Domains | Tools, Web_Scraping |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
ScrapeWebsiteTool fetches and extracts clean text content from web pages using HTTP requests and BeautifulSoup.
Description
ScrapeWebsiteTool extends BaseTool and follows the dual-schema pattern: ScrapeWebsiteToolSchema requires a website_url argument, while FixedScrapeWebsiteToolSchema is empty and is used when the URL is pre-configured. The tool sets browser-like default headers to reduce the chance of requests being blocked.
On initialization, the tool checks that BeautifulSoup is available and raises an ImportError with install instructions if it is missing. If a website_url is provided at init time, the tool is locked to that URL and switches to the fixed schema. Cookie values are read from environment variables.
The _run() method performs an HTTP GET request with a 15-second timeout, sets the response encoding from apparent_encoding so that characters are decoded correctly, parses the HTML with BeautifulSoup, and extracts all text using get_text(" "). It then applies two regex substitutions to clean up whitespace: one collapses runs of spaces and tabs, and the other condenses sections dominated by blank lines.
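The two whitespace substitutions described above can be sketched as follows. This is a minimal illustration, not the tool's verbatim code: the function name `clean_whitespace` is hypothetical, and the two patterns are assumptions chosen to match the described behavior (collapse runs of spaces and tabs, then condense whitespace runs that span blank lines).

```python
import re


def clean_whitespace(raw: str) -> str:
    """Normalize whitespace in text extracted from HTML (illustrative sketch)."""
    # Step 1: collapse any run of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", raw)
    # Step 2: condense a whitespace run that contains a newline and has
    # whitespace on both sides of it into a single newline (assumed pattern).
    text = re.sub(r"\s+\n\s+", "\n", text)
    return text


if __name__ == "__main__":
    sample = "Hello    \t world\n\n\n   Next  line"
    print(repr(clean_whitespace(sample)))
```

Running the two steps in this order matters in the sketch: collapsing spaces first leaves fewer whitespace variants for the second pattern to handle.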
Usage
Use this tool as the primary lightweight web scraping option when agents need to read website content without a browser engine. It is simpler than the Selenium, Scrapfly, or Scrapegraph alternatives.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py
- Lines: 1-89
Signature
class FixedScrapeWebsiteToolSchema(BaseModel):
    pass

class ScrapeWebsiteToolSchema(FixedScrapeWebsiteToolSchema):
    website_url: str = Field(..., description="Mandatory website url to read the file")

class ScrapeWebsiteTool(BaseTool):
    name: str = "Read website content"
    description: str = "A tool that can be used to read a website content."
    args_schema: type[BaseModel] = ScrapeWebsiteToolSchema
    website_url: str | None = None
    cookies: dict | None = None
    headers: dict | None  # default browser-like headers

    def __init__(self, website_url=None, cookies=None, **kwargs)
    def _run(self, **kwargs) -> Any
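One way the cookies argument could be resolved from environment variables is sketched below. The helper name `resolve_cookies` and the cookie spec shape (a dict with a "name" key and a "value" key that names an environment variable) are assumptions made for illustration; check the source file for the exact format the tool expects.

```python
from __future__ import annotations

import os


def resolve_cookies(cookies: dict | None) -> dict | None:
    """Build a cookie mapping whose value comes from the environment.

    Assumed input shape: {"name": <cookie name>, "value": <env var name>}.
    """
    if cookies is None:
        return None
    # os.getenv returns None when the variable is unset, so a missing
    # variable yields a cookie with a None value rather than an error.
    return {cookies["name"]: os.getenv(cookies["value"])}


if __name__ == "__main__":
    # DEMO_SESSION_TOKEN is a made-up variable name used only for this demo.
    os.environ["DEMO_SESSION_TOKEN"] = "abc123"
    print(resolve_cookies({"name": "session", "value": "DEMO_SESSION_TOKEN"}))
```

Keeping the secret in the environment means the cookie value never has to appear in the code that configures the tool.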
Import
from crewai_tools import ScrapeWebsiteTool
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| website_url | str | Yes, unless set at init | URL of the website to scrape; when a URL is passed to the constructor, the fixed schema applies and no argument is needed |
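The "required unless set at init" rule can be illustrated with a small stand-alone sketch. The function `resolve_website_url` and its error message are hypothetical and only model the lookup order (call-time argument first, then the pre-configured URL); they are not the tool's actual implementation.

```python
from __future__ import annotations


def resolve_website_url(fixed_url: str | None, **kwargs) -> str:
    """Pick the URL to scrape: a call-time value, else the init-time value."""
    # Fall back to the pre-configured URL when no argument is supplied.
    url = kwargs.get("website_url", fixed_url)
    if url is None:
        # Neither source provided a URL, so there is nothing to fetch.
        raise ValueError("website_url must be set at init or passed at call time")
    return url


if __name__ == "__main__":
    print(resolve_website_url("https://example.com"))
    print(resolve_website_url(None, website_url="https://example.org"))
```

In the real tool, the empty fixed schema is what stops an agent from supplying a different URL once one has been pre-configured; this sketch does not enforce that restriction.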
Outputs
| Name | Type | Description |
|---|---|---|
| _run() returns | str | Cleaned text content of the website prefixed with "The following text is scraped website content:\n\n" |
Usage Examples
Basic Usage
from crewai_tools import ScrapeWebsiteTool
# Dynamic URL
tool = ScrapeWebsiteTool()
result = tool._run(website_url="https://example.com")
# Pre-configured URL
tool = ScrapeWebsiteTool(website_url="https://example.com")
result = tool._run()