Implementation:CrewAIInc CrewAI RAG Webpage Loader
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading, Web_Scraping |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Fetches web pages and extracts clean text content by removing HTML markup, scripts, styles, and normalizing whitespace.
Description
WebPageLoader extends BaseLoader to scrape and extract text from web pages. It issues HTTP GET requests through the requests library, sending browser-like User-Agent, Accept, and Accept-Language headers with a 15-second timeout. The response's apparent_encoding is used to decode the body so that non-ASCII characters are handled correctly.
HTML parsing is performed by BeautifulSoup with the "html.parser" backend. The extraction process:
1. All script and style elements are decomposed from the DOM.
2. Text is extracted with space separators using get_text(" ").
3. Two precompiled regex patterns normalize whitespace: _SPACES_PATTERN collapses multiple spaces/tabs, and _NEWLINE_PATTERN cleans up whitespace around newlines.
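The three extraction steps can be sketched with the standard library alone. This is a minimal illustration, not the loader's code: it swaps BeautifulSoup for Python's built-in html.parser.HTMLParser so it runs without extra dependencies, and the two regex patterns, the _TextExtractor class, and the extract_text function are assumptions made for the example. The real _SPACES_PATTERN and _NEWLINE_PATTERN may be defined differently.

```python
import re
from html.parser import HTMLParser

# Assumed patterns for illustration; the loader's own patterns may differ.
SPACES_PATTERN = re.compile(r"[ \t]+")       # runs of spaces/tabs
NEWLINE_PATTERN = re.compile(r"\s*\n\s*")    # whitespace around newlines


class _TextExtractor(HTMLParser):
    """Hypothetical stand-in for BeautifulSoup: collects text outside <script>/<style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a script or style element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(html: str) -> str:
    """Drop script/style content, join text with spaces, normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)           # step 2: space-separated text
    text = SPACES_PATTERN.sub(" ", text)     # step 3a: collapse spaces/tabs
    text = NEWLINE_PATTERN.sub("\n", text)   # step 3b: tidy around newlines
    return text.strip()


sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><p>Hello   world</p>\n\n<script>alert(1)</script><p>Bye</p></body></html>"
)
cleaned = extract_text(sample)
print(cleaned)
```

Collapsing spaces before tidying newlines keeps the second pattern simple, since by then any run of horizontal whitespace is already a single space.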
The returned LoaderResult includes the cleaned text and metadata with the URL, page title (from <title> tag), HTTP status_code, and content-type header.
Usage
Import WebPageLoader when you need to scrape web page content. It is typically instantiated automatically through the DataType.WEBSITE registry when a web URL is detected that is neither a documentation site nor a GitHub URL.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py
- Lines: 1-59
Signature
class WebPageLoader(BaseLoader):
    def load(self, source_content: SourceContent, **kwargs) -> LoaderResult: ...
Import
from crewai_tools.rag.loaders.webpage_loader import WebPageLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source_content | SourceContent | Yes | Wraps a web page URL |
| headers | dict | No | Custom HTTP headers (passed via kwargs; defaults to browser-like headers) |
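As a rough sketch of how the optional headers argument could be resolved from kwargs: the DEFAULT_HEADERS values, the REQUEST_TIMEOUT_SECONDS name, and the resolve_headers helper below are hypothetical and exist only for illustration. The sketch also assumes that caller-supplied headers replace the defaults wholesale rather than being merged with them, which the loader may or may not do.

```python
from typing import Any

# Hypothetical browser-like defaults; the loader's actual values may differ.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-client)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

# The 15-second timeout described above, under an assumed constant name.
REQUEST_TIMEOUT_SECONDS = 15


def resolve_headers(**kwargs: Any) -> dict[str, str]:
    """Return caller-supplied headers if given, otherwise a copy of the defaults."""
    headers = kwargs.get("headers")
    if headers is None:
        return dict(DEFAULT_HEADERS)  # copy so callers cannot mutate the defaults
    return dict(headers)


default_headers = resolve_headers()
custom_headers = resolve_headers(headers={"User-Agent": "my-crawler/1.0"})
print(default_headers["Accept-Language"])
print(custom_headers)
```

With this shape, a call such as loader.load(source, headers={...}) would send only the headers the caller provided.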
Outputs
| Name | Type | Description |
|---|---|---|
| return | LoaderResult | Contains cleaned text content; metadata includes url, title, status_code, and content_type |
Usage Examples
Basic Usage
from crewai_tools.rag.loaders.webpage_loader import WebPageLoader
from crewai_tools.rag.source_content import SourceContent
loader = WebPageLoader()
source = SourceContent("https://example.com/article")
result = loader.load(source)
print(result.content)
# Clean text extracted from the web page with HTML removed
print(result.metadata)
# {'url': 'https://example.com/article', 'title': 'Article Title', 'status_code': 200, 'content_type': 'text/html; charset=utf-8'}