Implementation:CrewAIInc CrewAI RAG Webpage Loader

Knowledge Sources	CrewAI
Domains	RAG, Data_Loading, Web_Scraping
Last Updated	2026-02-11 00:00 GMT

Overview

Fetches web pages and extracts clean text content by removing HTML markup, scripts, styles, and normalizing whitespace.

Description

WebPageLoader extends BaseLoader to scrape and extract text from web pages. It uses requests with browser-like User-Agent, Accept, and Accept-Language headers to make HTTP GET requests with a 15-second timeout. The apparent_encoding of the response is used for proper character handling.

HTML parsing is performed by BeautifulSoup with the "html.parser" backend. The extraction process: 1. All script and style elements are decomposed from the DOM. 2. Text is extracted with space separators using get_text(" "). 3. Two precompiled regex patterns normalize whitespace: _SPACES_PATTERN collapses multiple spaces/tabs, and _NEWLINE_PATTERN cleans up whitespace around newlines.

The returned LoaderResult includes the cleaned text and metadata with the URL, page title (from <title> tag), HTTP status_code, and content-type header.

Usage

Import WebPageLoader when you need to scrape web page content. It is typically instantiated automatically by the DataType.WEBSITE registry when non-docs, non-GitHub web URLs are detected.

Code Reference

Source Location

Repository: CrewAI
File: lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py
Lines: 1-59

Signature

class WebPageLoader(BaseLoader):
    def load(self, source_content: SourceContent, **kwargs) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.webpage_loader import WebPageLoader

I/O Contract

Inputs

Name	Type	Required	Description
source_content	SourceContent	Yes	Wraps a web page URL
headers	dict	No	Custom HTTP headers (passed via kwargs; defaults to browser-like headers)

Outputs

Name	Type	Description
return	LoaderResult	Contains cleaned text content; metadata includes url, title, status_code, and content_type

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.webpage_loader import WebPageLoader
from crewai_tools.rag.source_content import SourceContent

loader = WebPageLoader()

source = SourceContent("https://example.com/article")
result = loader.load(source)

print(result.content)
# Clean text extracted from the web page with HTML removed

print(result.metadata)
# {'url': 'https://example.com/article', 'title': 'Article Title', 'status_code': 200, 'content_type': 'text/html; charset=utf-8'}

Related Pages

Principle:CrewAIInc_CrewAI_Knowledge_Ingestion

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment