
Implementation:CrewAIInc CrewAI RAG Webpage Loader

From Leeroopedia
Knowledge Sources
Domains RAG, Data_Loading, Web_Scraping
Last Updated 2026-02-11 00:00 GMT

Overview

Fetches web pages and extracts clean text content by removing HTML markup, scripts, styles, and normalizing whitespace.

Description

WebPageLoader extends BaseLoader to fetch web pages and extract their text. It issues an HTTP GET through requests with browser-like User-Agent, Accept, and Accept-Language headers and a 15-second timeout. The response's apparent_encoding is used when decoding the body so that non-ASCII characters are handled correctly.
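The request step described above can be sketched roughly as follows. This is a minimal illustration, not the loader's actual source: the header values, the helper names (resolve_headers, decode_body, fetch_html), and the choice to replace rather than merge the defaults when custom headers are supplied are all assumptions. Only the three header names, the 15-second timeout, and the use of apparent_encoding come from the description.

```python
from typing import Mapping, Optional

# Assumed browser-like defaults; the exact values WebPageLoader sends are not given here.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
TIMEOUT_SECONDS = 15  # timeout stated in the description


def resolve_headers(custom: Optional[Mapping[str, str]] = None) -> dict:
    """Return the caller's headers if given, otherwise a copy of the defaults (assumed behaviour)."""
    return dict(custom) if custom else dict(DEFAULT_HEADERS)


def decode_body(response) -> str:
    """Decode the body using the encoding detected from the content itself."""
    response.encoding = response.apparent_encoding
    return response.text


def fetch_html(url: str, headers: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: GET the page and return its decoded HTML."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(url, headers=resolve_headers(headers), timeout=TIMEOUT_SECONDS)
    return decode_body(response)
```

Setting response.encoding from apparent_encoding matters mostly for pages that omit or misreport their charset, where requests would otherwise fall back to a default encoding for text content.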

HTML parsing is performed by BeautifulSoup with the "html.parser" backend. The extraction process:

  1. All script and style elements are decomposed (removed) from the DOM.
  2. Text is extracted with space separators using get_text(" ").
  3. Two precompiled regex patterns normalize whitespace: _SPACES_PATTERN collapses multiple spaces/tabs, and _NEWLINE_PATTERN cleans up whitespace around newlines.
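The extraction steps above can be sketched as below. The two pattern names come from the description, but the regular expressions themselves, the function name extract_text, and the final strip() are assumptions chosen to match the stated behaviour, and may differ from the loader's source.

```python
import re

from bs4 import BeautifulSoup

# Assumed definitions: the description names these patterns but does not show them.
_SPACES_PATTERN = re.compile(r"[ \t]+")       # runs of spaces/tabs
_NEWLINE_PATTERN = re.compile(r"\s*\n\s*")    # whitespace surrounding newlines


def extract_text(html: str) -> str:
    """Hypothetical helper: strip markup and normalize whitespace."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Step 2: join the remaining text nodes with single spaces.
    text = soup.get_text(" ")

    # Step 3: collapse horizontal whitespace, then tidy up around line breaks.
    text = _SPACES_PATTERN.sub(" ", text)
    text = _NEWLINE_PATTERN.sub("\n", text)
    return text.strip()
```

Decomposing script and style before calling get_text keeps JavaScript and CSS source out of the extracted text, which would otherwise add noise to downstream chunking and embedding.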

The returned LoaderResult includes the cleaned text and metadata with the URL, page title (from <title> tag), HTTP status_code, and content-type header.
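As a rough illustration of how the metadata fields could be assembled, the sketch below builds a plain dict with the four keys named above. The function name build_metadata, the empty-string fallbacks for a missing title or header, and the use of a plain dict instead of LoaderResult are assumptions made for the example.

```python
from typing import Any, Dict, Mapping

from bs4 import BeautifulSoup


def build_metadata(
    url: str, html: str, status_code: int, headers: Mapping[str, str]
) -> Dict[str, Any]:
    """Hypothetical helper: collect url, title, status_code and content_type."""
    soup = BeautifulSoup(html, "html.parser")

    # soup.title is None when the page has no <title>; fall back to "" (assumed default).
    title = soup.title.string if soup.title and soup.title.string else ""

    return {
        "url": url,
        "title": title.strip(),
        "status_code": status_code,
        # requests exposes headers case-insensitively; a plain dict needs the exact key.
        "content_type": headers.get("content-type", ""),
    }
```

With a real requests response, the last two arguments would be response.status_code and response.headers.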

Usage

Import WebPageLoader when you need to scrape web page content. It is typically instantiated automatically by the DataType.WEBSITE registry when non-docs, non-GitHub web URLs are detected.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py
  • Lines: 1-59

Signature

class WebPageLoader(BaseLoader):
    def load(self, source_content: SourceContent, **kwargs) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.webpage_loader import WebPageLoader

I/O Contract

Inputs

Name            Type           Required  Description
source_content  SourceContent  Yes       Wraps a web page URL
headers         dict           No        Custom HTTP headers (passed via kwargs; defaults to browser-like headers)

Outputs

Name    Type          Description
return  LoaderResult  Contains cleaned text content; metadata includes url, title, status_code, and content_type

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.webpage_loader import WebPageLoader
from crewai_tools.rag.source_content import SourceContent

loader = WebPageLoader()

source = SourceContent("https://example.com/article")
result = loader.load(source)

print(result.content)
# Clean text extracted from the web page with HTML removed

print(result.metadata)
# {'url': 'https://example.com/article', 'title': 'Article Title', 'status_code': 200, 'content_type': 'text/html; charset=utf-8'}
