Implementation:CrewAIInc CrewAI RAG Docs Site Loader
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading, Web_Scraping |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Extracts and structures content from documentation websites by parsing HTML to locate the main content area, headings, and navigation links.
Description
DocsSiteLoader extends BaseLoader to fetch and process documentation websites. It uses requests to fetch HTML content (with a 30-second timeout) and BeautifulSoup for parsing.
The extraction process follows a multi-step strategy:
1. Script/style removal: All script and style elements are decomposed from the DOM.
2. Title extraction: The page title is extracted from the <title> tag.
3. Main content detection: The loader tries multiple CSS selectors in order: main, article, [role="main"], .content, #content, .documentation. If none match, it falls back to the <body> element.
4. Table of contents generation: Headings (h1-h3, limited to 15) are extracted with indentation reflecting their hierarchy level.
5. Text extraction: All text from the main content area is extracted with newline separators and whitespace cleanup.
6. Navigation links: Links from nav, .sidebar, .toc, and .navigation elements are collected (up to 20), with relative URLs resolved to absolute using urljoin.
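Steps 4 and 6 above can be sketched with the standard library alone. This is a minimal illustration, not the loader's code: it uses html.parser in place of BeautifulSoup, only looks at nav elements (not .sidebar, .toc, or .navigation), and the names OutlineParser and build_outline, as well as the two-space indent per heading level, are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin

MAX_HEADINGS = 15   # heading cap described in step 4
MAX_NAV_LINKS = 20  # navigation-link cap described in step 6


class OutlineParser(HTMLParser):
    """Hypothetical helper: collects h1-h3 headings and links inside <nav>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.headings: List[Tuple[int, str]] = []  # (level, text)
        self.nav_links: List[str] = []
        self._level: Optional[int] = None  # level of the heading being read
        self._buffer: List[str] = []
        self._nav_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._buffer = []
        elif tag == "nav":
            self._nav_depth += 1
        elif tag == "a" and self._nav_depth > 0:
            href = dict(attrs).get("href")
            if href and len(self.nav_links) < MAX_NAV_LINKS:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.nav_links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3") and self._level is not None:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text and len(self.headings) < MAX_HEADINGS:
                self.headings.append((self._level, text))
            self._level = None
        elif tag == "nav" and self._nav_depth > 0:
            self._nav_depth -= 1


def build_outline(html: str, base_url: str) -> Tuple[List[str], List[str]]:
    """Return (table-of-contents lines, absolute navigation URLs)."""
    parser = OutlineParser(base_url)
    parser.feed(html)
    parser.close()
    # Assumed format: two spaces of indent per heading level below h1
    toc = ["  " * (level - 1) + "- " + text for level, text in parser.headings]
    return toc, parser.nav_links
```

Calling build_outline(page_html, "https://docs.example.com/start") on a page with a nav block returns the indented heading lines and the resolved link list; headings below h3 and links outside nav are ignored in this sketch.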
Content is truncated at 100KB to prevent excessive memory usage. Metadata includes the source URL, page title, and domain name.
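The truncation and metadata steps can be sketched as follows. This is an illustration under stated assumptions: the 100KB limit is treated as 100,000 characters, the real loader may append a marker after truncating, and truncate_content and build_metadata are hypothetical names. The metadata keys follow the example output shown under Usage Examples.

```python
from urllib.parse import urlparse

MAX_CONTENT_CHARS = 100_000  # assumption: "100KB" treated as 100,000 characters


def truncate_content(text: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Cap the extracted text so very large pages stay bounded in memory."""
    return text if len(text) <= limit else text[:limit]


def build_metadata(url: str, title: str) -> dict:
    """Assemble the source URL, page title, and domain name."""
    return {
        "source": url,
        "title": title,
        "domain": urlparse(url).netloc,  # e.g. "docs.example.com"
    }
```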
Usage
Import DocsSiteLoader when you need to load content from documentation websites. It is typically instantiated automatically by the DataType.DOCS_SITE registry when URLs containing "docs" in the hostname or path are detected.
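The detection rule described above might look roughly like the check below. The function name looks_like_docs_site is hypothetical, and the actual registry logic in CrewAI may differ (for example in case handling or in which other URL types take precedence); this only illustrates the "docs in hostname or path" condition.

```python
from urllib.parse import urlparse


def looks_like_docs_site(url: str) -> bool:
    """Hypothetical check: True if "docs" appears in the hostname or path."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""  # hostname is None for malformed URLs
    return "docs" in hostname or "docs" in parsed.path.lower()
```

Under this sketch, both https://docs.example.com/getting-started and https://example.com/docs/api would be routed to the loader, while https://example.com/blog would not.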
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py
- Lines: 1-109
Signature
class DocsSiteLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs) -> LoaderResult: ...
Import
from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source | SourceContent | Yes | Wraps a documentation site URL |
| **kwargs | Any | No | Additional keyword arguments (unused) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | LoaderResult | Contains structured text with title, table of contents, main content, and navigation links; metadata includes source URL, title, and domain |
Usage Examples
Basic Usage
from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader
from crewai_tools.rag.source_content import SourceContent
loader = DocsSiteLoader()
source = SourceContent("https://docs.example.com/getting-started")
result = loader.load(source)
print(result.content)
# Title: Getting Started Guide
#
# Table of Contents:
# - Introduction
# - Prerequisites
# - Installation
# ...
print(result.metadata)
# {'source': 'https://docs.example.com/getting-started', 'title': 'Getting Started Guide', 'domain': 'docs.example.com'}