Implementation:CrewAIInc CrewAI RAG Docs Site Loader

Knowledge Sources	CrewAI
Domains	RAG, Data_Loading, Web_Scraping
Last Updated	2026-02-11 00:00 GMT

Overview

Extracts and structures content from documentation websites by parsing the HTML to identify the main content area, headings, and navigation links.

Description

DocsSiteLoader extends BaseLoader to fetch and process documentation websites. It uses requests to fetch HTML content (with a 30-second timeout) and BeautifulSoup for parsing.

The extraction process follows a multi-step strategy:

1. Script/style removal: All script and style elements are decomposed from the DOM.
2. Title extraction: The page title is extracted from the <title> tag.
3. Main content detection: The loader tries multiple CSS selectors in order: main, article, [role="main"], .content, #content, .documentation. If none match, it falls back to the <body> element.
4. Table of contents generation: Headings (h1-h3, limited to 15) are extracted with indentation reflecting their hierarchy level.
5. Text extraction: All text from the main content area is extracted with newline separators and whitespace cleanup.
6. Navigation links: Links from nav, .sidebar, .toc, and .navigation elements are collected (up to 20), with relative URLs resolved to absolute using urljoin.
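
Steps 4 and 6 can be illustrated with a minimal, standard-library sketch. It is not the loader's code: the loader uses BeautifulSoup, whereas this sketch swaps in html.parser so it runs without extra dependencies. The names HeadingCollector and build_toc are hypothetical, and the two-space indent per heading level is an assumption based on the sample output on this page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h3 headings, up to a fixed limit."""

    def __init__(self, limit=15):
        super().__init__()
        self.limit = limit
        self.headings = []   # collected (level, text) pairs
        self._level = None   # level of the heading currently being read
        self._parts = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # Start recording only for h1-h3 and only while under the limit.
        if tag in ("h1", "h2", "h3") and len(self.headings) < self.limit:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse internal whitespace, mirroring the cleanup step.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


def build_toc(html, limit=15):
    """Render headings as an indented list (assumed: two spaces per level)."""
    collector = HeadingCollector(limit)
    collector.feed(html)
    return "\n".join(
        f"{'  ' * (level - 1)}- {text}" for level, text in collector.headings
    )


html = "<main><h1>Introduction</h1><h2>Prerequisites</h2><h2>Installation</h2></main>"
toc = build_toc(html)

# Step 6: a relative navigation href resolved against the page URL.
nav_link = urljoin("https://docs.example.com/getting-started", "guides/install")
```

Here toc holds three lines, with the two h2 entries indented under the h1, and nav_link resolves to https://docs.example.com/guides/install.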

Content is truncated at 100KB to prevent excessive memory usage. Metadata includes the source URL, page title, and domain name.
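
A rough sketch of the truncation and metadata steps, using only the standard library. The helper names are hypothetical, and reading "100KB" as a limit of 100,000 characters is an assumption; the loader may measure the limit differently or append a marker to truncated text.

```python
from urllib.parse import urlparse

# Assumption: "100KB" is treated here as a character count.
MAX_CONTENT_CHARS = 100_000


def truncate_content(text, limit=MAX_CONTENT_CHARS):
    """Return text unchanged if it fits, otherwise cut it to `limit` characters."""
    return text if len(text) <= limit else text[:limit]


def build_metadata(url, title):
    """Assemble the three metadata fields: source URL, page title, domain."""
    return {"source": url, "title": title, "domain": urlparse(url).netloc}


metadata = build_metadata(
    "https://docs.example.com/getting-started", "Getting Started Guide"
)
```

The resulting metadata dictionary has the same shape as the one printed in the usage example on this page.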

Usage

Import DocsSiteLoader when you need to load content from documentation websites. It is typically instantiated automatically by the DataType.DOCS_SITE registry when URLs containing "docs" in the hostname or path are detected.
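
The detection rule described above might look roughly like the following. This is a sketch only: looks_like_docs_site is a hypothetical helper, not a function in crewai_tools, and the registry's actual check may apply additional or different conditions.

```python
from urllib.parse import urlparse


def looks_like_docs_site(url):
    """Return True when "docs" appears in the URL's hostname or path."""
    parsed = urlparse(url)
    return "docs" in parsed.netloc.lower() or "docs" in parsed.path.lower()


examples = [
    "https://docs.example.com/getting-started",  # "docs" in the hostname
    "https://example.com/docs/intro",            # "docs" in the path
    "https://example.com/blog/post",             # neither
]
flags = [looks_like_docs_site(url) for url in examples]
```

Because this is a plain substring test, it would also match unrelated URLs that happen to contain "docs", which is one reason to treat the heuristic as a default rather than a guarantee.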

Code Reference

Source Location

Repository: CrewAI
File: lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py
Lines: 1-109

Signature

class DocsSiteLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader

I/O Contract

Inputs

Name	Type	Required	Description
source	SourceContent	Yes	Wraps a documentation site URL
**kwargs	Any	No	Additional keyword arguments (unused)

Outputs

Name	Type	Description
return	LoaderResult	Contains structured text with title, table of contents, main content, and navigation links; metadata includes source URL, title, and domain

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader
from crewai_tools.rag.source_content import SourceContent

loader = DocsSiteLoader()

source = SourceContent("https://docs.example.com/getting-started")
result = loader.load(source)

print(result.content)
# Title: Getting Started Guide
#
# Table of Contents:
# - Introduction
#   - Prerequisites
#   - Installation
# ...

print(result.metadata)
# {'source': 'https://docs.example.com/getting-started', 'title': 'Getting Started Guide', 'domain': 'docs.example.com'}

Related Pages

Principle:CrewAIInc_CrewAI_Knowledge_Ingestion
