Implementation:CrewAIInc CrewAI RAG Docs Site Loader

From Leeroopedia
Revision as of 11:08, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/CrewAIInc_CrewAI_RAG_Docs_Site_Loader.md)
Knowledge Sources
Domains: RAG, Data_Loading, Web_Scraping
Last Updated: 2026-02-11 00:00 GMT

Overview

Extracts and structures content from documentation websites by parsing HTML to identify the main content area, headings, and navigation links.

Description

DocsSiteLoader extends BaseLoader to fetch and process documentation websites. It uses requests to fetch HTML content (with a 30-second timeout) and BeautifulSoup for parsing.

The extraction process follows a multi-step strategy:

1. Script/style removal: All script and style elements are decomposed from the DOM.
2. Title extraction: The page title is extracted from the <title> tag.
3. Main content detection: The loader tries multiple CSS selectors in order: main, article, [role="main"], .content, #content, .documentation. If none match, it falls back to the <body> element.
4. Table of contents generation: Headings (h1-h3, limited to 15) are extracted with indentation reflecting their hierarchy level.
5. Text extraction: All text from the main content area is extracted with newline separators and whitespace cleanup.
6. Navigation links: Links from nav, .sidebar, .toc, and .navigation elements are collected (up to 20), with relative URLs resolved to absolute using urljoin.
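
Steps 4 and 6 can be illustrated with a minimal sketch. It is not the loader's code: the loader uses BeautifulSoup, whereas this sketch swaps in the standard-library html.parser so it runs without extra dependencies. The names HeadingCollector, build_toc, and resolve_links are hypothetical, and the two-space indent per heading level is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h3 headings, up to a limit."""

    def __init__(self, limit: int = 15) -> None:
        super().__init__()
        self.limit = limit
        self.headings: list[tuple[int, str]] = []
        self._level: Optional[int] = None  # level of the heading being read
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only for h1-h3, and only while under the limit.
        if tag in ("h1", "h2", "h3") and len(self.headings) < self.limit:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


def build_toc(html: str, limit: int = 15) -> str:
    """Render headings as a bulleted outline, indented by heading level."""
    collector = HeadingCollector(limit)
    collector.feed(html)
    collector.close()
    return "\n".join(
        f"{'  ' * (level - 1)}- {text}" for level, text in collector.headings
    )


def resolve_links(base_url: str, hrefs: list[str], limit: int = 20) -> list[str]:
    """Resolve the first `limit` hrefs against the page URL with urljoin."""
    return [urljoin(base_url, href) for href in hrefs[:limit]]
```

Capping the heading and link counts keeps the structured output short even on pages with very long sidebars, which is presumably why the loader applies both limits.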

Content is truncated at 100KB to prevent excessive memory usage. Metadata includes the source URL, page title, and domain name.
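
The truncation and metadata steps might look roughly like the sketch below. The helper names are hypothetical, and two details are assumptions rather than facts from the source: that "100KB" is applied as a 100,000-character limit, and that a trailing "..." marks the cut.

```python
from urllib.parse import urlparse

# Assumption: the 100KB cap is treated as a character count.
MAX_CONTENT_CHARS = 100_000


def truncate_content(text: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Keep the first `limit` characters and append '...' if text was cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


def build_metadata(url: str, title: str) -> dict[str, str]:
    """Assemble the three metadata fields described above."""
    return {"source": url, "title": title, "domain": urlparse(url).netloc}
```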

Usage

Import DocsSiteLoader when you need to load content from documentation websites. It is typically instantiated automatically by the DataType.DOCS_SITE registry when URLs containing "docs" in the hostname or path are detected.
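
The detection rule can be sketched as follows. The registry's actual logic is not shown on this page, so looks_like_docs_site is a hypothetical helper that only restates the rule described above; restricting it to http and https URLs is an added assumption.

```python
from urllib.parse import urlparse


def looks_like_docs_site(url: str) -> bool:
    """Return True if an http(s) URL has 'docs' in its hostname or path."""
    parsed = urlparse(url)
    # Assumption: only web URLs are candidates for the docs-site loader.
    if parsed.scheme not in ("http", "https"):
        return False
    hostname = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    return "docs" in hostname or "docs" in path
```

A substring check like this is permissive: a path such as /mydocs-archive would also match, so explicit instantiation of DocsSiteLoader remains the more predictable option when the source type is known in advance.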

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py
  • Lines: 1-109

Signature

class DocsSiteLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader

I/O Contract

Inputs

Name      Type           Required  Description
source    SourceContent  Yes       Wraps a documentation site URL
**kwargs  Any            No        Additional keyword arguments (unused)

Outputs

Name    Type          Description
return  LoaderResult  Structured text with title, table of contents, main content, and navigation links; metadata includes source URL, title, and domain

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.docs_site_loader import DocsSiteLoader
from crewai_tools.rag.source_content import SourceContent

loader = DocsSiteLoader()

# load() fetches the page over HTTP, so this call needs network access.
source = SourceContent("https://docs.example.com/getting-started")
result = loader.load(source)

print(result.content)
# Title: Getting Started Guide
#
# Table of Contents:
# - Introduction
#   - Prerequisites
#   - Installation
# ...

print(result.metadata)
# {'source': 'https://docs.example.com/getting-started', 'title': 'Getting Started Guide', 'domain': 'docs.example.com'}
