Implementation:CrewAIInc CrewAI RAG DOCX Loader

From Leeroopedia
Revision as of 11:08, 16 February 2026 by Admin (auto-imported from implementations/CrewAIInc_CrewAI_RAG_DOCX_Loader.md)
Knowledge Sources
Domains: RAG, Data_Loading
Last Updated: 2026-02-11 00:00 GMT

Overview

Loads and extracts text content from Microsoft Word DOCX files, supporting both local file paths and remote URLs.

Description

DOCXLoader extends BaseLoader to handle DOCX document processing. It requires the python-docx library, which is imported lazily at load time so that a missing installation produces a clear error message.
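
The lazy-import pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the loader's actual code: the helper name `require_module` and the wording of the error message are hypothetical.

```python
import importlib
from types import ModuleType


def require_module(module_name: str, pip_name: str) -> ModuleType:
    """Import a module on first use; raise a descriptive error if it is missing.

    Hypothetical helper: the name and the message text are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint instead of a bare "No module named ..."
        raise ImportError(
            f"'{pip_name}' is required for this loader. "
            f"Install it with: pip install {pip_name}"
        ) from exc
```

A loader written this way would call something like `require_module("docx", "python-docx")` inside its load path rather than at module import time, so that importing the loader module never fails just because the optional dependency is absent.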

For URL sources, the loader downloads the DOCX file to a temporary file using requests with appropriate MIME type headers, processes it, and ensures cleanup of the temporary file via a try/finally block. For local files, it loads directly from the file path.
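
The download-then-clean-up flow for URL sources might look like the sketch below. It is not the loader's real implementation: the function names, the injected `fetch` and `process` callables, the `Accept` header value, and the 30-second timeout are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Callable

# Standard MIME type for DOCX files; using it as an Accept header is an assumption.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def fetch_with_requests(url: str) -> bytes:
    """Hypothetical downloader built on requests (imported lazily)."""
    import requests

    response = requests.get(url, headers={"Accept": DOCX_MIME}, timeout=30)
    response.raise_for_status()
    return response.content


def process_remote_docx(
    url: str,
    fetch: Callable[[str], bytes],
    process: Callable[[str], str],
) -> str:
    """Download to a temporary file, process it, and always delete the file."""
    data = fetch(url)
    # delete=False so the file can be reopened by path after this handle closes
    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return process(tmp_path)
    finally:
        # Runs whether process() returns or raises, so no temp file is left behind
        os.unlink(tmp_path)
```

Passing the downloader and the file processor as arguments keeps the cleanup logic testable without network access; in practice one would pass `fetch_with_requests` and a function that parses the DOCX file at the given path.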

The _load_from_file() method uses python-docx to open the document, iterates through all paragraphs, filters out empty ones, and joins them with newline separators. The resulting LoaderResult includes metadata about the format, total paragraph count, and table count in the document.
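
As a rough sketch of the extraction step, the function below joins non-empty paragraphs and builds the metadata dictionary. The function name `extract_docx_text` is hypothetical, and two details are assumptions: that "empty" means whitespace-only, and that the paragraph count includes empty paragraphs. The stand-in document object lets the sketch run without python-docx; with the library installed, the object returned by `docx.Document(path)` exposes the same `.paragraphs` and `.tables` attributes.

```python
from types import SimpleNamespace
from typing import Any


def extract_docx_text(doc: Any) -> tuple[str, dict[str, Any]]:
    """Join non-empty paragraphs with newlines and collect basic metadata.

    `doc` is any object exposing `.paragraphs` (items with a `.text`
    attribute) and `.tables`.
    """
    # Assumption: a paragraph counts as empty if it contains only whitespace
    texts = [p.text for p in doc.paragraphs if p.text.strip()]
    metadata = {
        "format": "docx",
        "paragraphs": len(doc.paragraphs),  # assumption: total, including empty ones
        "tables": len(doc.tables),
    }
    return "\n".join(texts), metadata


# Stand-in document so the sketch runs without python-docx installed.
fake_doc = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="Title"),
        SimpleNamespace(text="   "),
        SimpleNamespace(text="Body text."),
    ],
    tables=[object()],
)

content, metadata = extract_docx_text(fake_doc)
print(content)
print(metadata)
```

Note that only paragraph text is joined into the content; per the description above, tables contribute a count to the metadata but not text.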

Usage

Import DOCXLoader when you need to explicitly load DOCX files. It is typically instantiated automatically by the DataType.DOCX registry when .docx files are detected.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py
  • Lines: 1-86

Signature

class DOCXLoader(BaseLoader):
    def load(self, source_content: SourceContent, **kwargs) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.docx_loader import DOCXLoader

I/O Contract

Inputs

Name | Type | Required | Description
source_content | SourceContent | Yes | Wraps a DOCX file path or URL
**kwargs | Any | No | Additional keyword arguments; headers can be provided for URL downloads

Outputs

Name | Type | Description
return | LoaderResult | Contains extracted paragraph text, source reference, and metadata (format, paragraph count, table count)

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.docx_loader import DOCXLoader
from crewai_tools.rag.source_content import SourceContent

loader = DOCXLoader()

# Load from a local file
source = SourceContent("/path/to/report.docx")
result = loader.load(source)
print(result.content)       # Extracted text from all paragraphs
print(result.metadata)
# {'format': 'docx', 'paragraphs': 42, 'tables': 3}

# Load from a URL
source = SourceContent("https://example.com/files/report.docx")
result = loader.load(source)
