Implementation:CrewAIInc CrewAI RAG DOCX Loader
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Loads and extracts text content from Microsoft Word DOCX files, supporting both local file paths and remote URLs.
Description
DOCXLoader extends BaseLoader to handle DOCX document processing. It requires the python-docx library, which is imported lazily at load time so that a clear error message can be raised if the library is not installed.
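The lazy-import pattern described here can be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: the helper name `require_module` and the wording of the error message are assumptions made for this example.

```python
import importlib
from types import ModuleType


def require_module(module_name: str, pip_name: str) -> ModuleType:
    """Import a module on demand, raising a descriptive error if it is missing.

    `require_module` is a hypothetical helper used only for illustration.
    """
    try:
        # The import happens when the function is called, not at module import time.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{pip_name}' is required for this loader. "
            f"Install it with: pip install {pip_name}"
        ) from exc
```

A loader written this way might call `require_module("docx", "python-docx")` at the top of its load method, since python-docx is installed under the package name `python-docx` but imported as `docx`.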
For URL sources, the loader downloads the DOCX file to a temporary file using requests with appropriate MIME type headers, processes it, and ensures cleanup of the temporary file via a try/finally block. For local files, it loads directly from the file path.
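The download-then-clean-up flow can be sketched as below. This is a hedged illustration under stated assumptions: the function names `process_remote_file` and `fetch_with_requests`, the `Accept` header value, and the timeout are all choices made for this example and may differ from what the loader actually sends.

```python
import os
import tempfile
from typing import Callable


def fetch_with_requests(url: str) -> bytes:
    """Download a file with requests (third-party, imported lazily here)."""
    import requests

    # Assumed header: the registered MIME type for .docx files.
    headers = {
        "Accept": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.content


def process_remote_file(
    url: str,
    fetch: Callable[[str], bytes],
    process: Callable[[str], str],
    suffix: str = ".docx",
) -> str:
    """Write downloaded bytes to a temp file, process it, and always delete it."""
    data = fetch(url)
    # delete=False lets the file be reopened by path after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return process(tmp_path)
    finally:
        # Runs whether process() returns or raises, so no temp file is left behind.
        os.unlink(tmp_path)
```

Passing `fetch` and `process` as parameters keeps the cleanup logic testable without network access; in real use, `fetch_with_requests` would be supplied as the `fetch` argument.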
The _load_from_file() method uses python-docx to open the document, iterates through all paragraphs, filters out empty ones, and joins them with newline separators. The resulting LoaderResult includes metadata about the format, total paragraph count, and table count in the document.
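The extraction step might look roughly like the sketch below. It uses the python-docx calls `Document(path)`, `.paragraphs`, `.tables`, and `paragraph.text`; the function names, the use of `strip()` to decide whether a paragraph is empty, and the choice to count all paragraphs (including empty ones) in the metadata are assumptions for illustration and may not match the loader exactly.

```python
from typing import Iterable


def join_nonempty(paragraph_texts: Iterable[str]) -> str:
    """Drop blank or whitespace-only paragraphs and join the rest with newlines."""
    return "\n".join(text for text in paragraph_texts if text.strip())


def extract_docx_text(path: str) -> tuple[str, dict]:
    """Return (content, metadata) for a DOCX file on disk."""
    # python-docx is imported lazily; it installs as `python-docx`, imports as `docx`.
    from docx import Document

    doc = Document(path)
    content = join_nonempty(p.text for p in doc.paragraphs)
    metadata = {
        "format": "docx",
        "paragraphs": len(doc.paragraphs),  # assumed: counts every paragraph
        "tables": len(doc.tables),
    }
    return content, metadata
```

Note that in this sketch only paragraph text is extracted; tables are counted in the metadata, but their cell contents are not included in the returned text.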
Usage
Import DOCXLoader when you need to explicitly load DOCX files. It is typically instantiated automatically by the DataType.DOCX registry when .docx files are detected.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py
- Lines: 1-86
Signature
class DOCXLoader(BaseLoader):
    def load(self, source_content: SourceContent, **kwargs) -> LoaderResult: ...
Import
from crewai_tools.rag.loaders.docx_loader import DOCXLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source_content | SourceContent | Yes | Wraps a DOCX file path or URL |
| **kwargs | Any | No | Additional keyword arguments; headers can be provided for URL downloads |
Outputs
| Name | Type | Description |
|---|---|---|
| return | LoaderResult | Contains the extracted paragraph text, a source reference, and metadata (format, paragraph count, table count) |
Usage Examples
Basic Usage
from crewai_tools.rag.loaders.docx_loader import DOCXLoader
from crewai_tools.rag.source_content import SourceContent
loader = DOCXLoader()
# Load from a local file
source = SourceContent("/path/to/report.docx")
result = loader.load(source)
print(result.content) # Extracted text from all paragraphs
print(result.metadata)
# {'format': 'docx', 'paragraphs': 42, 'tables': 3}
# Load from a URL
source = SourceContent("https://example.com/files/report.docx")
result = loader.load(source)