Implementation:CrewAIInc CrewAI RAG PDF Loader
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Extracts text content from PDF files located on the local filesystem or accessible via HTTP/HTTPS URLs, annotating each page with its page number.
Description
PDFLoader extends BaseLoader to handle PDF document processing. It requires the pymupdf library, which is imported lazily at load time; if the library is missing, the resulting error includes a clear installation instruction.
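The lazy-import pattern described here can be sketched as follows. This is a minimal illustration, not the loader's actual code: the helper name require_module and the wording of the error message are assumptions made for the example.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import `name` on first use, or fail with an actionable message.

    Hypothetical helper: the real loader may inline this logic and
    word its error message differently.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with an installation hint so the user knows how to fix it.
        raise ImportError(
            f"PDF support requires {name}. Install it with: {install_hint}"
        ) from exc
```

A caller would write something like pymupdf = require_module("pymupdf", "pip install pymupdf") at the start of load(), so that importing the loader module itself never fails when pymupdf is absent.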
The loader supports two source types:
- URLs: Downloads the PDF content as bytes using urllib.request.urlopen with a 30-second timeout, then opens it with pymupdf.open(stream=pdf_bytes, filetype="pdf").
- Local files: Opens directly with pymupdf.open(file_path) after verifying the file exists.
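The two branches above might look roughly like the sketch below. The function names is_url and open_pdf, the URL-detection rule, and the error message are assumptions for illustration; only the urlopen timeout and the two pymupdf.open call forms come from the description.

```python
import os
import urllib.request
from urllib.parse import urlparse


def is_url(path: str) -> bool:
    """Return True for http(s) URLs that include a host (assumed rule)."""
    parsed = urlparse(path)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def open_pdf(path: str):
    """Open a PDF from a URL or a local path and return a pymupdf document."""
    remote = is_url(path)
    if not remote and not os.path.isfile(path):
        # Fail early for local paths, before touching pymupdf.
        raise FileNotFoundError(f"PDF file not found: {path}")

    import pymupdf  # imported lazily, only when a PDF is actually opened

    if remote:
        # Download the whole file into memory; give up after 30 seconds.
        with urllib.request.urlopen(path, timeout=30) as response:
            pdf_bytes = response.read()
        return pymupdf.open(stream=pdf_bytes, filetype="pdf")
    return pymupdf.open(path)
```

Passing filetype="pdf" in the URL branch matters because an in-memory byte stream carries no file extension from which the format could be inferred.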
Text is extracted page by page using page.get_text(), and each page's text is prefixed with "Page N:" (where N is the 1-based page number) for reference. Pages containing only whitespace are skipped. If no extractable text is found (e.g., scanned PDFs without OCR), a descriptive placeholder message is returned.
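The annotation step can be illustrated without pymupdf by operating on a plain list of page strings. In this sketch the function name annotate_pages, the blank-line separator between pages, the trimming of each page's text, and the placeholder wording are all assumptions; the loader's actual output may differ in those details.

```python
from typing import Iterable


def annotate_pages(page_texts: Iterable[str]) -> str:
    """Join page texts, prefixing each non-blank page with its 1-based number."""
    parts = []
    for page_num, text in enumerate(page_texts, start=1):
        if text.strip():  # skip pages that hold only whitespace
            parts.append(f"Page {page_num}:\n{text.strip()}")
    if not parts:
        # Hypothetical placeholder wording for PDFs with no text layer.
        return "[No extractable text found in PDF]"
    return "\n\n".join(parts)


# A blank second page is skipped, and page 3 keeps its original number.
print(annotate_pages(["Intro\n", "  \n", "More"]))
```

In this sketch, skipped pages do not shift the numbering, so the "Page N:" labels still match the page numbers of the original PDF.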
The returned LoaderResult includes metadata with the source path/URL, file name, file type ("pdf"), and the total num_pages count.
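As a rough sketch of how such a metadata dictionary could be assembled: the function name build_metadata and the way file_name is derived from a URL (the last segment of the URL path, ignoring any query string) are assumptions for illustration, while the four keys are the ones listed above.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def build_metadata(source: str, num_pages: int) -> dict:
    """Assemble the metadata keys described above for a path or URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumed rule: use the last segment of the URL path as the name.
        file_name = PurePosixPath(parsed.path).name
    else:
        file_name = Path(source).name
    return {
        "source": source,
        "file_name": file_name,
        "file_type": "pdf",
        "num_pages": num_pages,
    }


print(build_metadata("/path/to/document.pdf", 15))
```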
Usage
Import PDFLoader when you need to explicitly load PDF files. It is typically instantiated automatically by the DataType.PDF_FILE registry when .pdf files or PDF URLs are detected.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py
- Lines: 1-113
Signature
class PDFLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs: Any) -> LoaderResult: ...
Import
from crewai_tools.rag.loaders.pdf_loader import PDFLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source | SourceContent | Yes | Wraps a PDF file path or HTTP/HTTPS URL |
| **kwargs | Any | No | Additional keyword arguments (unused) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | LoaderResult | Contains page-annotated text content; metadata includes source, file_name, file_type ("pdf"), and num_pages |
Usage Examples
Basic Usage
from crewai_tools.rag.loaders.pdf_loader import PDFLoader
from crewai_tools.rag.source_content import SourceContent
loader = PDFLoader()
# Load from a local file
source = SourceContent("/path/to/document.pdf")
result = loader.load(source)
print(result.content)
# Page 1:
# Introduction to the topic...
#
# Page 2:
# More detailed content...
print(result.metadata)
# {'source': '/path/to/document.pdf', 'file_name': 'document.pdf', 'file_type': 'pdf', 'num_pages': 15}
# Load from a URL
source = SourceContent("https://example.com/papers/research.pdf")
result = loader.load(source)