Implementation:CrewAIInc CrewAI RAG PDF Loader

Knowledge Sources	CrewAI
Domains	RAG, Data_Loading
Last Updated	2026-02-11 00:00 GMT

Overview

Extracts text content from PDF files located on the local filesystem or accessible via HTTP/HTTPS URLs, annotating each page with its page number.

Description

PDFLoader extends BaseLoader to handle PDF document processing. It requires the pymupdf library, which is lazily imported at load time with a clear installation instruction if missing.

The loader supports two source types:

URLs: Downloads the PDF content as bytes using urllib.request.urlopen with a 30-second timeout, then opens with pymupdf.open(stream=pdf_bytes, filetype="pdf").
Local files: Opens directly with pymupdf.open(file_path) after verifying the file exists.

Text is extracted page by page using page.get_text(), with each page prefixed with "Page N:" for reference. Pages containing only whitespace are skipped. If no extractable text is found (e.g., scanned PDFs without OCR), a descriptive placeholder message is returned.

The returned LoaderResult includes metadata with the source path/URL, file name, file type ("pdf"), and the total num_pages count.

Usage

Import PDFLoader when you need to explicitly load PDF files. It is typically instantiated automatically by the DataType.PDF_FILE registry when .pdf files or PDF URLs are detected.

Code Reference

Source Location

Repository: CrewAI
File: lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py
Lines: 1-113

Signature

class PDFLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs: Any) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.pdf_loader import PDFLoader

I/O Contract

Inputs

Name	Type	Required	Description
source	SourceContent	Yes	Wraps a PDF file path or HTTP/HTTPS URL
**kwargs	Any	No	Additional keyword arguments (unused)

Outputs

Name	Type	Description
return	LoaderResult	Contains page-annotated text content; metadata includes source, file_name, file_type ("pdf"), and num_pages

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.pdf_loader import PDFLoader
from crewai_tools.rag.source_content import SourceContent

loader = PDFLoader()

# Load from a local file
source = SourceContent("/path/to/document.pdf")
result = loader.load(source)
print(result.content)
# Page 1:
# Introduction to the topic...
#
# Page 2:
# More detailed content...

print(result.metadata)
# {'source': '/path/to/document.pdf', 'file_name': 'document.pdf', 'file_type': 'pdf', 'num_pages': 15}

# Load from a URL
source = SourceContent("https://example.com/papers/research.pdf")
result = loader.load(source)

Related Pages

Principle:CrewAIInc_CrewAI_Knowledge_Ingestion

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment