Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:CrewAIInc CrewAI RAG PDF Loader

From Leeroopedia
Knowledge Sources
Domains RAG, Data_Loading
Last Updated 2026-02-11 00:00 GMT

Overview

Extracts text content from PDF files located on the local filesystem or accessible via HTTP/HTTPS URLs, annotating each page with its page number.

Description

PDFLoader extends BaseLoader to handle PDF document processing. It requires the pymupdf library, which is lazily imported at load time with a clear installation instruction if missing.

The loader supports two source types:

  • URLs: Downloads the PDF content as bytes using urllib.request.urlopen with a 30-second timeout, then opens with pymupdf.open(stream=pdf_bytes, filetype="pdf").
  • Local files: Opens directly with pymupdf.open(file_path) after verifying the file exists.

Text is extracted page by page using page.get_text(), with each page prefixed with "Page N:" for reference. Pages containing only whitespace are skipped. If no extractable text is found (e.g., scanned PDFs without OCR), a descriptive placeholder message is returned.

The returned LoaderResult includes metadata with the source path/URL, file name, file type ("pdf"), and the total num_pages count.

Usage

Import PDFLoader when you need to explicitly load PDF files. It is typically instantiated automatically by the DataType.PDF_FILE registry when .pdf files or PDF URLs are detected.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py
  • Lines: 1-113

Signature

class PDFLoader(BaseLoader):
    def load(self, source: SourceContent, **kwargs: Any) -> LoaderResult: ...

Import

from crewai_tools.rag.loaders.pdf_loader import PDFLoader

I/O Contract

Inputs

Name Type Required Description
source SourceContent Yes Wraps a PDF file path or HTTP/HTTPS URL
**kwargs Any No Additional keyword arguments (unused)

Outputs

Name Type Description
return LoaderResult Contains page-annotated text content; metadata includes source, file_name, file_type ("pdf"), and num_pages

Usage Examples

Basic Usage

from crewai_tools.rag.loaders.pdf_loader import PDFLoader
from crewai_tools.rag.source_content import SourceContent

loader = PDFLoader()

# Load from a local file
source = SourceContent("/path/to/document.pdf")
result = loader.load(source)
print(result.content)
# Page 1:
# Introduction to the topic...
#
# Page 2:
# More detailed content...

print(result.metadata)
# {'source': '/path/to/document.pdf', 'file_name': 'document.pdf', 'file_type': 'pdf', 'num_pages': 15}

# Load from a URL
source = SourceContent("https://example.com/papers/research.pdf")
result = loader.load(source)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment