Implementation:CrewAIInc CrewAI Knowledge Source Classes

Metadata

Field	Value
Implementation Name	Knowledge Source Classes
Workflow	Knowledge_RAG_Pipeline
Category	Data Ingestion
Repository	crewAIInc/crewAI
Implements	Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection

Overview

The CrewAI knowledge subsystem provides concrete, format-specific knowledge source classes for loading and chunking documents. Each class inherits from BaseFileKnowledgeSource and implements its own format-specific parsing, while all of them share a common chunking and ingestion interface.

Source References

Class	Source File	Lines
PDFKnowledgeSource	src/crewai/knowledge/source/pdf_knowledge_source.py	L7-60
TextFileKnowledgeSource	src/crewai/knowledge/source/text_file_knowledge_source.py	L6-40
CSVKnowledgeSource	src/crewai/knowledge/source/csv_knowledge_source.py	L7-48

Signatures

class PDFKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for PDF documents."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...

class TextFileKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for plain text files."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...

class CSVKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for CSV files."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...

BaseFileKnowledgeSource fields:

class BaseFileKnowledgeSource(BaseKnowledgeSource):
    file_paths: list[Path | str]
    chunk_size: int = 4000
    chunk_overlap: int = 200
    content: dict[Path, str] = {}

Import

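from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource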

I/O Contract

Direction	Type	Description
Input	`file_paths: list[Path | str]`	List of file paths to load and parse
Input	`chunk_size: int`	Maximum characters per chunk (default: 4000)
Input	`chunk_overlap: int`	Overlapping characters between chunks (default: 200)
Output	Knowledge source instance	Object with loaded and chunked content, ready for embedding and storage

Method Details

load_content()

Reads files from disk and extracts text content. Returns a dictionary mapping file paths to their extracted text. Each source type uses a format-specific parser:

PDFKnowledgeSource -- Uses a PDF parsing library to extract text from each page
TextFileKnowledgeSource -- Reads file contents directly as UTF-8 text
CSVKnowledgeSource -- Reads CSV rows and converts them to a text representation
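
The per-format dispatch described above can be illustrated with a standard-library sketch. It is not the CrewAI implementation: the helper names (load_text_file, load_csv_file), the suffix-based dispatch, and the choice to join CSV cells with spaces are all assumptions made for illustration, and PDF parsing is left out because it needs a third-party library.

```python
import csv
import tempfile
from pathlib import Path


def load_text_file(path: Path) -> str:
    """Read a plain text file as UTF-8 (hypothetical helper)."""
    return path.read_text(encoding="utf-8")


def load_csv_file(path: Path) -> str:
    """Flatten CSV rows into text, one line per row (assumed representation)."""
    with path.open(newline="", encoding="utf-8") as handle:
        return "\n".join(" ".join(row) for row in csv.reader(handle))


def load_content(file_paths: list[Path]) -> dict[Path, str]:
    """Map each path to its extracted text, choosing a loader by file suffix."""
    loaders = {".csv": load_csv_file}
    return {
        path: loaders.get(path.suffix.lower(), load_text_file)(path)
        for path in file_paths
    }


# Demonstrate with temporary files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"
    notes.write_text("hello knowledge", encoding="utf-8")
    faq = Path(tmp) / "faq.csv"
    faq.write_text("question,answer\nWhat is it?,A sketch\n", encoding="utf-8")

    content = load_content([notes, faq])
    notes_text = content[notes]  # "hello knowledge"
    faq_text = content[faq]      # "question answer\nWhat is it? A sketch"
```

The return shape mirrors the dict[Path, str] contract in the signatures above, so each file's text stays attributable to its source path.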

add()

Orchestrates the full ingestion pipeline for the source:

Calls load_content() to extract text from files
Calls _chunk_text() on the extracted text to produce chunks
Saves chunks to the knowledge storage via the storage backend
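
The three steps above can be sketched as a small stand-alone class. Everything here is a stand-in under stated assumptions: InMemoryStorage and SketchSource are hypothetical names, the storage interface (a single save method taking a list of chunks) is assumed rather than taken from CrewAI, and the text is supplied in memory instead of being read from disk.

```python
from pathlib import Path


class InMemoryStorage:
    """Hypothetical stand-in for the knowledge storage backend."""

    def __init__(self) -> None:
        self.documents: list[str] = []

    def save(self, chunks: list[str]) -> None:
        self.documents.extend(chunks)


class SketchSource:
    """Illustrative source following the load -> chunk -> save flow."""

    def __init__(
        self,
        content: dict[Path, str],
        storage: InMemoryStorage,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
    ) -> None:
        self._content = content
        self.storage = storage
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def load_content(self) -> dict[Path, str]:
        # A real source would parse files here; this sketch returns preset text.
        return self._content

    def _chunk_text(self, text: str) -> list[str]:
        step = self.chunk_size - self.chunk_overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]

    def add(self) -> None:
        chunks: list[str] = []
        for text in self.load_content().values():  # step 1: extract text
            chunks.extend(self._chunk_text(text))   # step 2: chunk it
        self.storage.save(chunks)                   # step 3: persist chunks


storage = InMemoryStorage()
source = SketchSource(
    {Path("a.txt"): "abcdefgh", Path("b.txt"): "xyz"},
    storage,
    chunk_size=5,
    chunk_overlap=2,
)
source.add()
# storage.documents is now ["abcde", "defgh", "gh", "xyz"]
```

Chunks from different files are never merged, so a chunk boundary always falls at a file boundary.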

_chunk_text(text)

Splits a text string into overlapping chunks:

Divides text into segments of chunk_size characters
Each chunk overlaps with the previous by chunk_overlap characters
Returns a list of text chunk strings
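
A minimal sketch of this sliding-window behaviour follows, assuming a fixed stride of chunk_size - chunk_overlap characters. The free function chunk_text and its argument validation are illustrative additions, not the library's code.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk start offsets
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# → ["abcd", "defg", "ghij", "j"]
```

With the defaults (4000 and 200) the stride is 3800 characters. As the tiny example shows, a fixed stride can leave a short trailing chunk that consists only of text already present at the end of the previous chunk.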

Code Examples

Creating a PDF Knowledge Source

from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

# Create a PDF source with custom chunking parameters
pdf_source = PDFKnowledgeSource(
    file_paths=["docs/product_manual.pdf", "docs/api_reference.pdf"],
    chunk_size=4000,
    chunk_overlap=200,
)

# Content is loaded lazily when add() is called
# pdf_source.add() triggers: load_content() -> _chunk_text() -> storage.save()

Creating a Text File Knowledge Source

from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

text_source = TextFileKnowledgeSource(
    file_paths=["notes/meeting_notes.txt", "notes/design_doc.md"],
    chunk_size=3000,
    chunk_overlap=150,
)

Creating a CSV Knowledge Source

from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource

csv_source = CSVKnowledgeSource(
    file_paths=["data/customers.csv"],
    chunk_size=2000,
    chunk_overlap=100,
)

Combining Multiple Sources

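from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource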

sources = [
    PDFKnowledgeSource(file_paths=["manuals/guide.pdf"]),
    TextFileKnowledgeSource(file_paths=["docs/readme.txt"]),
    CSVKnowledgeSource(file_paths=["data/faq.csv"]),
]

# All sources share the same interface and can be passed to Knowledge()

Related Pages

Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection -- The principle this implements
Implementation:CrewAIInc_CrewAI_Knowledge_Constructor -- Downstream: Knowledge orchestrator that consumes these sources
Implementation:CrewAIInc_CrewAI_Embedder_Config -- Embedding configuration used during ingestion
Environment:CrewAIInc_CrewAI_Optional_Provider_Dependencies
