Implementation:CrewAIInc CrewAI Knowledge Source Classes
Metadata
| Field | Value |
|---|---|
| Implementation Name | Knowledge Source Classes |
| Workflow | Knowledge_RAG_Pipeline |
| Category | Data Ingestion |
| Repository | crewAIInc/crewAI |
| Implements | Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection |
Overview
Concrete, format-specific knowledge source classes, provided by the CrewAI knowledge subsystem, for loading and chunking documents. Each class inherits from BaseFileKnowledgeSource and implements its own parsing, while all of them share a common chunking and ingestion interface.
Source References
| Class | Source File | Lines |
|---|---|---|
| PDFKnowledgeSource | src/crewai/knowledge/source/pdf_knowledge_source.py | L7-60 |
| TextFileKnowledgeSource | src/crewai/knowledge/source/text_file_knowledge_source.py | L6-40 |
| CSVKnowledgeSource | src/crewai/knowledge/source/csv_knowledge_source.py | L7-48 |
Signatures
class PDFKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for PDF documents."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...
class TextFileKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for plain text files."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...
class CSVKnowledgeSource(BaseFileKnowledgeSource):
    """Knowledge source for CSV files."""
    def load_content(self) -> dict[Path, str]: ...
    def add(self) -> None: ...
    def _chunk_text(self, text: str) -> list[str]: ...
BaseFileKnowledgeSource fields:
class BaseFileKnowledgeSource(BaseKnowledgeSource):
    file_paths: list[Path | str]
    chunk_size: int = 4000
    chunk_overlap: int = 200
    content: dict[Path, str] = {}
Import
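from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource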
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | file_paths: list[Path \| str] | List of file paths to load and parse |
| Input | chunk_size: int | Maximum characters per chunk (default: 4000) |
| Input | chunk_overlap: int | Overlapping characters between chunks (default: 200) |
| Output | Knowledge source instance | Object with loaded and chunked content, ready for embedding and storage |
Method Details
load_content()
Reads files from disk and extracts text content. Returns a dictionary mapping file paths to their extracted text. Each source type uses a format-specific parser:
- PDFKnowledgeSource -- Uses a PDF parsing library to extract text from each page
- TextFileKnowledgeSource -- Reads file contents directly as UTF-8 text
- CSVKnowledgeSource -- Reads CSV rows and converts them to a text representation
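The per-format dispatch described above can be illustrated with a minimal, standard-library-only sketch. It is not the CrewAI implementation: the helper names (`load_text_file`, `load_csv_file`) and the choice to join CSV cells with spaces are assumptions made for illustration, and PDF parsing is left out because it depends on a third-party library.

```python
import csv
from pathlib import Path


def load_text_file(path: Path) -> str:
    """Read a plain text file as UTF-8 (hypothetical helper)."""
    return path.read_text(encoding="utf-8")


def load_csv_file(path: Path) -> str:
    """Flatten CSV rows into text: one line per row, cells joined by spaces.

    The exact text representation is an assumption of this sketch.
    """
    with path.open("r", encoding="utf-8", newline="") as f:
        return "\n".join(" ".join(row) for row in csv.reader(f))


def load_content(file_paths: list[Path]) -> dict[Path, str]:
    """Map each path to its extracted text, dispatching on the file suffix."""
    loaders = {".csv": load_csv_file}
    return {
        path: loaders.get(path.suffix.lower(), load_text_file)(path)
        for path in file_paths
    }
```

The returned mapping has the same shape as the `dict[Path, str]` in the signatures above: one entry per input file, keyed by its path.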
add()
Orchestrates the full ingestion pipeline for the source:
- Calls load_content() to extract text from files
- Calls _chunk_text() on the extracted text to produce chunks
- Saves chunks to the knowledge storage via the storage backend
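To make the ordering of these steps concrete, here is a small self-contained sketch. `InMemoryStorage` and `SketchSource` are hypothetical stand-ins, not CrewAI classes, and the chunker is a plain fixed-width split with no overlap so that the focus stays on the load, chunk, save sequence.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStorage:
    """Hypothetical stand-in for the knowledge storage backend."""
    saved: list[str] = field(default_factory=list)

    def save(self, chunks: list[str]) -> None:
        self.saved.extend(chunks)


@dataclass
class SketchSource:
    """Hypothetical source illustrating the load -> chunk -> save order."""
    texts: dict[str, str]      # stands in for files on disk
    storage: InMemoryStorage
    chunk_size: int = 4000

    def load_content(self) -> dict[str, str]:
        # Step 1: extract text for every "file".
        return dict(self.texts)

    def _chunk_text(self, text: str) -> list[str]:
        # Step 2 (placeholder): fixed-width split, overlap omitted here.
        return [
            text[i:i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]

    def add(self) -> None:
        chunks: list[str] = []
        for text in self.load_content().values():
            chunks.extend(self._chunk_text(text))
        # Step 3: hand all chunks to the storage backend.
        self.storage.save(chunks)
```

Under these assumptions, calling `add()` on a source that holds the text `"abcdefghij"` with `chunk_size=4` leaves `["abcd", "efgh", "ij"]` in the storage.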
_chunk_text(text)
Splits a text string into overlapping chunks:
- Divides text into segments of chunk_size characters
- Each chunk overlaps with the previous by chunk_overlap characters
- Returns a list of text chunk strings
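A sliding-window split is one straightforward way to get this behavior. The sketch below is an illustration under that assumption, not a copy of the library code. It takes chunk_size and chunk_overlap as explicit parameters (the class method reads them from the instance), and the validation is an addition of this sketch.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

With the default values, the window advances 3800 characters at a time, so a 10,000-character text yields three chunks starting at offsets 0, 3800, and 7600. The last chunk may be shorter than chunk_size.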
Code Examples
Creating a PDF Knowledge Source
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
# Create a PDF source with custom chunking parameters
pdf_source = PDFKnowledgeSource(
    file_paths=["docs/product_manual.pdf", "docs/api_reference.pdf"],
    chunk_size=4000,
    chunk_overlap=200,
)
# Content is loaded lazily when add() is called
# pdf_source.add() triggers: load_content() -> _chunk_text() -> storage.save()
Creating a Text File Knowledge Source
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
# Create a text source with smaller chunks than the defaults
text_source = TextFileKnowledgeSource(
    file_paths=["notes/meeting_notes.txt", "notes/design_doc.md"],
    chunk_size=3000,
    chunk_overlap=150,
)
Creating a CSV Knowledge Source
from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource
# Create a CSV source; rows are converted to text before chunking
csv_source = CSVKnowledgeSource(
    file_paths=["data/customers.csv"],
    chunk_size=2000,
    chunk_overlap=100,
)
Combining Multiple Sources
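from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.source.csv_knowledge_source import CSVKnowledgeSource
# Mix sources of different formats; each uses the default chunking parameters
sources = [
    PDFKnowledgeSource(file_paths=["manuals/guide.pdf"]),
    TextFileKnowledgeSource(file_paths=["docs/readme.txt"]),
    CSVKnowledgeSource(file_paths=["data/faq.csv"]),
]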
# All sources share the same interface and can be passed to Knowledge()
Related Pages
- Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection -- The principle this implements
- Implementation:CrewAIInc_CrewAI_Knowledge_Constructor -- Downstream: Knowledge orchestrator that consumes these sources
- Implementation:CrewAIInc_CrewAI_Embedder_Config -- Embedding configuration used during ingestion
- Environment:CrewAIInc_CrewAI_Optional_Provider_Dependencies