Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection
Metadata
| Field | Value |
|---|---|
| Principle Name | Knowledge Source Selection |
| Workflow | Knowledge_RAG_Pipeline |
| Category | Data Ingestion |
| Repository | crewAIInc/crewAI |
| Implemented By | Implementation:CrewAIInc_CrewAI_Knowledge_Source_Classes |
Overview
A data ingestion pattern for loading, parsing, and chunking domain-specific documents from various file formats into a format suitable for embedding and vector storage. Knowledge Source Selection addresses the first critical step in any Retrieval-Augmented Generation (RAG) pipeline: converting raw documents into processable text chunks.
Description
Knowledge Source Selection covers the mechanics of getting data from its source format into text chunks. Different source types (PDF, text, CSV, JSON, URLs) require different parsers. Each source handles loading, text extraction, chunking (with configurable size and overlap), and preparation for embedding. The pattern decouples source-specific parsing from the generic embedding-and-storage pipeline.
The key insight is that while downstream processing (embedding, storage, retrieval) is identical regardless of the original document format, the extraction step is inherently format-specific. A PDF requires a PDF parser, a CSV requires tabular parsing, and a URL requires HTTP fetching and HTML stripping. By encapsulating each format behind a common interface, the rest of the pipeline can operate uniformly on text chunks.
The chunking process itself is configurable via two parameters:
- `chunk_size` -- the maximum number of characters per chunk (default: 4000)
- `chunk_overlap` -- the number of overlapping characters between consecutive chunks (default: 200)
Overlap ensures that information spanning chunk boundaries is not lost during retrieval.
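The sliding-window chunking described above can be sketched as follows. This is an illustrative implementation of the technique, not the actual crewAI code; the parameter names and defaults mirror the description above.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks
    share chunk_overlap characters at their boundary."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so the last chunk_overlap characters of
    # chunk i reappear as the first characters of chunk i+1.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# A 9000-character document yields chunks of 4000, 4000, and 1400
# characters; each boundary is covered twice thanks to the overlap.
chunks = chunk_text("x" * 9000)
```

Because boundary text appears in two chunks, a sentence split by a chunk boundary is still retrievable from at least one chunk in full context.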
Theoretical Basis
This pattern applies the Extract-Transform-Load (ETL) paradigm to knowledge management:
- Extract -- Documents are extracted from their source format (PDF, CSV, TXT, JSON, URLs) using format-specific parsers.
- Transform -- Extracted text is transformed into uniformly sized chunks with configurable overlap, producing a list of text segments suitable for embedding.
- Load -- Chunks are loaded into a vector storage backend after embedding (handled by the downstream Knowledge Ingestion step).
The chunking strategy follows established information retrieval practices where documents are segmented into passages of manageable size. Overlapping windows prevent the loss of context at chunk boundaries, a well-known issue in passage retrieval systems.
Supported Source Types
| Source Type | File Extensions | Parser |
|---|---|---|
| PDF Documents | .pdf | PyPDF2 / pdfplumber |
| Text Files | .txt, .md, .rst | Native file read |
| CSV Files | .csv | Python csv module |
| JSON Files | .json | Python json module |
| Excel Files | .xlsx, .xls | openpyxl |
| URLs | HTTP/HTTPS | HTTP fetch + HTML strip |
| String Literals | N/A | Direct text input |
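The extension-to-parser mapping in the table above implies a simple dispatch step when selecting a source class for a given file. The sketch below is a hypothetical illustration of that dispatch; the class names for text, JSON, and Excel sources are assumptions made for the example, and the real library may resolve sources differently.

```python
from pathlib import Path

# Hypothetical registry mirroring the table above; only the class
# names PDFKnowledgeSource and CSVKnowledgeSource appear elsewhere
# in this page, the rest are illustrative.
SOURCE_BY_EXTENSION = {
    ".pdf": "PDFKnowledgeSource",
    ".txt": "TextFileKnowledgeSource",
    ".md": "TextFileKnowledgeSource",
    ".rst": "TextFileKnowledgeSource",
    ".csv": "CSVKnowledgeSource",
    ".json": "JSONKnowledgeSource",
    ".xlsx": "ExcelKnowledgeSource",
    ".xls": "ExcelKnowledgeSource",
}

def select_source(path: str) -> str:
    """Return the source class name responsible for a file path."""
    ext = Path(path).suffix.lower()
    try:
        return SOURCE_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No knowledge source registered for {ext!r}") from None
```

URLs and string literals would bypass this file-extension dispatch, since they are identified by their input type rather than a path suffix.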
Usage Context
Knowledge Source Selection is the entry point of the Knowledge RAG Pipeline. It must be completed before any embedding or storage can occur. The typical workflow is:
1. Select and configure knowledge sources (this principle)
2. Configure the embedding provider (see Principle:CrewAIInc_CrewAI_Embedding_Configuration)
3. Ingest sources into vector storage (see Principle:CrewAIInc_CrewAI_Knowledge_Ingestion)
4. Attach knowledge to a Crew or Agent (see Principle:CrewAIInc_CrewAI_Knowledge_Attachment)
5. Retrieve relevant chunks during task execution (see Principle:CrewAIInc_CrewAI_Semantic_Retrieval)
Design Decisions
- Polymorphic source interface -- All source types implement the same base interface (`load_content()`, `add()`, `_chunk_text()`), allowing the pipeline to treat all sources uniformly.
- Configurable chunking -- Chunk size and overlap are exposed as parameters rather than hardcoded, enabling tuning for different use cases (small chunks for precise retrieval, large chunks for broader context).
- Lazy loading -- Content is loaded on demand when `add()` is called, not at construction time, reducing memory usage when sources are configured but not yet needed.
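These three decisions can be sketched together in a minimal base class. Only the method names (`load_content()`, `add()`, `_chunk_text()`) come from the description above; the signatures, constructor, and the `StringKnowledgeSource` subclass are illustrative assumptions, not the actual crewAI implementation.

```python
from abc import ABC, abstractmethod

class BaseKnowledgeSource(ABC):
    """Sketch of the polymorphic source interface with lazy loading."""

    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks: list[str] = []  # empty until add() is called

    @abstractmethod
    def load_content(self) -> str:
        """Format-specific extraction (PDF parse, CSV read, HTTP fetch...)."""

    def _chunk_text(self, text: str) -> list[str]:
        # Shared sliding-window chunking, identical for every format.
        step = self.chunk_size - self.chunk_overlap
        return [text[i : i + self.chunk_size] for i in range(0, len(text), step)]

    def add(self) -> None:
        # Lazy loading: nothing is read or parsed at construction time;
        # the expensive extraction happens only here, on demand.
        self.chunks = self._chunk_text(self.load_content())

class StringKnowledgeSource(BaseKnowledgeSource):
    """Simplest concrete source: direct text input."""

    def __init__(self, content: str, **kwargs):
        super().__init__(**kwargs)
        self.content = content

    def load_content(self) -> str:
        return self.content
```

Downstream code can then call `add()` on any source without knowing its format, which is what lets the embedding and storage steps stay format-agnostic.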
Example Scenario
Consider a customer support application that needs to ground its responses in product documentation. The documentation exists as a set of PDF manuals and a CSV of FAQ entries:
```python
from crewai.knowledge.source import PDFKnowledgeSource, CSVKnowledgeSource

# Configure sources with appropriate chunking for support context
pdf_source = PDFKnowledgeSource(
    file_paths=["manuals/product_guide.pdf", "manuals/troubleshooting.pdf"],
    chunk_size=4000,
    chunk_overlap=200,
)

csv_source = CSVKnowledgeSource(
    file_paths=["data/faq_entries.csv"],
    chunk_size=2000,
    chunk_overlap=100,
)

# Both sources now expose the same interface for downstream processing
sources = [pdf_source, csv_source]
```
Related Pages
- Implementation:CrewAIInc_CrewAI_Knowledge_Source_Classes -- Concrete source class implementations
- Principle:CrewAIInc_CrewAI_Embedding_Configuration -- Next step: embedding provider configuration
- Principle:CrewAIInc_CrewAI_Knowledge_Ingestion -- Next step: vector store ingestion