Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection
Metadata
| Field | Value |
|---|---|
| Principle Name | Knowledge Source Selection |
| Workflow | Knowledge_RAG_Pipeline |
| Category | Data Ingestion |
| Repository | crewAIInc/crewAI |
| Implemented By | Implementation:CrewAIInc_CrewAI_Knowledge_Source_Classes |
Overview
A data ingestion pattern for loading, parsing, and chunking domain-specific documents from various file formats into a format suitable for embedding and vector storage. Knowledge Source Selection addresses the first critical step in any Retrieval-Augmented Generation (RAG) pipeline: converting raw documents into processable text chunks.
Description
Knowledge Source Selection covers the mechanics of getting data from its source format into text chunks. Different source types (PDF, text, CSV, JSON, URLs) require different parsers. Each source handles loading, text extraction, chunking (with configurable size and overlap), and preparation for embedding. The pattern decouples source-specific parsing from the generic embedding-and-storage pipeline.
The key insight is that while downstream processing (embedding, storage, retrieval) is identical regardless of the original document format, the extraction step is inherently format-specific. A PDF requires a PDF parser, a CSV requires tabular parsing, and a URL requires HTTP fetching and HTML stripping. By encapsulating each format behind a common interface, the rest of the pipeline can operate uniformly on text chunks.
The chunking process itself is configurable via two parameters:
- `chunk_size` -- the maximum number of characters per chunk (default: 4000)
- `chunk_overlap` -- the number of overlapping characters between consecutive chunks (default: 200)
Overlap ensures that information spanning chunk boundaries is not lost during retrieval.
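The sliding-window chunking described above can be sketched as follows. This is an illustrative implementation of the technique, not the actual crewAI code; the parameter names and defaults mirror the description above.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks
    share chunk_overlap characters at their boundary."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so the last chunk_overlap characters of
    # chunk i reappear as the first characters of chunk i+1.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# A 9000-character document yields chunks of 4000, 4000, and 1400
# characters; each boundary is covered twice thanks to the overlap.
chunks = chunk_text("x" * 9000)
```

Because boundary text appears in two chunks, a sentence split by a chunk boundary is still retrievable from at least one chunk in full context.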
Theoretical Basis
This pattern applies the Extract-Transform-Load (ETL) paradigm to knowledge management:
- Extract -- Documents are extracted from their source format (PDF, CSV, TXT, JSON, URLs) using format-specific parsers.
- Transform -- Extracted text is transformed into uniformly sized chunks with configurable overlap, producing a list of text segments suitable for embedding.
- Load -- Chunks are loaded into a vector storage backend after embedding (handled by the downstream Knowledge Ingestion step).
The chunking strategy follows established information retrieval practices where documents are segmented into passages of manageable size. Overlapping windows prevent the loss of context at chunk boundaries, a well-known issue in passage retrieval systems.
Supported Source Types
| Source Type | File Extensions | Parser |
|---|---|---|
| PDF Documents | .pdf | PyPDF2 / pdfplumber |
| Text Files | .txt, .md, .rst | Native file read |
| CSV Files | .csv | Python csv module |
| JSON Files | .json | Python json module |
| Excel Files | .xlsx, .xls | openpyxl |
| URLs | HTTP/HTTPS | HTTP fetch + HTML strip |
| String Literals | N/A | Direct text input |
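The extension-to-parser mapping in the table above implies a simple dispatch step when selecting a source class for a given file. The sketch below is a hypothetical illustration of that dispatch; the class names for text, JSON, and Excel sources are assumptions made for the example, and the real library may resolve sources differently.

```python
from pathlib import Path

# Hypothetical registry mirroring the table above; only the class
# names PDFKnowledgeSource and CSVKnowledgeSource appear elsewhere
# in this page, the rest are illustrative.
SOURCE_BY_EXTENSION = {
    ".pdf": "PDFKnowledgeSource",
    ".txt": "TextFileKnowledgeSource",
    ".md": "TextFileKnowledgeSource",
    ".rst": "TextFileKnowledgeSource",
    ".csv": "CSVKnowledgeSource",
    ".json": "JSONKnowledgeSource",
    ".xlsx": "ExcelKnowledgeSource",
    ".xls": "ExcelKnowledgeSource",
}

def select_source(path: str) -> str:
    """Return the source class name responsible for a file path."""
    ext = Path(path).suffix.lower()
    try:
        return SOURCE_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No knowledge source registered for {ext!r}") from None
```

URLs and string literals would bypass this file-extension dispatch, since they are identified by their input type rather than a path suffix.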
Usage Context
Knowledge Source Selection is the entry point of the Knowledge RAG Pipeline. It must be completed before any embedding or storage can occur. The typical workflow is:
1. Select and configure knowledge sources (this principle)
2. Configure the embedding provider (see Principle:CrewAIInc_CrewAI_Embedding_Configuration)
3. Ingest sources into vector storage (see Principle:CrewAIInc_CrewAI_Knowledge_Ingestion)
4. Attach knowledge to a Crew or Agent (see Principle:CrewAIInc_CrewAI_Knowledge_Attachment)
5. Retrieve relevant chunks during task execution (see Principle:CrewAIInc_CrewAI_Semantic_Retrieval)
Design Decisions
- Polymorphic source interface -- All source types implement the same base interface (`load_content()`, `add()`, `_chunk_text()`), allowing the pipeline to treat all sources uniformly.
- Configurable chunking -- Chunk size and overlap are exposed as parameters rather than hardcoded, enabling tuning for different use cases (small chunks for precise retrieval, large chunks for broader context).
- Lazy loading -- Content is loaded on demand when `add()` is called, not at construction time, reducing memory usage when sources are configured but not yet needed.
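These three decisions can be sketched together in a minimal base class. Only the method names (`load_content()`, `add()`, `_chunk_text()`) come from the description above; the signatures, constructor, and the `StringKnowledgeSource` subclass are illustrative assumptions, not the actual crewAI implementation.

```python
from abc import ABC, abstractmethod

class BaseKnowledgeSource(ABC):
    """Sketch of the polymorphic source interface with lazy loading."""

    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks: list[str] = []  # empty until add() is called

    @abstractmethod
    def load_content(self) -> str:
        """Format-specific extraction (PDF parse, CSV read, HTTP fetch...)."""

    def _chunk_text(self, text: str) -> list[str]:
        # Shared sliding-window chunking, identical for every format.
        step = self.chunk_size - self.chunk_overlap
        return [text[i : i + self.chunk_size] for i in range(0, len(text), step)]

    def add(self) -> None:
        # Lazy loading: nothing is read or parsed at construction time;
        # the expensive extraction happens only here, on demand.
        self.chunks = self._chunk_text(self.load_content())

class StringKnowledgeSource(BaseKnowledgeSource):
    """Simplest concrete source: direct text input."""

    def __init__(self, content: str, **kwargs):
        super().__init__(**kwargs)
        self.content = content

    def load_content(self) -> str:
        return self.content
```

Downstream code can then call `add()` on any source without knowing its format, which is what lets the embedding and storage steps stay format-agnostic.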
Example Scenario
Consider a customer support application that needs to ground its responses in product documentation. The documentation exists as a set of PDF manuals and a CSV of FAQ entries:
```python
from crewai.knowledge.source import PDFKnowledgeSource, CSVKnowledgeSource

# Configure sources with appropriate chunking for support context
pdf_source = PDFKnowledgeSource(
    file_paths=["manuals/product_guide.pdf", "manuals/troubleshooting.pdf"],
    chunk_size=4000,
    chunk_overlap=200,
)

csv_source = CSVKnowledgeSource(
    file_paths=["data/faq_entries.csv"],
    chunk_size=2000,
    chunk_overlap=100,
)

# Both sources now expose the same interface for downstream processing
sources = [pdf_source, csv_source]
```
Related Pages
- Implementation:CrewAIInc_CrewAI_Knowledge_Source_Classes -- Concrete source class implementations
- Principle:CrewAIInc_CrewAI_Embedding_Configuration -- Next step: embedding provider configuration
- Principle:CrewAIInc_CrewAI_Knowledge_Ingestion -- Next step: vector store ingestion