Implementation:CrewAIInc CrewAI RAG Core

Knowledge Sources	CrewAI
Domains	RAG, Data_Loading
Last Updated	2026-02-11 00:00 GMT

Overview

Implements the core RAG (Retrieval-Augmented Generation) system that manages document ingestion, embedding generation, vector storage in ChromaDB, and semantic similarity search.

Description

The module defines two classes: Document (a Pydantic model for individual document chunks) and RAG (the main orchestrator extending the Adapter base class).

The RAG class initializes a ChromaDB client (either in-memory or persistent) with a cosine-similarity collection, and an EmbeddingService configured for a given provider and model (defaulting to OpenAI's text-embedding-3-large). It provides three core operations:

add() ingests content from any supported source. It resolves the appropriate DataType, uses the supplied loader and chunker (or selects defaults for that data type), loads the content, chunks it, generates embeddings in batch, and stores the chunks in ChromaDB. It deduplicates by comparing document IDs (SHA-256 hashes of content) and replaces stale documents when the same source has been updated.
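The deduplication step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the helper names content_id and plan_update are hypothetical, and it assumes that the IDs already stored for a source are available as a set.

```python
import hashlib


def content_id(text: str) -> str:
    """Return the SHA-256 hex digest of the text, used as a stable document ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(existing_ids: set[str], new_chunks: list[str]) -> tuple[list[str], set[str]]:
    """Compare IDs already stored for one source with freshly chunked content.

    Hypothetical helper: returns (chunks_to_add, stale_ids_to_delete).
    """
    new_by_id = {content_id(chunk): chunk for chunk in new_chunks}
    # Chunks whose hash is already stored are skipped (deduplication).
    to_add = [chunk for cid, chunk in new_by_id.items() if cid not in existing_ids]
    # Stored IDs that no longer appear in the source are stale.
    stale = existing_ids - set(new_by_id)
    return to_add, stale


stored = {content_id("alpha"), content_id("beta")}
to_add, stale = plan_update(stored, ["alpha", "gamma"])
```

Here "alpha" is unchanged and skipped, "gamma" is new and would be embedded and stored, and the ID for "beta" is flagged for removal.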

query() performs semantic search by embedding the question, querying ChromaDB for the top-k most similar documents, and formatting results with source attribution and relevance scores (converted from cosine distance to similarity).
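The distance-to-similarity conversion can be illustrated as follows. This sketch assumes the usual definition of cosine distance as 1 minus cosine similarity. The function names and the output layout are hypothetical and may differ from the string that query() actually returns.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a cosine distance (assumed to be 1 - cosine similarity) back to a similarity."""
    return 1.0 - distance


def format_results(hits: list[tuple[str, str, float]]) -> str:
    """Render (content, source, distance) hits as text with source and relevance.

    Hypothetical layout, chosen only for illustration.
    """
    blocks = []
    for content, source, distance in hits:
        score = distance_to_similarity(distance)
        blocks.append(f"[Source: {source}, Relevance: {score:.3f}]\n{content}")
    return "\n\n".join(blocks)


formatted = format_results([("Paris is the capital.", "geo.txt", 0.25)])
```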

delete_collection() and get_collection_info() provide collection management capabilities.

Usage

Import the RAG class when you need to build or query a knowledge base. It is the primary adapter used by RAGTool to give CrewAI agents access to document retrieval capabilities.

Code Reference

Source Location

Repository: CrewAI
File: lib/crewai-tools/src/crewai_tools/rag/core.py
Lines: 1-231

Signature

class Document(BaseModel):
    id: str
    content: str
    metadata: dict[str, Any]
    data_type: DataType
    source: str | None

class RAG(Adapter):
    collection_name: str = "crewai_knowledge_base"
    persist_directory: str | None = None
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-3-large"
    summarize: bool = False
    top_k: int = 5
    embedding_config: dict[str, Any]

    def add(
        self,
        content: str | Path,
        data_type: str | DataType | None = None,
        metadata: dict[str, Any] | None = None,
        loader: BaseLoader | None = None,
        chunker: BaseChunker | None = None,
        **kwargs: Any,
    ) -> None: ...

    def query(self, question: str, where: dict[str, Any] | None = None) -> str: ...

    def delete_collection(self) -> None: ...

    def get_collection_info(self) -> dict[str, Any]: ...

Import

from crewai_tools.rag.core import RAG, Document

I/O Contract

Inputs (RAG.__init__)

Name	Type	Required	Description
collection_name	str	No	ChromaDB collection name (default "crewai_knowledge_base")
persist_directory	str | None	No	Path for persistent ChromaDB storage (None for in-memory)
embedding_provider	str	No	Embedding provider name (default "openai")
embedding_model	str	No	Embedding model name (default "text-embedding-3-large")
summarize	bool	No	Whether to summarize content (default False)
top_k	int	No	Number of results to return from queries (default 5)
embedding_config	dict[str, Any]	No	Additional embedding service configuration

Inputs (RAG.add)

Name	Type	Required	Description
content	str | Path	Yes	Content to add: text, file path, or URL
data_type	str | DataType | None	No	Data type override; auto-detected if not provided
metadata	dict[str, Any] | None	No	Additional metadata to attach to documents
loader	BaseLoader | None	No	Custom loader; auto-selected if not provided
chunker	BaseChunker | None	No	Custom chunker; auto-selected if not provided

Inputs (RAG.query)

Name	Type	Required	Description
question	str	Yes	The query string to search for
where	dict[str, Any] | None	No	Optional ChromaDB where filter for metadata

Outputs

Name	Type	Description
add() return	None	Documents are stored in ChromaDB; no return value
query() return	str	Formatted string with source attribution and relevance scores
get_collection_info() return	dict[str, Any]	Dictionary with name, count, and embedding_model keys

Usage Examples

Basic Usage

from crewai_tools.rag.core import RAG

# Create an in-memory RAG instance
rag = RAG()

# Add a PDF document
rag.add("/path/to/document.pdf")

# Add a web page with explicit data type
rag.add("https://example.com/article", data_type="website")

# Query the knowledge base
result = rag.query("What is the main topic of the document?")
print(result)

# Get collection info
info = rag.get_collection_info()
print(f"Documents in collection: {info['count']}")

Persistent Storage

from crewai_tools.rag.core import RAG

rag = RAG(
    collection_name="my_project_kb",
    persist_directory="/data/chromadb",
    embedding_model="text-embedding-3-small",
    top_k=10,
)

rag.add("/path/to/docs/", data_type="directory")
result = rag.query("How do I configure the system?")

Related Pages

Principle:CrewAIInc_CrewAI_Semantic_Retrieval
