Implementation:CrewAIInc CrewAI RAG Core
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Implements the core RAG (Retrieval-Augmented Generation) system that manages document ingestion, embedding generation, vector storage in ChromaDB, and semantic similarity search.
Description
The module defines two classes: Document (a Pydantic model for individual document chunks) and RAG (the main orchestrator extending the Adapter base class).
The RAG class initializes a ChromaDB client (either in-memory or persistent) with a cosine-similarity collection, and an EmbeddingService configured for a given provider and model (defaulting to OpenAI's text-embedding-3-large). It provides three core operations:
add() ingests content from any supported source. It resolves the appropriate DataType, uses the supplied loader and chunker or selects defaults for that data type, loads the content, splits it into chunks, generates embeddings in a single batch, and stores the chunks, embeddings, and metadata in ChromaDB. It deduplicates by comparing document IDs (SHA-256 hashes of content) and replaces stale documents when the same source has been updated.
query() performs semantic search by embedding the question, querying ChromaDB for the top-k most similar documents, and formatting results with source attribution and relevance scores (converted from cosine distance to similarity).
delete_collection() and get_collection_info() provide collection management capabilities.
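The deduplication and stale-document replacement described for add() can be pictured with a small standalone sketch. It does not reproduce the module's code: ToyStore, content_id, and the rule "a stored chunk is stale if its source is re-added without it" are hypothetical stand-ins chosen for illustration, and a plain dictionary takes the place of the ChromaDB collection.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a deterministic SHA-256 hex digest to use as a chunk ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ToyStore:
    """Hypothetical in-memory stand-in for the ChromaDB collection."""

    def __init__(self) -> None:
        # Maps chunk ID -> {"content": ..., "source": ...}
        self.docs: dict[str, dict[str, str]] = {}

    def add(self, source: str, chunks: list[str]) -> int:
        """Store chunks for a source and return how many were newly written."""
        new_ids = {content_id(chunk): chunk for chunk in chunks}

        # IDs already stored for this source.
        old_ids = {i for i, d in self.docs.items() if d["source"] == source}

        # Assumed staleness rule: stored for this source, absent from the new content.
        for stale_id in old_ids - new_ids.keys():
            del self.docs[stale_id]

        added = 0
        for doc_id, text in new_ids.items():
            if doc_id in self.docs:
                continue  # Same content hash already stored: skip the duplicate.
            self.docs[doc_id] = {"content": text, "source": source}
            added += 1
        return added


store = ToyStore()
first = store.add("notes.txt", ["alpha", "beta"])    # both chunks are new
second = store.add("notes.txt", ["alpha", "beta"])   # nothing changes
third = store.add("notes.txt", ["alpha", "gamma"])   # "beta" is dropped, "gamma" is added
```

Because the ID is derived from the content alone, re-adding unchanged content is a no-op, while an edited source yields new IDs for the changed chunks and leaves the old ones to be removed.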
Usage
Import the RAG class when you need to build or query a knowledge base. It is the primary adapter used by RAGTool to give CrewAI agents access to document retrieval capabilities.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/core.py
- Lines: 1-231
Signature
class Document(BaseModel):
    id: str
    content: str
    metadata: dict[str, Any]
    data_type: DataType
    source: str | None

class RAG(Adapter):
    collection_name: str = "crewai_knowledge_base"
    persist_directory: str | None = None
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-3-large"
    summarize: bool = False
    top_k: int = 5
    embedding_config: dict[str, Any]

    def add(
        self,
        content: str | Path,
        data_type: str | DataType | None = None,
        metadata: dict[str, Any] | None = None,
        loader: BaseLoader | None = None,
        chunker: BaseChunker | None = None,
        **kwargs: Any,
    ) -> None: ...

    def query(self, question: str, where: dict[str, Any] | None = None) -> str: ...

    def delete_collection(self) -> None: ...

    def get_collection_info(self) -> dict[str, Any]: ...
Import
from crewai_tools.rag.core import RAG, Document
I/O Contract
Inputs (RAG.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| collection_name | str | No | ChromaDB collection name (default "crewai_knowledge_base") |
| persist_directory | str \| None | No | Path for persistent ChromaDB storage (None for in-memory) |
| embedding_provider | str | No | Embedding provider name (default "openai") |
| embedding_model | str | No | Embedding model name (default "text-embedding-3-large") |
| summarize | bool | No | Whether to summarize content (default False) |
| top_k | int | No | Number of results to return from queries (default 5) |
| embedding_config | dict[str, Any] | No | Additional embedding service configuration |
Inputs (RAG.add)
| Name | Type | Required | Description |
|---|---|---|---|
| content | str \| Path | Yes | Content to add: text, file path, or URL |
| data_type | str \| DataType \| None | No | Data type override; auto-detected if not provided |
| metadata | dict[str, Any] \| None | No | Additional metadata to attach to documents |
| loader | BaseLoader \| None | No | Custom loader; auto-selected if not provided |
| chunker | BaseChunker \| None | No | Custom chunker; auto-selected if not provided |
Inputs (RAG.query)
| Name | Type | Required | Description |
|---|---|---|---|
| question | str | Yes | The query string to search for |
| where | dict[str, Any] \| None | No | Optional ChromaDB where filter for metadata |
Outputs
| Name | Type | Description |
|---|---|---|
| add() return | None | Documents are stored in ChromaDB; no return value |
| query() return | str | Formatted string with source attribution and relevance scores |
| get_collection_info() return | dict[str, Any] | Dictionary with name, count, and embedding_model keys |
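To make the relevance scores in the query() output concrete, the following is a minimal sketch of a top-k cosine search over toy vectors. It assumes the usual convention that cosine distance equals 1 minus cosine similarity, so similarity is recovered as 1 minus distance. The helper names (cosine_distance, top_k, format_results) and the output layout are hypothetical and are not taken from the module, which delegates the search to ChromaDB.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec, docs, k=5):
    """Rank (source, content, vector) triples by similarity to query_vec."""
    scored = [
        (1.0 - cosine_distance(query_vec, vec), source, content)  # distance -> similarity
        for source, content, vec in docs
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def format_results(results) -> str:
    """Hypothetical layout: one attributed, scored entry per result."""
    return "\n".join(
        f"[Source: {source}, Relevance: {score:.3f}]\n{content}"
        for score, source, content in results
    )


# Toy two-dimensional "embeddings"; real embeddings have many more dimensions.
docs = [
    ("a.txt", "cats", [1.0, 0.0]),
    ("b.txt", "dogs", [0.0, 1.0]),
    ("c.txt", "kittens", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], docs, k=2)
report = format_results(results)
print(report)
```

A score near 1 indicates a vector pointing in almost the same direction as the query, and a score near 0 indicates an unrelated (orthogonal) one.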
Usage Examples
Basic Usage
from crewai_tools.rag.core import RAG
# Create an in-memory RAG instance
rag = RAG()
# Add a PDF document
rag.add("/path/to/document.pdf")
# Add a web page with explicit data type
rag.add("https://example.com/article", data_type="website")
# Query the knowledge base
result = rag.query("What is the main topic of the document?")
print(result)
# Get collection info
info = rag.get_collection_info()
print(f"Documents in collection: {info['count']}")
Persistent Storage
from crewai_tools.rag.core import RAG
rag = RAG(
    collection_name="my_project_kb",
    persist_directory="/data/chromadb",
    embedding_model="text-embedding-3-small",
    top_k=10,
)
rag.add("/path/to/docs/", data_type="directory")
result = rag.query("How do I configure the system?")