Implementation:CrewAIInc CrewAI RAG Core

From Leeroopedia
Knowledge Sources
Domains: RAG, Data_Loading
Last Updated: 2026-02-11 00:00 GMT

Overview

Implements the core RAG (Retrieval-Augmented Generation) system that manages document ingestion, embedding generation, vector storage in ChromaDB, and semantic similarity search.

Description

The module defines two classes: Document (a Pydantic model for individual document chunks) and RAG (the main orchestrator extending the Adapter base class).

The RAG class initializes a ChromaDB client (either in-memory or persistent) with a cosine-similarity collection, and an EmbeddingService configured for a given provider and model (defaulting to OpenAI's text-embedding-3-large). It provides three core operations:

add() ingests content from any supported source. It resolves the appropriate DataType, uses the provided loader and chunker or selects suitable ones automatically, loads the content, chunks it, generates embeddings in batch, and stores everything in ChromaDB. Document IDs are SHA-256 hashes of the content, so add() deduplicates by comparing IDs and replaces stale documents when the same source has been updated.
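The deduplication and stale-document replacement described for add() can be sketched as follows. This is a minimal illustration that uses only the standard library, not the code in core.py: the names content_id and sync_source are hypothetical, and a plain dict stands in for the ChromaDB collection.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a stable ID for a chunk: the SHA-256 hex digest of its text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_source(store: dict[str, dict], source: str, chunks: list[str]) -> tuple[int, int]:
    """Bring `store` in line with the current chunks of `source`.

    `store` is a hypothetical stand-in for the vector collection and maps
    chunk ID -> {"content": ..., "source": ...}. Returns (added, removed).
    """
    new_chunks = {content_id(chunk): chunk for chunk in chunks}
    old_ids = {doc_id for doc_id, doc in store.items() if doc["source"] == source}

    # Chunks stored for this source that no longer appear in it are stale.
    stale = old_ids - new_chunks.keys()
    for doc_id in stale:
        del store[doc_id]

    # Chunks whose ID is already stored are duplicates and are skipped.
    added = 0
    for doc_id, text in new_chunks.items():
        if doc_id not in store:
            store[doc_id] = {"content": text, "source": source}
            added += 1
    return added, len(stale)
```

Because the ID depends only on the chunk text, re-adding an unchanged source adds nothing, while an edited source replaces only the chunks that changed.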

query() performs semantic search by embedding the question, querying ChromaDB for the top-k most similar documents, and formatting results with source attribution and relevance scores (converted from cosine distance to similarity).
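The distance-to-similarity conversion mentioned above can be illustrated with a small standard-library sketch. Cosine distance is 1 minus cosine similarity, so the relevance score is recovered as 1 minus the distance. The helper names and the output layout below are assumptions made for illustration; the exact string that query() returns may differ.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec, docs, k=5):
    """Rank docs by similarity to query_vec.

    docs is a list of (source, text, vector) tuples. Returns up to k
    (similarity, source, text) tuples, most similar first.
    """
    scored = []
    for source, text, vec in docs:
        distance = cosine_distance(query_vec, vec)
        scored.append((1.0 - distance, source, text))  # distance -> similarity
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def format_results(results) -> str:
    """Render results with source attribution and a relevance score (hypothetical layout)."""
    return "\n".join(
        f"[Source: {source}, Relevance: {similarity:.3f}]\n{text}"
        for similarity, source, text in results
    )
```

In this sketch, the k argument plays the role of the top_k setting on the RAG instance.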

delete_collection() removes the collection, and get_collection_info() reports its name, document count, and embedding model.

Usage

Import the RAG class when you need to build or query a knowledge base. It is the primary adapter used by RAGTool to give CrewAI agents access to document retrieval capabilities.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/core.py
  • Lines: 1-231

Signature

class Document(BaseModel):
    id: str
    content: str
    metadata: dict[str, Any]
    data_type: DataType
    source: str | None

class RAG(Adapter):
    collection_name: str = "crewai_knowledge_base"
    persist_directory: str | None = None
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-3-large"
    summarize: bool = False
    top_k: int = 5
    embedding_config: dict[str, Any]

    def add(
        self,
        content: str | Path,
        data_type: str | DataType | None = None,
        metadata: dict[str, Any] | None = None,
        loader: BaseLoader | None = None,
        chunker: BaseChunker | None = None,
        **kwargs: Any,
    ) -> None: ...

    def query(self, question: str, where: dict[str, Any] | None = None) -> str: ...

    def delete_collection(self) -> None: ...

    def get_collection_info(self) -> dict[str, Any]: ...

Import

from crewai_tools.rag.core import RAG, Document

I/O Contract

Inputs (RAG.__init__)

Name Type Required Description
collection_name str No ChromaDB collection name (default "crewai_knowledge_base")
persist_directory str | None No Path for persistent ChromaDB storage (None for in-memory)
embedding_provider str No Embedding provider name (default "openai")
embedding_model str No Embedding model name (default "text-embedding-3-large")
summarize bool No Whether to summarize content (default False)
top_k int No Number of results to return from queries (default 5)
embedding_config dict[str, Any] No Additional embedding service configuration

Inputs (RAG.add)

Name Type Required Description
content str | Path Yes Content to add: text, file path, or URL
data_type str | DataType | None No Data type override; auto-detected if not provided
metadata dict[str, Any] | None No Additional metadata to attach to documents
loader BaseLoader | None No Custom loader; auto-selected if not provided
chunker BaseChunker | None No Custom chunker; auto-selected if not provided

Inputs (RAG.query)

Name Type Required Description
question str Yes The query string to search for
where dict[str, Any] | None No Optional ChromaDB where filter for metadata
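As a rough illustration of the where argument, the sketch below builds ChromaDB-style metadata filters. ChromaDB accepts a single {"key": value} equality condition and combines several conditions under "$and". The helper build_where and the metadata keys used in the example are hypothetical; which keys actually exist depends on the metadata stored with your documents.

```python
from typing import Any, Optional


def build_where(**equals: Any) -> Optional[dict[str, Any]]:
    """Build a ChromaDB-style equality filter from keyword arguments.

    Returns None when no conditions are given, a single {"key": value}
    dict for one condition, and an {"$and": [...]} dict for several.
    """
    conditions = [{key: value} for key, value in equals.items()]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


# Hypothetical metadata keys, shown for illustration only.
single = build_where(source="/path/to/document.pdf")
combined = build_where(source="/path/to/document.pdf", author="alice")
```

A filter built this way could then be passed as rag.query(question, where=single), assuming the stored documents carry the referenced metadata keys.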

Outputs

Name Type Description
add() return None Documents are stored in ChromaDB; no return value
query() return str Formatted string with source attribution and relevance scores
get_collection_info() return dict[str, Any] Dictionary with name, count, and embedding_model keys

Usage Examples

Basic Usage

from crewai_tools.rag.core import RAG

# Create an in-memory RAG instance
rag = RAG()

# Add a PDF document
rag.add("/path/to/document.pdf")

# Add a web page with explicit data type
rag.add("https://example.com/article", data_type="website")

# Query the knowledge base
result = rag.query("What is the main topic of the document?")
print(result)

# Get collection info
info = rag.get_collection_info()
print(f"Documents in collection: {info['count']}")

Persistent Storage

from crewai_tools.rag.core import RAG

rag = RAG(
    collection_name="my_project_kb",
    persist_directory="/data/chromadb",
    embedding_model="text-embedding-3-small",
    top_k=10,
)

rag.add("/path/to/docs/", data_type="directory")
result = rag.query("How do I configure the system?")
