
Workflow:CrewAIInc CrewAI Knowledge RAG Pipeline

From Leeroopedia
Knowledge Sources
Domains RAG, Knowledge_Management, Vector_Search, Multi_Agent_Systems
Last Updated 2026-02-11 18:00 GMT

Overview

End-to-end process for integrating domain-specific knowledge sources into CrewAI agents using Retrieval-Augmented Generation (RAG) with vector storage, embedding, and semantic search.

Description

This workflow covers how to augment CrewAI agents with external knowledge using the built-in RAG pipeline. The process involves selecting knowledge sources (PDF, text, CSV, JSON, web pages, or custom sources), configuring an embedding provider, ingesting content into a vector store (ChromaDB, Qdrant, or LanceDB), and attaching the knowledge system to agents or crews. During task execution, agents automatically query the knowledge base to retrieve relevant context that informs their responses. The system supports both crew-level and agent-level knowledge attachment, configurable embedding models, score-based retrieval thresholds, and direct knowledge querying via crew.query_knowledge().

Usage

Execute this workflow when agents need access to domain-specific information that is not part of their LLM training data. Typical triggers include: you have PDF documents, text files, or structured data that agents should reference; you need grounded, factual responses from agents rather than hallucinated content; or you want to build a question-answering system over proprietary documents.

Execution Steps

Step 1: Knowledge Source Selection

Choose the appropriate knowledge source classes based on the data format. CrewAI provides built-in sources for PDF files (PDFKnowledgeSource), plain text (TextKnowledgeSource), strings (StringKnowledgeSource), CSV files (CSVKnowledgeSource), JSON files (JSONKnowledgeSource), and custom sources via the BaseKnowledgeSource abstract class. Each source handles loading and parsing its content format into text chunks suitable for embedding.

Key considerations:

  • PDF source extracts text from all pages
  • CSV source preserves row structure in text representation
  • Custom sources extend BaseKnowledgeSource and implement load_content()
  • Multiple source types can be combined in a single Knowledge instance
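The custom-source pattern can be sketched without CrewAI installed. The classes below only model the load_content() contract described above; KnowledgeSource is a stand-in for CrewAI's BaseKnowledgeSource, and ChangelogSource is a hypothetical example, not a built-in class:

```python
from abc import ABC, abstractmethod


class KnowledgeSource(ABC):
    """Illustrative stand-in for CrewAI's BaseKnowledgeSource contract."""

    @abstractmethod
    def load_content(self) -> dict:
        """Return a mapping of source identifier -> raw text."""


class ChangelogSource(KnowledgeSource):
    """Hypothetical custom source that serves release notes held in memory."""

    def __init__(self, entries: dict):
        self.entries = entries

    def load_content(self) -> dict:
        # A real source would read files, call an API, scrape a page, etc.
        return {f"changelog/{version}": text for version, text in self.entries.items()}


source = ChangelogSource({"1.0": "Initial release.", "1.1": "Added Qdrant backend."})
content = source.load_content()
```

The returned text is what the pipeline later chunks and embeds, so a custom source only needs to worry about producing clean text, not about vectors or storage.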

Step 2: Embedding Configuration

Configure the embedding model that converts text chunks into vector representations. Set the embedder parameter with an EmbedderConfig specifying the provider (OpenAI, Google, Cohere, Azure, HuggingFace, etc.) and model name. The embedding model determines the vector dimensions and semantic quality of the similarity search. Ensure the embedding provider API key is set in the environment.

Key considerations:

  • OpenAI text-embedding-3-small is the default and recommended model
  • Embedding dimensions must match the vector store configuration
  • Provider-specific API keys must be available as environment variables
  • Custom embedding functions can be provided via the config
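As a sketch, an embedder configuration might look like the dictionary below. The provider/config nesting follows CrewAI's documented embedder format, but verify the exact schema against your installed version:

```python
import os

# Sketch of an embedder configuration: a provider name plus provider-specific
# options. Verify the exact schema against your CrewAI version.
embedder_config = {
    "provider": "openai",
    "config": {
        "model": "text-embedding-3-small",  # the default noted above
    },
}

# The provider's API key must already be exported in the environment,
# e.g. OPENAI_API_KEY for the OpenAI provider.
key_ready = "OPENAI_API_KEY" in os.environ
```

Because the model fixes the vector dimensions, switching embedders after ingestion requires re-ingesting into a fresh collection.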

Step 3: Knowledge Ingestion

Create a Knowledge instance with the selected sources, embedder config, and a unique collection name. Call add_sources() to process all sources: text is extracted, chunked using configurable chunking strategies, embedded into vectors, and stored in the vector database. The collection name identifies this knowledge set for later retrieval.

Key considerations:

  • Chunking splits text into overlapping segments for better retrieval
  • The default vector store is ChromaDB (file-based, no server needed)
  • Qdrant and LanceDB are available as alternative backends
  • Content is deduplicated by hash to avoid duplicate embeddings
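The chunking-with-overlap and hash-based deduplication behavior described above can be sketched in plain Python. Fixed-size character windows stand in for CrewAI's actual chunking strategy, which is configurable:

```python
import hashlib


def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """Split text into overlapping character windows (simplified chunker)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def dedupe_by_hash(chunks: list) -> list:
    """Drop chunks whose content hash has already been seen."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a boundary is still retrievable as a unit; hashing the chunk content lets re-ingested sources skip embeddings that already exist.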

Step 4: Knowledge Attachment

Attach the Knowledge instance to a Crew, or to individual Agent instances, via the knowledge_sources parameter. Crew-level knowledge is shared across all agents in the crew, while agent-level knowledge is scoped to that specific agent. Both levels can coexist, with agent knowledge supplementing crew knowledge.

Key considerations:

  • Crew-level attachment shares knowledge with all agents
  • Agent-level attachment scopes knowledge to specific agents
  • Both levels use the same embedding and storage configuration
  • Knowledge is loaded during crew initialization before task execution
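The two attachment scopes can be illustrated with a small stand-alone model. AgentModel and CrewModel below are illustrative toys, not CrewAI's own Agent and Crew classes; they only capture the scoping rule that each agent sees crew-level sources plus its own:

```python
from dataclasses import dataclass, field


@dataclass
class AgentModel:
    """Toy agent with optional agent-scoped knowledge sources."""
    name: str
    knowledge_sources: list = field(default_factory=list)


@dataclass
class CrewModel:
    """Toy crew: crew-level knowledge is shared by every agent."""
    agents: list
    knowledge_sources: list = field(default_factory=list)

    def visible_sources(self, agent: AgentModel) -> list:
        # Crew-level sources first, supplemented by agent-specific ones.
        return self.knowledge_sources + agent.knowledge_sources


researcher = AgentModel("researcher", knowledge_sources=["papers.pdf"])
writer = AgentModel("writer")
crew = CrewModel(agents=[researcher, writer], knowledge_sources=["style_guide.txt"])
```

Here the researcher sees both the shared style guide and its own papers, while the writer sees only the shared source.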

Step 5: Semantic Retrieval During Execution

When agents execute tasks, the system automatically queries the knowledge base with the task description and relevant context. Retrieved chunks are ranked by cosine similarity score and filtered by a configurable threshold (default 0.35 for crew, 0.6 for direct queries). The top results are injected into the agent's prompt as additional context, grounding responses in the source material.

Key considerations:

  • Retrieval is automatic during task execution when knowledge is attached
  • results_limit controls the maximum number of chunks returned
  • score_threshold filters out low-relevance results
  • The agent sees retrieved context as part of its task prompt
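The ranking-and-filtering logic can be sketched in plain Python. In the real pipeline the query is embedded and the vector store does the search; here small hand-written vectors stand in for embeddings:

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, score_threshold=0.35, results_limit=3):
    """Rank chunks by cosine similarity, drop low-relevance hits, cap count."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored = [(score, idx) for score, idx in scored if score >= score_threshold]
    scored.sort(reverse=True)
    return scored[:results_limit]
```

The 0.35 default mirrors the crew-level threshold mentioned above; raising it trades recall for precision, and results_limit caps how much retrieved context lands in the agent's prompt.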

Step 6: Direct Knowledge Querying

Use crew.query_knowledge() or crew.aquery_knowledge() for programmatic access to the knowledge base outside of task execution. This enables building search interfaces, validating knowledge coverage, or pre-fetching context for custom pipelines. The query returns SearchResult objects with content, score, and metadata.

Key considerations:

  • Direct queries bypass the agent execution pipeline
  • Useful for debugging knowledge quality and coverage
  • reset() clears the vector store for re-ingestion
  • Async variant available for non-blocking queries
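The SearchResult shape described above (content, score, metadata) can be modeled in plain Python. The keyword-overlap scorer below is only a stand-in for the real embedding-based crew.query_knowledge(); everything except the returned fields is illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    """Carries the fields the workflow describes: content, score, metadata."""
    content: str
    score: float
    metadata: dict = field(default_factory=dict)


def query_knowledge(corpus, query, score_threshold=0.6):
    """Toy keyword-overlap scorer standing in for the real vector search."""
    terms = set(query.lower().split())
    hits = []
    for content, metadata in corpus:
        words = set(content.lower().split())
        score = len(terms & words) / len(terms)
        if score >= score_threshold:
            hits.append(SearchResult(content, score, metadata))
    hits.sort(key=lambda result: result.score, reverse=True)
    return hits


corpus = [
    ("chromadb is the default vector store", {"source": "docs"}),
    ("agents execute tasks in sequence", {"source": "docs"}),
]
results = query_knowledge(corpus, "default vector store")
```

Inspecting scores this way is a quick check on knowledge coverage: if a representative query returns nothing above the threshold, the ingested sources likely do not cover that topic.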

Execution Diagram

GitHub URL

Workflow Repository