Principle:CrewAIInc_CrewAI_Knowledge_Ingestion
Metadata
| Field | Value |
|---|---|
| Principle Name | Knowledge Ingestion |
| Workflow | Knowledge_RAG_Pipeline |
| Category | Vector Storage |
| Repository | crewAIInc/crewAI |
| Implemented By | Implementation:CrewAIInc_CrewAI_Knowledge_Constructor |
Overview
The orchestration step that takes configured knowledge sources, generates embeddings from their chunked content, and stores the resulting vectors in a persistent collection for later retrieval. Knowledge Ingestion bridges the gap between parsed document chunks and a queryable vector store.
Description
Knowledge Ingestion is the step that populates the vector store. It takes knowledge source instances (already parsed and chunked), an embedding configuration, and a collection name. The ingestion process generates embeddings for each text chunk and stores them in the vector database (ChromaDB by default, Qdrant optionally). Once ingested, the collection is queryable via semantic search.
The ingestion process follows these steps:
- A `Knowledge` object is created with a list of sources, an embedder configuration, and a collection name
- The `add_sources()` method is called (or triggered automatically during initialization)
- For each source, `source.add()` is invoked, which:
  - Calls `load_content()` to extract text from the source format
  - Calls `_chunk_text()` to segment text into overlapping chunks
  - Passes chunks to the storage backend for embedding and persistence
- The storage backend embeds each chunk using the configured embedding model
- Embedded vectors are stored in a named collection in the vector database
The result is a persistent, queryable collection of document vectors that can be searched semantically.
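The flow above can be sketched as a minimal, self-contained pipeline. This is a toy illustration, not the crewAI implementation: `chunk_text`, `embed`, and `InMemoryCollection` are hypothetical stand-ins for `_chunk_text()`, a real embedding model, and a ChromaDB collection respectively.

```python
from dataclasses import dataclass, field


def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Segment text into overlapping character chunks (stand-in for _chunk_text)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(chunk: str) -> list[float]:
    """Stand-in embedder; a real system calls a model such as text-embedding-3-small."""
    return [float(sum(ord(c) for c in chunk) % 97), float(len(chunk))]


@dataclass
class InMemoryCollection:
    """Toy named collection standing in for a vector-database backend."""
    name: str
    vectors: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, chunks: list[str]) -> None:
        # Embed each chunk and persist (vector, chunk) pairs in the collection.
        for chunk in chunks:
            self.vectors.append((embed(chunk), chunk))


# Ingest one "source" into a named collection: chunk, embed, store.
collection = InMemoryCollection(name="engineering_docs")
collection.add(chunk_text("Deploy the service by pushing to the main branch."))
print(len(collection.vectors))  # one stored vector per chunk
```

The chunk overlap preserves context across chunk boundaries, so a sentence split between two chunks still appears intact in at least one of them.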
Theoretical Basis
Knowledge Ingestion implements vector indexing for approximate nearest neighbor (ANN) search. The theoretical foundation includes:
- Vector indexing -- Documents are embedded as dense vectors and stored in an index structure that supports efficient similarity queries. ChromaDB uses HNSW (Hierarchical Navigable Small World) graphs by default.
- Collection-based organization -- Each Knowledge object corresponds to a named collection in the vector store, enabling isolation between different knowledge domains.
- Idempotent ingestion -- The `reset()` method allows clearing and re-ingesting a collection, supporting iterative development and knowledge updates.
The vector database serves as a semantic index that maps from query vectors to the most relevant document chunks, measured by cosine similarity or Euclidean distance.
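Cosine-similarity ranking can be illustrated with plain Python over toy 3-dimensional vectors (a real index such as ChromaDB's HNSW graph avoids this exhaustive scan, but the ranking criterion is the same):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


# Toy collection: (vector, chunk) pairs as produced by ingestion.
collection = [
    ([1.0, 0.0, 0.0], "deployment runbook"),
    ([0.0, 1.0, 0.0], "architecture overview"),
    ([0.9, 0.1, 0.0], "release checklist"),
]

# Rank all stored chunks by similarity to the query vector.
query_vector = [1.0, 0.0, 0.0]
ranked = sorted(collection, key=lambda item: cosine_similarity(item[0], query_vector), reverse=True)
print(ranked[0][1])  # the chunk whose vector is closest to the query
```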
Storage Backends
| Backend | Description | Default |
|---|---|---|
| ChromaDB | Lightweight embedded vector database | Yes (default) |
| Qdrant | Production-grade vector database with filtering | Optional |
Usage Context
Knowledge Ingestion is the third step in the Knowledge RAG Pipeline:
- Select and configure knowledge sources (see Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection)
- Configure the embedding provider (see Principle:CrewAIInc_CrewAI_Embedding_Configuration)
- Ingest sources into vector storage (this principle)
- Attach knowledge to a Crew or Agent (see Principle:CrewAIInc_CrewAI_Knowledge_Attachment)
- Retrieve relevant chunks during task execution (see Principle:CrewAIInc_CrewAI_Semantic_Retrieval)
Design Decisions
- Automatic ingestion on attachment -- When knowledge sources are attached to a Crew or Agent, ingestion happens automatically during initialization. Users do not need to call `add_sources()` explicitly in the typical workflow.
- Named collections -- Each Knowledge object uses a collection name to isolate its vectors. This allows multiple knowledge domains to coexist in the same vector database.
- Pluggable storage -- The storage backend is injected into the Knowledge object, allowing users to swap ChromaDB for Qdrant or other backends without changing the rest of the pipeline.
- Reset capability -- The `reset()` method clears all vectors in a collection, enabling clean re-ingestion when source documents change.
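The pluggable-storage and reset decisions can be sketched with a structural interface. The `KnowledgeStorage` protocol and `InMemoryStorage` class below are hypothetical illustrations, not crewAI's actual classes: any backend exposing the same methods (ChromaDB-backed, Qdrant-backed, or otherwise) could be injected without changing the rest of the pipeline.

```python
from typing import Protocol


class KnowledgeStorage(Protocol):
    """Structural interface a storage backend must satisfy (hypothetical sketch)."""

    def save(self, chunks: list[str]) -> None: ...
    def search(self, query: list[str]) -> list[str]: ...
    def reset(self) -> None: ...


class InMemoryStorage:
    """Toy backend; a ChromaDB or Qdrant adapter would satisfy the same protocol."""

    def __init__(self, collection_name: str) -> None:
        self.collection_name = collection_name  # named collection for isolation
        self._chunks: list[str] = []

    def save(self, chunks: list[str]) -> None:
        self._chunks.extend(chunks)

    def search(self, query: list[str]) -> list[str]:
        # Substring match stands in for semantic search in this sketch.
        return [c for c in self._chunks if any(q in c for q in query)]

    def reset(self) -> None:
        # Clean re-ingestion: drop every vector in the collection.
        self._chunks.clear()


storage: KnowledgeStorage = InMemoryStorage(collection_name="engineering_docs")
storage.save(["deploy via CI", "rollback procedure"])
storage.reset()
print(storage.search(["deploy"]))  # empty after reset
```

Because the dependency is injected, swapping backends is a one-line change at construction time rather than a change to the ingestion logic.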
Example Scenario
An engineering team wants to build a knowledge base from their internal documentation:
```python
from crewai.knowledge.knowledge import Knowledge
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

# Step 1: Configure sources
sources = [
    PDFKnowledgeSource(file_paths=["docs/architecture.pdf"]),
    TextFileKnowledgeSource(file_paths=["docs/runbook.txt"]),
]

# Step 2: Configure embedder
embedder = {
    "provider": "openai",
    "config": {"model": "text-embedding-3-small"},
}

# Step 3: Create Knowledge and ingest
knowledge = Knowledge(
    collection_name="engineering_docs",
    sources=sources,
    embedder=embedder,
)
knowledge.add_sources()  # Parses, chunks, embeds, and stores all sources

# The collection "engineering_docs" is now queryable
results = knowledge.query(query=["how to deploy the service"])
```
Related Pages
- Implementation:CrewAIInc_CrewAI_Knowledge_Constructor -- Concrete Knowledge class implementation
- Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection -- Previous step: source selection
- Principle:CrewAIInc_CrewAI_Embedding_Configuration -- Previous step: embedding configuration
- Principle:CrewAIInc_CrewAI_Knowledge_Attachment -- Next step: attaching knowledge to crews/agents
- Principle:CrewAIInc_CrewAI_Semantic_Retrieval -- Final step: retrieving relevant chunks during task execution