Principle:CrewAIInc_CrewAI_Knowledge_Ingestion
Metadata
| Field | Value |
|---|---|
| Principle Name | Knowledge Ingestion |
| Workflow | Knowledge_RAG_Pipeline |
| Category | Vector Storage |
| Repository | crewAIInc/crewAI |
| Implemented By | Implementation:CrewAIInc_CrewAI_Knowledge_Constructor |
Overview
The orchestration step that takes configured knowledge sources, generates embeddings from their chunked content, and stores the resulting vectors in a persistent collection for later retrieval. Knowledge Ingestion bridges the gap between parsed document chunks and a queryable vector store.
Description
Knowledge Ingestion is the step that populates the vector store. It takes knowledge source instances (already parsed and chunked), an embedding configuration, and a collection name. The ingestion process generates embeddings for each text chunk and stores them in the vector database (ChromaDB by default, Qdrant optionally). Once ingested, the collection is queryable via semantic search.
The ingestion process follows these steps:
- A `Knowledge` object is created with a list of sources, an embedder configuration, and a collection name
- The `add_sources()` method is called (or triggered automatically during initialization)
- For each source, `source.add()` is invoked, which:
  - Calls `load_content()` to extract text from the source format
  - Calls `_chunk_text()` to segment text into overlapping chunks
  - Passes chunks to the storage backend for embedding and persistence
- The storage backend embeds each chunk using the configured embedding model
- Embedded vectors are stored in a named collection in the vector database
The result is a persistent, queryable collection of document vectors that can be searched semantically.
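The flow above can be sketched as a minimal, self-contained pipeline. This is a toy illustration, not the crewAI implementation: `chunk_text`, `embed`, and `InMemoryCollection` are hypothetical stand-ins for `_chunk_text()`, a real embedding model, and a ChromaDB collection respectively.

```python
from dataclasses import dataclass, field


def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Segment text into overlapping character chunks (stand-in for _chunk_text)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(chunk: str) -> list[float]:
    """Stand-in embedder; a real system calls a model such as text-embedding-3-small."""
    return [float(sum(ord(c) for c in chunk) % 97), float(len(chunk))]


@dataclass
class InMemoryCollection:
    """Toy named collection standing in for a vector-database backend."""
    name: str
    vectors: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, chunks: list[str]) -> None:
        # Embed each chunk and persist (vector, chunk) pairs in the collection.
        for chunk in chunks:
            self.vectors.append((embed(chunk), chunk))


# Ingest one "source" into a named collection: chunk, embed, store.
collection = InMemoryCollection(name="engineering_docs")
collection.add(chunk_text("Deploy the service by pushing to the main branch."))
print(len(collection.vectors))  # one stored vector per chunk
```

The chunk overlap preserves context across chunk boundaries, so a sentence split between two chunks still appears intact in at least one of them.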
Theoretical Basis
Knowledge Ingestion implements vector indexing for approximate nearest neighbor (ANN) search. The theoretical foundation includes:
- Vector indexing -- Documents are embedded as dense vectors and stored in an index structure that supports efficient similarity queries. ChromaDB uses HNSW (Hierarchical Navigable Small World) graphs by default.
- Collection-based organization -- Each Knowledge object corresponds to a named collection in the vector store, enabling isolation between different knowledge domains.
- Idempotent ingestion -- The `reset()` method allows clearing and re-ingesting a collection, supporting iterative development and knowledge updates.
The vector database serves as a semantic index that maps from query vectors to the most relevant document chunks, measured by cosine similarity or Euclidean distance.
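Cosine-similarity ranking can be illustrated with plain Python over toy 3-dimensional vectors (a real index such as ChromaDB's HNSW graph avoids this exhaustive scan, but the ranking criterion is the same):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


# Toy collection: (vector, chunk) pairs as produced by ingestion.
collection = [
    ([1.0, 0.0, 0.0], "deployment runbook"),
    ([0.0, 1.0, 0.0], "architecture overview"),
    ([0.9, 0.1, 0.0], "release checklist"),
]

# Rank all stored chunks by similarity to the query vector.
query_vector = [1.0, 0.0, 0.0]
ranked = sorted(collection, key=lambda item: cosine_similarity(item[0], query_vector), reverse=True)
print(ranked[0][1])  # the chunk whose vector is closest to the query
```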
Storage Backends
| Backend | Description | Default |
|---|---|---|
| ChromaDB | Lightweight embedded vector database | Yes (default) |
| Qdrant | Production-grade vector database with filtering | Optional |
Usage Context
Knowledge Ingestion is the third step in the Knowledge RAG Pipeline:
- Select and configure knowledge sources (see Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection)
- Configure the embedding provider (see Principle:CrewAIInc_CrewAI_Embedding_Configuration)
- Ingest sources into vector storage (this principle)
- Attach knowledge to a Crew or Agent (see Principle:CrewAIInc_CrewAI_Knowledge_Attachment)
- Retrieve relevant chunks during task execution (see Principle:CrewAIInc_CrewAI_Semantic_Retrieval)
Design Decisions
- Automatic ingestion on attachment -- When knowledge sources are attached to a Crew or Agent, ingestion happens automatically during initialization. Users do not need to call `add_sources()` explicitly in the typical workflow.
- Named collections -- Each Knowledge object uses a collection name to isolate its vectors. This allows multiple knowledge domains to coexist in the same vector database.
- Pluggable storage -- The storage backend is injected into the Knowledge object, allowing users to swap ChromaDB for Qdrant or other backends without changing the rest of the pipeline.
- Reset capability -- The `reset()` method clears all vectors in a collection, enabling clean re-ingestion when source documents change.
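The pluggable-storage and reset decisions can be sketched with a structural interface. The `KnowledgeStorage` protocol and `InMemoryStorage` class below are hypothetical illustrations, not crewAI's actual classes: any backend exposing the same methods (ChromaDB-backed, Qdrant-backed, or otherwise) could be injected without changing the rest of the pipeline.

```python
from typing import Protocol


class KnowledgeStorage(Protocol):
    """Structural interface a storage backend must satisfy (hypothetical sketch)."""

    def save(self, chunks: list[str]) -> None: ...
    def search(self, query: list[str]) -> list[str]: ...
    def reset(self) -> None: ...


class InMemoryStorage:
    """Toy backend; a ChromaDB or Qdrant adapter would satisfy the same protocol."""

    def __init__(self, collection_name: str) -> None:
        self.collection_name = collection_name  # named collection for isolation
        self._chunks: list[str] = []

    def save(self, chunks: list[str]) -> None:
        self._chunks.extend(chunks)

    def search(self, query: list[str]) -> list[str]:
        # Substring match stands in for semantic search in this sketch.
        return [c for c in self._chunks if any(q in c for q in query)]

    def reset(self) -> None:
        # Clean re-ingestion: drop every vector in the collection.
        self._chunks.clear()


storage: KnowledgeStorage = InMemoryStorage(collection_name="engineering_docs")
storage.save(["deploy via CI", "rollback procedure"])
storage.reset()
print(storage.search(["deploy"]))  # empty after reset
```

Because the dependency is injected, swapping backends is a one-line change at construction time rather than a change to the ingestion logic.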
Example Scenario
An engineering team wants to build a knowledge base from their internal documentation:
```python
from crewai.knowledge.knowledge import Knowledge
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

# Step 1: Configure sources
sources = [
    PDFKnowledgeSource(file_paths=["docs/architecture.pdf"]),
    TextFileKnowledgeSource(file_paths=["docs/runbook.txt"]),
]

# Step 2: Configure embedder
embedder = {
    "provider": "openai",
    "config": {"model": "text-embedding-3-small"},
}

# Step 3: Create Knowledge and ingest
knowledge = Knowledge(
    collection_name="engineering_docs",
    sources=sources,
    embedder=embedder,
)
knowledge.add_sources()  # Parses, chunks, embeds, and stores all sources

# The collection "engineering_docs" is now queryable
results = knowledge.query(query=["how to deploy the service"])
```
Related Pages
- Implementation:CrewAIInc_CrewAI_Knowledge_Constructor -- Concrete Knowledge class implementation
- Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection -- Previous step: source selection
- Principle:CrewAIInc_CrewAI_Embedding_Configuration -- Previous step: embedding configuration
- Principle:CrewAIInc_CrewAI_Knowledge_Attachment -- Next step: attaching knowledge to crews/agents
- Principle:CrewAIInc_CrewAI_Semantic_Retrieval -- Final step: retrieving relevant chunks during task execution