Principle:CrewAIInc CrewAI Knowledge Ingestion

From Leeroopedia

Metadata

Principle Name: Knowledge Ingestion
Workflow: Knowledge_RAG_Pipeline
Category: Vector Storage
Repository: crewAIInc/crewAI
Implemented By: Implementation:CrewAIInc_CrewAI_Knowledge_Constructor

Overview

Knowledge Ingestion is the orchestration step that takes configured knowledge sources, generates embeddings from their chunked content, and stores the resulting vectors in a persistent collection for later retrieval. It bridges the gap between parsed document chunks and a queryable vector store.

Description

Knowledge Ingestion is the step that populates the vector store. It takes knowledge source instances (already parsed and chunked), an embedding configuration, and a collection name. The ingestion process generates embeddings for each text chunk and stores them in the vector database (ChromaDB by default, Qdrant optionally). Once ingested, the collection is queryable via semantic search.

The ingestion process follows these steps:

  1. A Knowledge object is created with a list of sources, an embedder configuration, and a collection name
  2. The add_sources() method is called (or triggered automatically during initialization)
  3. For each source, source.add() is invoked, which:
    • Calls load_content() to extract text from the source format
    • Calls _chunk_text() to segment text into overlapping chunks
    • Passes chunks to the storage backend for embedding and persistence
  4. The storage backend embeds each chunk using the configured embedding model
  5. Embedded vectors are stored in a named collection in the vector database

The result is a persistent, queryable collection of document vectors that can be searched semantically.
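The steps above can be sketched in plain Python. This is an illustrative sketch, not CrewAI's implementation: `chunk_text` mirrors the role of the source-side `_chunk_text()` hook, while the embedder and the in-memory collection are toy stand-ins for the real embedding model and vector database.

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Segment text into overlapping chunks (step 3 above)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(chunk: str) -> list[float]:
    """Toy embedder; a real backend calls an embedding model here (step 4)."""
    return [float(ord(c)) for c in chunk[:4]]

def ingest(sources: list[str], store: dict[str, list], name: str) -> None:
    """Chunk each source, embed each chunk, store vectors in a named collection (step 5)."""
    store.setdefault(name, [])
    for text in sources:
        for chunk in chunk_text(text):
            store[name].append((embed(chunk), chunk))

store: dict[str, list] = {}
ingest(["Deploy with kubectl apply. Roll back with kubectl rollout undo."],
       store, "engineering_docs")
print(len(store["engineering_docs"]))  # number of stored (vector, chunk) pairs
```

Note that consecutive chunks share `overlap` characters, so context at chunk boundaries is not lost to retrieval.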

Theoretical Basis

Knowledge Ingestion implements vector indexing for approximate nearest neighbor (ANN) search. The theoretical foundation includes:

  • Vector indexing -- Documents are embedded as dense vectors and stored in an index structure that supports efficient similarity queries. ChromaDB uses HNSW (Hierarchical Navigable Small World) graphs by default.
  • Collection-based organization -- Each Knowledge object corresponds to a named collection in the vector store, enabling isolation between different knowledge domains.
  • Idempotent ingestion -- The reset() method allows clearing and re-ingesting a collection, supporting iterative development and knowledge updates.

The vector database serves as a semantic index that maps from query vectors to the most relevant document chunks, measured by cosine similarity or Euclidean distance.
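Querying that semantic index reduces to nearest-neighbor search over the stored vectors. A brute-force version of the cosine-similarity ranking, which index structures like HNSW approximate at scale, looks like this (a sketch with made-up two-dimensional vectors, not a real index):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec: list[float], index: list[tuple[str, list[float]]]):
    """Brute-force O(n) scan; HNSW replaces this with a graph traversal."""
    return max(index, key=lambda item: cosine_similarity(query_vec, item[1]))

index = [("chunk-a", [1.0, 0.0]), ("chunk-b", [0.6, 0.8]), ("chunk-c", [0.0, 1.0])]
print(nearest([0.7, 0.7], index)[0])  # chunk-b is closest in angle
```

Cosine similarity ranks by angle rather than magnitude, which is why embeddings with different norms can still be compared meaningfully.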

Storage Backends

ChromaDB: lightweight embedded vector database (default)
Qdrant: production-grade vector database with filtering (optional)
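Because the backend is injected rather than hard-coded, the rest of the pipeline depends only on a small storage interface. The following is an illustration of that pattern, not CrewAI's actual `KnowledgeStorage` signature; the method names and the keyword-matching "search" are assumptions for the sketch:

```python
from typing import Protocol

class VectorStorage(Protocol):
    """Hypothetical storage interface; real backend classes differ in detail."""
    def save(self, chunks: list[str]) -> None: ...
    def search(self, query: str, limit: int) -> list[str]: ...
    def reset(self) -> None: ...

class InMemoryStorage:
    """Stand-in for an embedded backend such as ChromaDB."""
    def __init__(self) -> None:
        self.chunks: list[str] = []
    def save(self, chunks: list[str]) -> None:
        self.chunks.extend(chunks)
    def search(self, query: str, limit: int) -> list[str]:
        # Naive keyword match; a real backend ranks by vector similarity.
        return [c for c in self.chunks if query in c][:limit]
    def reset(self) -> None:
        self.chunks.clear()

def ingest_into(storage: VectorStorage, chunks: list[str]) -> None:
    """The pipeline sees only the interface, so backends can be swapped."""
    storage.save(chunks)

backend = InMemoryStorage()
ingest_into(backend, ["deploy with kubectl", "rollback procedure"])
print(backend.search("deploy", limit=1))  # ['deploy with kubectl']
```

Swapping ChromaDB for Qdrant then means constructing a different object that satisfies the same interface and passing it in; no ingestion code changes.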

Usage Context

Knowledge Ingestion is the third step in the Knowledge RAG Pipeline:

  1. Select and configure knowledge sources (see Principle:CrewAIInc_CrewAI_Knowledge_Source_Selection)
  2. Configure the embedding provider (see Principle:CrewAIInc_CrewAI_Embedding_Configuration)
  3. Ingest sources into vector storage (this principle)
  4. Attach knowledge to a Crew or Agent (see Principle:CrewAIInc_CrewAI_Knowledge_Attachment)
  5. Retrieve relevant chunks during task execution (see Principle:CrewAIInc_CrewAI_Semantic_Retrieval)

Design Decisions

  • Automatic ingestion on attachment -- When knowledge sources are attached to a Crew or Agent, ingestion happens automatically during initialization. Users do not need to call add_sources() explicitly in the typical workflow.
  • Named collections -- Each Knowledge object uses a collection name to isolate its vectors. This allows multiple knowledge domains to coexist in the same vector database.
  • Pluggable storage -- The storage backend is injected into the Knowledge object, allowing users to swap ChromaDB for Qdrant or other backends without changing the rest of the pipeline.
  • Reset capability -- The reset() method clears all vectors in a collection, enabling clean re-ingestion when source documents change.
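The reset capability is what makes re-ingestion safe when source documents change: clearing before adding leaves the collection in the same state no matter how many times ingestion runs. A toy sketch of the pattern (the `Collection` class here is illustrative, not a CrewAI type):

```python
class Collection:
    """Toy named collection demonstrating reset-then-re-ingest."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.vectors: list[str] = []
    def add(self, chunks: list[str]) -> None:
        self.vectors.extend(chunks)
    def reset(self) -> None:
        self.vectors.clear()

def reingest(collection: Collection, chunks: list[str]) -> None:
    # Clearing first makes ingestion idempotent: running this twice with the
    # same chunks yields the same collection, with no duplicate vectors.
    collection.reset()
    collection.add(chunks)

docs = Collection("engineering_docs")
reingest(docs, ["chunk-1", "chunk-2"])
reingest(docs, ["chunk-1", "chunk-2"])  # same state, no duplicates
print(len(docs.vectors))  # 2
```

Without the reset, a second ingestion run would append a second copy of every chunk, skewing retrieval toward duplicated content.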

Example Scenario

An engineering team wants to build a knowledge base from their internal documentation:

from crewai.knowledge.knowledge import Knowledge
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

# Step 1: Configure sources
sources = [
    PDFKnowledgeSource(file_paths=["docs/architecture.pdf"]),
    TextFileKnowledgeSource(file_paths=["docs/runbook.txt"]),
]

# Step 2: Configure embedder
embedder = {
    "provider": "openai",
    "config": {"model": "text-embedding-3-small"},
}

# Step 3: Create Knowledge and ingest
knowledge = Knowledge(
    collection_name="engineering_docs",
    sources=sources,
    embedder=embedder,
)
knowledge.add_sources()  # Parses, chunks, embeds, and stores all sources

# The collection "engineering_docs" is now queryable
results = knowledge.query(query=["how to deploy the service"])
