Principle:Microsoft Semantic kernel Data Ingestion
Overview
The Data Ingestion principle describes the process of persisting embedded records into a vector store collection. Ingestion is the step that transforms prepared, in-memory data model instances — complete with their text content and computed vector embeddings — into durable, searchable records in the vector store backend.
Semantic Kernel's ingestion API follows upsert semantics (update-or-insert): if a record with the given key already exists, it is updated; if not, it is created. This makes the ingestion process idempotent and safe for repeated execution.
Motivation
A vector store is only useful if it contains data. The ingestion step bridges the gap between data preparation (embedding generation, content formatting) and data utilization (search, retrieval). Several design challenges arise:
- Idempotency: Re-running an ingestion pipeline should not create duplicate records or corrupt existing data
- Collection lifecycle: The target collection must exist before records can be inserted
- Atomicity: Each record operation should succeed or fail as a unit, independently of the other records
- Efficiency: Bulk ingestion of many records should minimize round-trips to the backend
The Data Ingestion principle addresses these challenges through a simple two-step pattern: ensure the collection exists, then upsert the records.
Core Concepts
Upsert Semantics
The term upsert combines "update" and "insert":
- If a record with the specified key does not exist in the collection, a new record is inserted
- If a record with the specified key already exists, the existing record is updated with the new values
This behavior is consistent across all vector store backends and eliminates the need for separate "check if exists, then insert or update" logic.
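The insert-or-update rule can be modeled with a keyed dictionary. The sketch below is a minimal in-memory illustration in Python, not the Semantic Kernel API: `Record`, `InMemoryCollection`, and its methods are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    key: str
    text: str
    vector: List[float]


class InMemoryCollection:
    """Hypothetical stand-in for a vector store collection (not the real API)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def upsert(self, record: Record) -> str:
        # A new key inserts a record; an existing key overwrites it.
        self._records[record.key] = record
        return record.key

    def get(self, key: str) -> Optional[Record]:
        return self._records.get(key)

    def count(self) -> int:
        return len(self._records)


collection = InMemoryCollection()
collection.upsert(Record("doc-1", "first draft", [0.1, 0.2]))  # key is new: insert
collection.upsert(Record("doc-1", "final text", [0.3, 0.4]))   # key exists: update
collection.upsert(Record("doc-1", "final text", [0.3, 0.4]))   # repeat changes nothing
```

After the three calls the collection still holds a single record for `doc-1`, carrying the latest values. This is the property that makes a re-run of an ingestion job safe.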
Two-Step Ingestion Pattern
Every ingestion workflow follows the same two steps:
- EnsureCollectionExistsAsync(): Idempotently creates the collection and its indexes if they do not already exist
- UpsertAsync(records): Persists one or more records into the collection using upsert semantics
This pattern is deliberately simple. More complex pipelines (chunking, incremental updates, deduplication) are built on top of these two primitives.
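The two steps can be sketched against a dictionary-backed store. This is an illustrative Python model under stated assumptions: the `Collection` class, its snake_case method names, and the rule that upserting into a missing collection raises an error are all hypothetical and only mirror the behavior described above.

```python
class Collection:
    """Hypothetical collection handle over a dict-based store (not the real API)."""

    def __init__(self, store: dict, name: str) -> None:
        self._store = store  # maps collection name -> {key: record}
        self.name = name

    def ensure_collection_exists(self) -> None:
        # setdefault creates the collection only when it is missing,
        # so repeated calls never wipe existing records.
        self._store.setdefault(self.name, {})

    def upsert(self, records: list) -> None:
        if self.name not in self._store:
            raise RuntimeError(f"collection {self.name!r} does not exist")
        for record in records:
            self._store[self.name][record["key"]] = record


def ingest(collection: Collection, records: list) -> None:
    collection.ensure_collection_exists()  # step 1: idempotent setup
    collection.upsert(records)             # step 2: insert or update
```

Calling `ingest` twice with overlapping keys leaves one record per key, and calling `upsert` before the collection exists fails in this model, which is why the setup step comes first.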
Pre-Ingestion Requirements
Before calling UpsertAsync, each record must be fully prepared:
- The key property must have a unique value
- All data properties should be populated with their content
- The vector property must contain a valid embedding of the correct dimensionality
Attempting to upsert a record with a null or empty vector field will result in a runtime error on most backends.
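A pre-flight check can catch these problems before any call reaches the backend. The following Python sketch assumes records are plain dictionaries with `key`, `text`, and `vector` fields; the field names and both helper functions are hypothetical and exist only for illustration.

```python
import math


def validate_record(record: dict, dimensions: int) -> list:
    """Return a list of problems; an empty list means the record is ready."""
    problems = []
    if not record.get("key"):
        problems.append("missing key")
    if not record.get("text"):
        problems.append("empty text")
    vector = record.get("vector")
    if not vector:
        problems.append("missing or empty vector")
    elif len(vector) != dimensions:
        problems.append(f"expected {dimensions} dimensions, got {len(vector)}")
    elif not all(math.isfinite(x) for x in vector):
        problems.append("vector contains non-finite values")
    return problems


def find_duplicate_keys(records: list) -> set:
    """Return every key that appears more than once in the batch."""
    seen, duplicates = set(), set()
    for record in records:
        key = record.get("key")
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates
```

Running both checks over a batch before ingestion turns a backend-specific runtime error into a predictable, local failure.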
Design Principles
Simplicity Over Flexibility
The ingestion API intentionally provides a single UpsertAsync method rather than separate insert and update methods. This reduces cognitive load and eliminates an entire category of "does it exist?" conditional logic.
Collection-Level Operations
All ingestion operations happen at the collection level, not the store level. This means:
- You must first obtain a typed collection reference via GetCollection
- The collection enforces type safety: only records of the correct type can be upserted
- Different collections can be ingested independently and concurrently
Async-First Design
All ingestion methods are asynchronous, reflecting the reality that vector store backends are typically remote services. The await-based API ensures efficient use of I/O threads and natural integration with the .NET async ecosystem.
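The same await-based shape, together with independent ingestion of separate collections, can be illustrated with Python's asyncio as an analogue of .NET async/await. `AsyncCollection` is a hypothetical stand-in, and the `asyncio.sleep` calls simulate network round-trips; none of this is the Semantic Kernel API.

```python
import asyncio


class AsyncCollection:
    """Hypothetical async collection; sleeps stand in for remote calls."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.exists = False
        self.records = {}

    async def ensure_collection_exists(self) -> None:
        await asyncio.sleep(0.01)  # simulated round-trip to the backend
        self.exists = True

    async def upsert(self, records: list) -> None:
        await asyncio.sleep(0.01)  # simulated round-trip to the backend
        for record in records:
            self.records[record["key"]] = record


async def ingest(collection: AsyncCollection, records: list) -> None:
    await collection.ensure_collection_exists()
    await collection.upsert(records)


async def ingest_all(jobs: list) -> None:
    # Each job targets its own collection, so the jobs can run concurrently.
    await asyncio.gather(*(ingest(c, recs) for c, recs in jobs))
```

A caller would run `asyncio.run(ingest_all([(hotels, hotel_records), (faqs, faq_records)]))`. While one job waits on its simulated I/O, the event loop is free to advance the other.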
Ingestion Pipeline Overview
A typical ingestion pipeline consists of these stages:
- Data acquisition: Load raw data from files, databases, APIs, or other sources
- Data transformation: Clean, chunk, and format the data into record instances
- Embedding generation: Generate vector embeddings for text fields using IEmbeddingGenerator
- Collection setup: Ensure the target collection exists
- Record persistence: Upsert the prepared records into the collection
The first three stages are application-specific. The last two use the Semantic Kernel vector store API documented on the corresponding implementation page.
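The five stages can be wired together in a few lines. The Python sketch below is an assumption-laden toy: the store is a plain dictionary, `fake_embed` derives a deterministic vector from a hash instead of calling an embedding model, and every function name is hypothetical.

```python
import hashlib

DIMENSIONS = 4  # illustrative size; real embedding models use far more


def acquire() -> list:
    # Stage 1: load raw data (hard-coded here instead of files or APIs).
    return ["Semantic Kernel stores vectors.  ", "Upsert is idempotent."]


def transform(raw_texts: list) -> list:
    # Stage 2: clean the text and shape it into keyed records.
    return [{"key": f"doc-{i}", "text": text.strip()}
            for i, text in enumerate(raw_texts, start=1)]


def fake_embed(text: str) -> list:
    # Stage 3 stand-in: a hash-derived vector, NOT a real embedding.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:DIMENSIONS]]


def run_pipeline(store: dict, collection_name: str) -> int:
    records = transform(acquire())
    for record in records:
        record["vector"] = fake_embed(record["text"])
    collection = store.setdefault(collection_name, {})  # Stage 4: ensure it exists
    for record in records:                              # Stage 5: upsert by key
        collection[record["key"]] = record
    return len(records)
```

Because the keys are derived deterministically and the final stage overwrites by key, running the pipeline twice leaves the store unchanged.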
Bulk Ingestion Considerations
When ingesting large datasets:
- Batch embedding generation: Generate embeddings for multiple records in a single API call to reduce latency
- Batch upsert: The overload of UpsertAsync that accepts a collection of records is more efficient than upserting records one at a time
- Error handling: Consider wrapping upsert calls in try-catch blocks to handle individual record failures without aborting the entire ingestion
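One way to combine batching with failure isolation is to send records in fixed-size batches and fall back to single-record upserts only when a batch fails. The Python sketch below assumes a batch call that rejects the whole batch when any record is invalid; that behavior, like the helper names, is a hypothetical choice for the example, and real backends may differ.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def make_strict_upsert(store: dict):
    """Build a hypothetical batch upsert that rejects a batch with any bad record."""
    def upsert_batch(batch):
        if any(not record.get("vector") for record in batch):
            raise ValueError("record without a vector")
        for record in batch:
            store[record["key"]] = record
    return upsert_batch


def ingest_in_batches(upsert_batch, records, batch_size=100):
    """Upsert in batches; return the keys of records that could not be stored."""
    failed_keys = []
    for batch in batched(records, batch_size):
        try:
            upsert_batch(batch)  # one call per batch in the common case
        except Exception:
            # Retry one record at a time to isolate the failing records.
            for record in batch:
                try:
                    upsert_batch([record])
                except Exception:
                    failed_keys.append(record["key"])
    return failed_keys
```

With 250 records and a batch size of 100, the happy path needs three calls instead of 250. A single bad record costs one extra pass over its batch instead of aborting the run.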
Relationship to Other Principles
- Vector Store Data Model defines the record type that is ingested
- Vector Store Collection Setup creates the collection that receives the data
- Embedding Generation produces the vectors that are stored with each record
- Vector Similarity Search queries the ingested data
- RAG Chat Augmentation consumes the ingested data for LLM-augmented responses
Implementation:Microsoft_Semantic_kernel_Collection_UpsertAsync