Principle:Microsoft Semantic kernel Data Ingestion

Overview

The Data Ingestion principle describes the process of persisting embedded records into a vector store collection. Ingestion is the step that transforms prepared, in-memory data model instances, complete with their text content and computed vector embeddings, into durable, searchable records in the vector store backend.

Semantic Kernel's ingestion API follows upsert semantics (update-or-insert): if a record with the given key already exists, it is updated; if not, it is created. This makes ingestion idempotent: re-running it with the same records leaves the collection in the same state, so repeated execution is safe.

Motivation

A vector store is only useful if it contains data. The ingestion step bridges the gap between data preparation (embedding generation, content formatting) and data utilization (search, retrieval). Several design challenges arise:

Idempotency: Re-running an ingestion pipeline should not create duplicate records or corrupt existing data
Collection lifecycle: The target collection must exist before records can be inserted
Atomicity: Individual record operations should succeed or fail independently
Efficiency: Bulk ingestion of many records should minimize round-trips to the backend

The Data Ingestion principle addresses these challenges through a simple two-step pattern: ensure the collection exists, then upsert the records.

Core Concepts

Upsert Semantics

The term upsert combines "update" and "insert":

If a record with the specified key does not exist in the collection, a new record is inserted
If a record with the specified key already exists, the existing record is updated with the new values

This behavior is consistent across all vector store backends and eliminates the need for separate "check if exists, then insert or update" logic.
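The insert-or-update rule above can be illustrated with a minimal in-memory sketch. The Record and InMemoryCollection types below are hypothetical stand-ins written for this example, not the Semantic Kernel API; they only show how writing by key removes the need for an existence check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    key: str
    text: str
    vector: List[float]


class InMemoryCollection:
    """Hypothetical stand-in for a vector store collection (not a real API)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def upsert(self, record: Record) -> None:
        # Assigning by key inserts when the key is new and overwrites when it exists.
        self._records[record.key] = record

    def get(self, key: str) -> Optional[Record]:
        return self._records.get(key)

    def count(self) -> int:
        return len(self._records)


collection = InMemoryCollection()
collection.upsert(Record("doc-1", "first draft", [0.1, 0.2]))   # key is new: insert
collection.upsert(Record("doc-1", "second draft", [0.3, 0.4]))  # key exists: update
collection.upsert(Record("doc-1", "second draft", [0.3, 0.4]))  # repeat changes nothing
```

After the three calls the collection still holds a single record for doc-1, carrying the values from the latest write.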

Two-Step Ingestion Pattern

Every ingestion workflow follows the same two steps:

1. EnsureCollectionExistsAsync(): Idempotently creates the collection and its indexes if they do not already exist
2. UpsertAsync(records): Persists one or more records into the collection using upsert semantics

This pattern is deliberately simple. More complex pipelines (chunking, incremental updates, deduplication) are built on top of these two primitives.
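As a rough sketch of the two-step pattern, the following uses a hypothetical in-memory backend. The class and method names (InMemoryStore, ensure_collection_exists, upsert) are assumptions that only mirror the shape of the pattern; they are not the library's actual signatures.

```python
import asyncio
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for a vector store backend."""

    def __init__(self) -> None:
        self.collections: Dict[str, Dict[str, dict]] = {}


class InMemoryCollection:
    """Hypothetical collection handle; method names only mirror the pattern."""

    def __init__(self, store: InMemoryStore, name: str) -> None:
        self._store = store
        self._name = name

    async def ensure_collection_exists(self) -> None:
        # setdefault creates the collection only when it is missing,
        # so calling this repeatedly never discards existing records.
        self._store.collections.setdefault(self._name, {})

    async def upsert(self, records: List[dict]) -> None:
        if self._name not in self._store.collections:
            raise RuntimeError(f"collection {self._name!r} does not exist")
        target = self._store.collections[self._name]
        for record in records:
            target[record["key"]] = record


async def ingest(collection: InMemoryCollection, records: List[dict]) -> None:
    await collection.ensure_collection_exists()  # step 1
    await collection.upsert(records)             # step 2


store = InMemoryStore()
hotels = InMemoryCollection(store, "hotels")
records = [
    {"key": "h1", "text": "Seaside inn", "vector": [0.1, 0.9]},
    {"key": "h2", "text": "Mountain lodge", "vector": [0.8, 0.2]},
]

asyncio.run(ingest(hotels, records))
asyncio.run(ingest(hotels, records))  # second run leaves the same two records
```

Because both steps are idempotent in this sketch, the whole ingest function can be re-run without special handling for the first run.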

Pre-Ingestion Requirements

Before calling UpsertAsync, each record must be fully prepared:

The key property must have a unique value
All data properties should be populated with their content
The vector property must contain a valid embedding of the correct dimensionality

Attempting to upsert a record with a null or empty vector field will result in a runtime error on most backends.
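One way to catch these problems before they reach the backend is a validation pass over each record. The sketch below is illustrative only: the Record shape, the validation_errors helper, and the EXPECTED_DIMENSIONS value of 4 are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Assumption: must match the dimensionality the collection was created with.
EXPECTED_DIMENSIONS = 4


@dataclass
class Record:
    key: Optional[str]
    text: Optional[str]
    vector: Optional[List[float]]


def validation_errors(record: Record, seen_keys: Set[str]) -> List[str]:
    """Return the reasons a record is not ready to be upserted (empty if ready)."""
    errors: List[str] = []
    if not record.key:
        errors.append("key is missing")
    elif record.key in seen_keys:
        errors.append(f"duplicate key {record.key!r} in this batch")
    if not record.text:
        errors.append("text is empty")
    if not record.vector:
        errors.append("vector is null or empty")
    elif len(record.vector) != EXPECTED_DIMENSIONS:
        errors.append(
            f"vector has {len(record.vector)} dimensions, expected {EXPECTED_DIMENSIONS}"
        )
    return errors


good = Record("doc-1", "hello", [0.1, 0.2, 0.3, 0.4])
no_vector = Record("doc-2", "world", [])
wrong_size = Record("doc-3", "short", [0.1, 0.2])
```

A pipeline could call validation_errors on every record and upsert only those that return an empty list, logging the rest for repair.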

Design Principles

Simplicity Over Flexibility

The ingestion API intentionally provides a single UpsertAsync method rather than separate insert and update methods. This reduces cognitive load and eliminates an entire category of "does it exist?" conditional logic.

Collection-Level Operations

All ingestion operations happen at the collection level, not the store level. This means:

You must first obtain a typed collection reference via GetCollection
The collection enforces type safety: only records of the correct type can be upserted
Different collections can be ingested independently and concurrently

Async-First Design

All ingestion methods are asynchronous, reflecting the reality that vector store backends are typically remote services. Awaiting an upsert frees the calling thread while the network request is in flight, and the API integrates naturally with the rest of the .NET async ecosystem.
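To illustrate why an async API suits independent collections, here is a small sketch using Python's asyncio in place of .NET tasks. SlowCollection is a hypothetical stand-in whose short sleep simulates a network round-trip; nothing here is a real connector.

```python
import asyncio
from typing import Dict, List, Tuple


class SlowCollection:
    """Hypothetical collection that simulates a remote backend with a short delay."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, dict] = {}

    async def upsert(self, records: List[dict]) -> None:
        await asyncio.sleep(0.01)  # simulated network round-trip
        for record in records:
            self.records[record["key"]] = record


async def ingest_all(jobs: List[Tuple[SlowCollection, List[dict]]]) -> None:
    # gather awaits every upsert concurrently, so the simulated delays overlap
    # instead of adding up.
    await asyncio.gather(*(collection.upsert(records) for collection, records in jobs))


products = SlowCollection("products")
reviews = SlowCollection("reviews")
asyncio.run(ingest_all([
    (products, [{"key": "p1", "vector": [0.1, 0.2]}]),
    (reviews, [{"key": "r1", "vector": [0.3, 0.4]}, {"key": "r2", "vector": [0.5, 0.6]}]),
]))
```

Each collection ends up with only its own records, which is what makes it safe to run the two ingestions side by side.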

Ingestion Pipeline Overview

A typical ingestion pipeline consists of these stages:

1. Data acquisition: Load raw data from files, databases, APIs, or other sources
2. Data transformation: Clean, chunk, and format the data into record instances
3. Embedding generation: Generate vector embeddings for text fields using IEmbeddingGenerator
4. Collection setup: Ensure the target collection exists
5. Record persistence: Upsert the prepared records into the collection

Steps 1-3 are application-specific. Steps 4-5 use the Semantic Kernel vector store API documented on the corresponding implementation page.
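The five stages can be sketched end to end as follows. Everything here is a placeholder chosen for illustration: toy_embedding is a deterministic stand-in and not a real embedding model, and InMemoryCollection is a hypothetical substitute for a vector store collection.

```python
import asyncio
from typing import Dict, List, Optional


def acquire() -> List[str]:
    # Stage 1: a real pipeline would read files, databases, or APIs.
    return ["  Vector stores hold embeddings.  ", "", "Upsert means update or insert."]


def transform(raw: List[str]) -> List[dict]:
    # Stage 2: drop empty entries, trim whitespace, and assign keys.
    cleaned = [text.strip() for text in raw if text.strip()]
    return [{"key": f"doc-{i}", "text": text} for i, text in enumerate(cleaned)]


def toy_embedding(text: str, dimensions: int = 4) -> List[float]:
    # Stage 3 placeholder: NOT a real embedding model, only a deterministic stand-in.
    vector = [0.0] * dimensions
    for index, char in enumerate(text):
        vector[index % dimensions] += ord(char) / 1000.0
    return vector


class InMemoryCollection:
    """Hypothetical stand-in for a vector store collection."""

    def __init__(self) -> None:
        self.records: Optional[Dict[str, dict]] = None  # None means "not created yet"

    async def ensure_collection_exists(self) -> None:
        if self.records is None:
            self.records = {}

    async def upsert(self, records: List[dict]) -> None:
        if self.records is None:
            raise RuntimeError("collection does not exist")
        for record in records:
            self.records[record["key"]] = record


async def run_pipeline(collection: InMemoryCollection) -> int:
    records = transform(acquire())                 # stages 1 and 2
    for record in records:
        record["vector"] = toy_embedding(record["text"])  # stage 3
    await collection.ensure_collection_exists()    # stage 4
    await collection.upsert(records)               # stage 5
    return len(records)


collection = InMemoryCollection()
ingested = asyncio.run(run_pipeline(collection))
```

In this sketch the first three stages are plain functions that never touch the collection, which keeps the application-specific part separate from the persistence part.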

Bulk Ingestion Considerations

When ingesting large datasets:

Batch embedding generation: Generate embeddings for multiple records in a single API call to reduce latency
Batch upsert: The overload of UpsertAsync that accepts a collection of records is typically more efficient than upserting records one at a time, because the connector can send them in fewer round-trips where the backend supports it
Error handling: Consider wrapping upsert calls in try-catch blocks to handle individual record failures without aborting the entire ingestion
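A possible shape for batched ingestion with per-batch error handling is sketched below. StrictCollection is a hypothetical stand-in that rejects any batch containing a record without a vector; whether a real backend rejects the whole batch or only the offending record is backend-specific and is assumed here for simplicity.

```python
import asyncio
from typing import Dict, Iterator, List


def batches(items: List[dict], size: int) -> Iterator[List[dict]]:
    # Yield consecutive slices of at most `size` records.
    for start in range(0, len(items), size):
        yield items[start:start + size]


class StrictCollection:
    """Hypothetical collection that rejects batches containing an empty vector."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}
        self.calls = 0

    async def upsert(self, records: List[dict]) -> None:
        self.calls += 1
        # Validate the whole batch first so a rejected batch writes nothing.
        for record in records:
            if not record.get("vector"):
                raise ValueError(f"record {record['key']!r} has no vector")
        for record in records:
            self.records[record["key"]] = record


async def ingest_in_batches(
    collection: StrictCollection, records: List[dict], batch_size: int
) -> List[str]:
    failed: List[str] = []
    for batch in batches(records, batch_size):
        try:
            await collection.upsert(batch)
        except ValueError:
            # Note the failed keys and keep going instead of aborting the run.
            failed.extend(record["key"] for record in batch)
    return failed


records = [{"key": f"doc-{i}", "vector": [0.1 * (i + 1)]} for i in range(5)]
records[3]["vector"] = []  # simulate a record whose embedding step failed

collection = StrictCollection()
failed = asyncio.run(ingest_in_batches(collection, records, batch_size=2))
```

The returned list of failed keys can be retried later; because writes are upserts, retrying a batch that partly succeeded would not create duplicates.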

Relationship to Other Principles

Vector Store Data Model defines the record type that is ingested
Vector Store Collection Setup creates the collection that receives the data
Embedding Generation produces the vectors that are stored with each record
Vector Similarity Search queries the ingested data
RAG Chat Augmentation consumes the ingested data for LLM-augmented responses

Implementation:Microsoft_Semantic_kernel_Collection_UpsertAsync
