Implementation:Microsoft Semantic kernel Collection UpsertAsync

Overview

This page documents the EnsureCollectionExistsAsync and UpsertAsync methods on VectorStoreCollection<TKey, TRecord>, which together form the data ingestion API for persisting embedded records into a vector store collection.

Source Reference

File: dotnet/samples/GettingStartedWithVectorStores/Step1_Ingest_Data.cs (lines 38-57)
Type: API Doc

API Reference

`EnsureCollectionExistsAsync`

await collection.EnsureCollectionExistsAsync();

Aspect	Detail
Returns	`Task`
Parameters	`CancellationToken` (optional)
Behavior	Creates the collection and indexes in the backend if they do not exist; no-op if already present
Idempotent	Yes

This method must be called before any UpsertAsync calls. It reads the TRecord type's attributes to determine:

What fields to create (from [VectorStoreData] properties)
Which fields to index (from IsIndexed = true parameters)
The vector field configuration (from [VectorStoreVector(Dimensions)])

`UpsertAsync` (Single Record)

await collection.UpsertAsync(record);

Aspect	Detail
Returns	`Task<TKey>` — the key of the upserted record
Parameters	`record` (`TRecord`) — the fully populated record to upsert
Behavior	Inserts the record if the key does not exist; updates if it does

`UpsertAsync` (Batch)

await collection.UpsertAsync(records);

Aspect	Detail
Returns	`Task<IReadOnlyList<TKey>>` — the keys of all upserted records
Parameters	`records` (`IEnumerable<TRecord>`) — collection of records to upsert
Behavior	Batch upsert with the same insert-or-update semantics per record

Complete Ingestion Example

The following example demonstrates the full ingestion flow from the Semantic Kernel getting started samples:

// Prerequisites: vectorStore and embeddingGenerator are already configured

// Step 1: Get a typed collection
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");

// Step 2: Ensure the collection exists
await collection.EnsureCollectionExistsAsync();

// Step 3: Prepare records with embeddings
var glossaryEntries = new List<Glossary>
{
    new Glossary
    {
        Key = "sk",
        Category = "AI",
        Term = "Semantic Kernel",
        Definition = "Semantic Kernel is a lightweight SDK for integrating AI services."
    },
    new Glossary
    {
        Key = "rag",
        Category = "AI",
        Term = "RAG",
        Definition = "Retrieval Augmented Generation combines search with LLM generation."
    }
};

// Step 4: Generate embeddings for each entry
foreach (var entry in glossaryEntries)
{
    entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
}

// Step 5: Upsert all records
await collection.UpsertAsync(glossaryEntries);

Minimal Ingestion Pattern

The simplest possible ingestion (from the sample code) condenses to:

var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");
await collection.EnsureCollectionExistsAsync();

// For each entry: generate embedding, then upsert
entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
await collection.UpsertAsync(glossaryEntries);

Upsert Behavior Details

Key-Based Identity

The upsert operation uses the [VectorStoreKey] property to determine record identity:

// First upsert: creates the record
var record1 = new Glossary { Key = "sk", Term = "Semantic Kernel", ... };
await collection.UpsertAsync(record1);

// Second upsert with same key: updates the record
var record2 = new Glossary { Key = "sk", Term = "Semantic Kernel (Updated)", ... };
await collection.UpsertAsync(record2);

// Only one record exists with Key = "sk", containing the updated data

All Fields Are Replaced

When updating an existing record, all fields are replaced with the new values — not just the changed fields. This is a full replacement, not a partial update. Ensure that all properties are populated before calling UpsertAsync.

Error Scenarios

Scenario	Behavior
Collection does not exist (missing `EnsureCollectionExistsAsync`)	Backend-specific error (typically "collection not found")
Record missing key value	`ArgumentException` or backend error
Vector property is null or empty	Backend-specific error during indexing
Vector dimensions mismatch	Backend-specific error (dimension mismatch)
Duplicate keys in batch	Last record with that key wins (upsert semantics)

Performance Considerations

Batch vs Single Upsert

Approach	Network Calls	Use When
`UpsertAsync(singleRecord)`	One per record	Ingesting records one at a time (e.g., streaming)
`UpsertAsync(recordCollection)`	One for all records	Ingesting a batch of pre-prepared records

The batch overload is significantly more efficient for bulk ingestion because it reduces network overhead.

Embedding Generation Bottleneck

In most ingestion pipelines, embedding generation (the API call to the AI service) is the performance bottleneck, not the upsert itself. Consider:

Batching embedding requests to reduce API calls
Using parallel processing for embedding generation
Caching embeddings for data that does not change frequently

Relationship to Principle

This implementation page corresponds to the Data Ingestion principle, which explains the motivation for upsert semantics and the two-step ingestion pattern.

Principle:Microsoft_Semantic_kernel_Data_Ingestion

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment