Implementation:Microsoft Semantic kernel Collection UpsertAsync

From Leeroopedia

Overview

This page documents the EnsureCollectionExistsAsync and UpsertAsync methods on VectorStoreCollection<TKey, TRecord>, which together form the data ingestion API for persisting embedded records into a vector store collection.

Source Reference

  • File: dotnet/samples/GettingStartedWithVectorStores/Step1_Ingest_Data.cs (lines 38-57)
  • Type: API Doc

API Reference

EnsureCollectionExistsAsync

await collection.EnsureCollectionExistsAsync();
Aspect     | Detail
Returns    | Task
Parameters | CancellationToken (optional)
Behavior   | Creates the collection and indexes in the backend if they do not exist; no-op if already present
Idempotent | Yes

This method must be called before any UpsertAsync calls. It reads the TRecord type's attributes to determine:

  • What fields to create (from [VectorStoreData] properties)
  • Which fields to index (from IsIndexed = true parameters)
  • The vector field configuration (from [VectorStoreVector(Dimensions)])
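As an illustration of the attributes listed above, a record type in the shape the Glossary examples on this page rely on might look like the following. This is a sketch rather than the sample's verbatim source: it assumes the Microsoft.Extensions.VectorData package is referenced, and the choice of which property is indexed and the 1536 dimension count are assumptions that must match your own data and embedding model.

```csharp
using System;
using Microsoft.Extensions.VectorData;

// Sketch of a record type; property names follow the examples on this page.
public sealed class Glossary
{
    // Identifies the record; an upsert with the same key overwrites it.
    [VectorStoreKey]
    public string Key { get; set; } = string.Empty;

    // Stored and indexed (assumed here so the field can be used for filtering).
    [VectorStoreData(IsIndexed = true)]
    public string Category { get; set; } = string.Empty;

    // Stored but not indexed.
    [VectorStoreData]
    public string Term { get; set; } = string.Empty;

    [VectorStoreData]
    public string Definition { get; set; } = string.Empty;

    // 1536 is an assumed dimension count; it must equal the embedding model's output size.
    [VectorStoreVector(1536)]
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}
```

EnsureCollectionExistsAsync would derive the collection schema from a type like this, so changing the attributes after the collection has been created may require recreating the collection, depending on the backend.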

UpsertAsync (Single Record)

await collection.UpsertAsync(record);
Aspect     | Detail
Returns    | Task<TKey>: the key of the upserted record
Parameters | record (TRecord): the fully populated record to upsert
Behavior   | Inserts the record if the key does not exist; updates it if it does

UpsertAsync (Batch)

await collection.UpsertAsync(records);
Aspect     | Detail
Returns    | Task<IReadOnlyList<TKey>>: the keys of all upserted records
Parameters | records (IEnumerable<TRecord>): collection of records to upsert
Behavior   | Batch upsert with the same insert-or-update semantics per record

Complete Ingestion Example

The following example demonstrates the full ingestion flow from the Semantic Kernel getting started samples:

// Prerequisites: vectorStore and embeddingGenerator are already configured

// Step 1: Get a typed collection
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");

// Step 2: Ensure the collection exists
await collection.EnsureCollectionExistsAsync();

// Step 3: Prepare records with embeddings
var glossaryEntries = new List<Glossary>
{
    new Glossary
    {
        Key = "sk",
        Category = "AI",
        Term = "Semantic Kernel",
        Definition = "Semantic Kernel is a lightweight SDK for integrating AI services."
    },
    new Glossary
    {
        Key = "rag",
        Category = "AI",
        Term = "RAG",
        Definition = "Retrieval Augmented Generation combines search with LLM generation."
    }
};

// Step 4: Generate embeddings for each entry
foreach (var entry in glossaryEntries)
{
    entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
}

// Step 5: Upsert all records
await collection.UpsertAsync(glossaryEntries);

Minimal Ingestion Pattern

The simplest possible ingestion (from the sample code) condenses to:

var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");
await collection.EnsureCollectionExistsAsync();

// For each entry: generate its embedding, then upsert it
foreach (var entry in glossaryEntries)
{
    entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
    await collection.UpsertAsync(entry);
}

Upsert Behavior Details

Key-Based Identity

The upsert operation uses the [VectorStoreKey] property to determine record identity:

// First upsert: creates the record
var record1 = new Glossary { Key = "sk", Term = "Semantic Kernel", /* remaining properties elided */ };
await collection.UpsertAsync(record1);

// Second upsert with the same key: updates the record
var record2 = new Glossary { Key = "sk", Term = "Semantic Kernel (Updated)", /* remaining properties elided */ };
await collection.UpsertAsync(record2);

// Only one record exists with Key = "sk", and it contains the updated data

All Fields Are Replaced

When an existing record is updated, every field is replaced with the new values, not only the fields that changed. Upsert is a full replacement, not a partial update, so populate all properties (including the vector) before calling UpsertAsync.
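The consequence of full replacement can be sketched without a vector store at all. The example below uses a plain dictionary as a stand-in for a collection (it is not the Semantic Kernel API, and the Upsert helper is hypothetical) to show how a partially populated record wipes a stored field, and how a read-modify-write sequence avoids that.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// In-memory stand-in for a collection: key -> (Term, Definition).
// It models full-replacement semantics only.
var store = new Dictionary<string, (string? Term, string? Definition)>();

// Hypothetical helper: replaces the whole stored value for the key.
void Upsert(string key, string? term, string? definition) =>
    store[key] = (term, definition);

Upsert("sk", "Semantic Kernel", "A lightweight SDK for integrating AI services.");

// Pitfall: a partially populated record drops the stored Definition.
Upsert("sk", "Semantic Kernel (Updated)", null);
bool definitionLost = store["sk"].Definition is null;

// Safer: read the current record, change one field, write the whole record back.
Upsert("sk", "Semantic Kernel", "A lightweight SDK for integrating AI services.");
var current = store["sk"];
Upsert("sk", "Semantic Kernel (Updated)", current.Definition);

Console.WriteLine($"Definition lost by partial upsert: {definitionLost}");
Console.WriteLine($"Definition after read-modify-write: {store["sk"].Definition}");
```

With a real collection, the read step would fetch the existing record by key before modifying it; how that read is performed depends on the connector in use.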

Error Scenarios

Scenario                                                        | Behavior
Collection does not exist (missing EnsureCollectionExistsAsync) | Backend-specific error (typically "collection not found")
Record missing key value                                        | ArgumentException or backend error
Vector property is null or empty                                | Backend-specific error during indexing
Vector dimensions mismatch                                      | Backend-specific error (dimension mismatch)
Duplicate keys in batch                                         | Last record with that key wins (upsert semantics)
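Because most of these failures surface as backend-specific errors, it can help to check records before they are sent. The following is a minimal sketch of such a pre-flight check; Validate and KeepLastPerKey are hypothetical helpers, records are modeled as simple tuples rather than a real TRecord, and the expected dimension count of 3 is chosen only to keep the example small.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pre-flight check; an empty result means the record looks safe to upsert.
List<string> Validate(string key, ReadOnlyMemory<float> vector, int expectedDimensions)
{
    var problems = new List<string>();
    if (string.IsNullOrWhiteSpace(key))
        problems.Add("missing key");
    if (vector.IsEmpty)
        problems.Add("empty vector");
    else if (vector.Length != expectedDimensions)
        problems.Add($"expected {expectedDimensions} dimensions, got {vector.Length}");
    return problems;
}

// Makes "last record wins" explicit by keeping only the final record for each key.
List<(string Key, float[] Vector)> KeepLastPerKey(IEnumerable<(string Key, float[] Vector)> items) =>
    items.GroupBy(r => r.Key).Select(g => g.Last()).ToList();

var batch = new List<(string Key, float[] Vector)>
{
    ("sk", new float[] { 0.1f, 0.2f, 0.3f }),
    ("rag", new float[] { 0.4f, 0.5f }),        // wrong dimension count
    ("sk", new float[] { 0.7f, 0.8f, 0.9f }),   // duplicate key, later entry
};

var deduplicated = KeepLastPerKey(batch);
foreach (var (recordKey, recordVector) in deduplicated)
{
    var problems = Validate(recordKey, recordVector, expectedDimensions: 3);
    string summary = problems.Count == 0 ? "ok" : string.Join("; ", problems);
    Console.WriteLine($"{recordKey}: {summary}");
}
```

Validating up front turns several backend-specific failures into one predictable, local error path, at the cost of duplicating a rule (the dimension count) that the record attributes already declare.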

Performance Considerations

Batch vs Single Upsert

Approach                      | Network Calls       | Use When
UpsertAsync(singleRecord)     | One per record      | Ingesting records one at a time (e.g., streaming)
UpsertAsync(recordCollection) | One for all records | Ingesting a batch of pre-prepared records

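The batch overload is generally the better choice for bulk ingestion: where the connector supports batched writes, it sends many records per request and so avoids paying the network round-trip cost once per record.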
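For very large data sets, a single batch may be impractical (for example because of request size limits, which vary by backend), so one possible middle ground is to send fixed-size chunks. The sketch below assumes .NET 6 or later for Enumerable.Chunk; UpsertBatchAsync is a hypothetical stand-in for collection.UpsertAsync(records) that only counts calls, and the batch size of 1000 is an arbitrary assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for collection.UpsertAsync(records): counts calls instead of writing.
int upsertCalls = 0;
Task UpsertBatchAsync(IReadOnlyCollection<string> chunkOfRecords)
{
    upsertCalls++;
    return Task.CompletedTask;
}

// 2,500 placeholder records; a real pipeline would hold populated TRecord instances.
var records = Enumerable.Range(0, 2500).Select(i => $"record-{i}").ToList();

// Assumed batch size; tune it against the limits of the backend in use.
const int batchSize = 1000;

// Chunk splits the list into arrays of at most batchSize items each.
foreach (string[] chunk in records.Chunk(batchSize))
{
    await UpsertBatchAsync(chunk);
}

Console.WriteLine($"{records.Count} records sent in {upsertCalls} calls");
```

Chunking keeps each request bounded while still using far fewer calls than one upsert per record.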

Embedding Generation Bottleneck

In most ingestion pipelines, embedding generation (the API call to the AI service) is the performance bottleneck, not the upsert itself. Consider:

  • Batching embedding requests to reduce API calls
  • Using parallel processing for embedding generation
  • Caching embeddings for data that does not change frequently
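Two of the ideas above, caching and bounded parallelism, can be combined in a few lines. This is a sketch under stated assumptions: EmbedAsync is a hypothetical placeholder for the real embedding service call (it returns a dummy one-element vector), and the concurrency limit of 4 is arbitrary. Batching several inputs into one embedding request is not shown because it depends on the generator in use.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical embedding call; a real pipeline would call the AI service here.
int serviceCalls = 0;
async Task<float[]> EmbedAsync(string text)
{
    Interlocked.Increment(ref serviceCalls);
    await Task.Delay(10);                    // simulated network latency
    return new float[] { text.Length };      // placeholder vector, not a real embedding
}

var cache = new ConcurrentDictionary<string, float[]>();
var gate = new SemaphoreSlim(4);             // assumed limit on concurrent service calls

async Task EmbedIntoCacheAsync(string text)
{
    await gate.WaitAsync();
    try { cache[text] = await EmbedAsync(text); }
    finally { gate.Release(); }
}

var texts = new[] { "Semantic Kernel", "RAG", "Semantic Kernel", "Embeddings" };

// Embed each distinct, uncached text once, with at most four calls in flight.
var missing = texts.Distinct().Where(t => !cache.ContainsKey(t)).ToList();
await Task.WhenAll(missing.Select(t => EmbedIntoCacheAsync(t)));

// Every original text, duplicates included, now resolves from the cache.
var vectors = texts.Select(t => cache[t]).ToList();
Console.WriteLine($"{texts.Length} texts, {serviceCalls} service calls");
```

The semaphore caps pressure on the embedding service, which matters when the service enforces rate limits; the cache here lives only for the process lifetime, so a persistent store would be needed to reuse embeddings across runs.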

Relationship to Principle

This implementation page corresponds to the Data Ingestion principle, which explains the motivation for upsert semantics and the two-step ingestion pattern.

Principle:Microsoft_Semantic_kernel_Data_Ingestion
