Implementation:Microsoft Semantic kernel Collection UpsertAsync
Overview
This page documents the EnsureCollectionExistsAsync and UpsertAsync methods on VectorStoreCollection<TKey, TRecord>, which together form the data ingestion API for persisting embedded records into a vector store collection.
Source Reference
- File: dotnet/samples/GettingStartedWithVectorStores/Step1_Ingest_Data.cs (lines 38-57)
- Type: API Doc
API Reference
EnsureCollectionExistsAsync
await collection.EnsureCollectionExistsAsync();
| Aspect | Detail |
|---|---|
| Returns | Task |
| Parameters | CancellationToken (optional) |
| Behavior | Creates the collection and indexes in the backend if they do not exist; no-op if already present |
| Idempotent | Yes |
This method must be called before any UpsertAsync calls. It reads the TRecord type's attributes to determine:
- What fields to create (from [VectorStoreData] properties)
- Which fields to index (from IsIndexed = true parameters)
- The vector field configuration (from [VectorStoreVector(Dimensions)])
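To make the attribute mapping concrete, here is a minimal sketch of what a record type such as Glossary might look like. The property names mirror the examples on this page; the 1536-dimension value and the choice to index only Category are assumptions for illustration, not taken from the sample file.

```csharp
using System;
using Microsoft.Extensions.VectorData;

// Illustrative record type; property names follow the examples on this page.
public sealed class Glossary
{
    // Identifies the record; upserts with the same key target the same record.
    [VectorStoreKey]
    public string Key { get; set; } = string.Empty;

    // Stored as a data field and marked for indexing.
    [VectorStoreData(IsIndexed = true)]
    public string Category { get; set; } = string.Empty;

    // Stored as plain data fields.
    [VectorStoreData]
    public string Term { get; set; } = string.Empty;

    [VectorStoreData]
    public string Definition { get; set; } = string.Empty;

    // Vector field. 1536 is an assumed dimension count and must match
    // the output size of whichever embedding model is used.
    [VectorStoreVector(1536)]
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}
```

With a type shaped like this, EnsureCollectionExistsAsync has the key, data, index, and vector information it needs without a separate schema definition.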
UpsertAsync (Single Record)
await collection.UpsertAsync(record);
| Aspect | Detail |
|---|---|
| Returns | Task<TKey>: the key of the upserted record |
| Parameters | record (TRecord): the fully populated record to upsert |
| Behavior | Inserts the record if the key does not exist; updates if it does |
UpsertAsync (Batch)
await collection.UpsertAsync(records);
| Aspect | Detail |
|---|---|
| Returns | Task<IReadOnlyList<TKey>>: the keys of all upserted records |
| Parameters | records (IEnumerable<TRecord>): collection of records to upsert |
| Behavior | Batch upsert with the same insert-or-update semantics per record |
Complete Ingestion Example
The following example demonstrates the full ingestion flow from the Semantic Kernel getting started samples:
// Prerequisites: vectorStore and embeddingGenerator are already configured
// Step 1: Get a typed collection
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");
// Step 2: Ensure the collection exists
await collection.EnsureCollectionExistsAsync();
// Step 3: Prepare records with embeddings
var glossaryEntries = new List<Glossary>
{
    new Glossary
    {
        Key = "sk",
        Category = "AI",
        Term = "Semantic Kernel",
        Definition = "Semantic Kernel is a lightweight SDK for integrating AI services."
    },
    new Glossary
    {
        Key = "rag",
        Category = "AI",
        Term = "RAG",
        Definition = "Retrieval Augmented Generation combines search with LLM generation."
    }
};
// Step 4: Generate embeddings for each entry
foreach (var entry in glossaryEntries)
{
    entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
}
// Step 5: Upsert all records
await collection.UpsertAsync(glossaryEntries);
Minimal Ingestion Pattern
The simplest possible ingestion (from the sample code) condenses to:
var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");
await collection.EnsureCollectionExistsAsync();
// Generate an embedding for each entry, then upsert the whole batch
foreach (var entry in glossaryEntries)
{
    entry.DefinitionEmbedding = (await embeddingGenerator.GenerateAsync(entry.Definition)).Vector;
}
await collection.UpsertAsync(glossaryEntries);
Upsert Behavior Details
Key-Based Identity
The upsert operation uses the [VectorStoreKey] property to determine record identity:
// First upsert: creates the record
var record1 = new Glossary { Key = "sk", Term = "Semantic Kernel", ... };
await collection.UpsertAsync(record1);
// Second upsert with same key: updates the record
var record2 = new Glossary { Key = "sk", Term = "Semantic Kernel (Updated)", ... };
await collection.UpsertAsync(record2);
// Only one record exists with Key = "sk", containing the updated data
All Fields Are Replaced
When updating an existing record, UpsertAsync replaces every field with the new values, not just the fields that changed. Because this is a full replacement rather than a partial update, populate all properties before calling UpsertAsync; any property left unset overwrites the previously stored value.
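The replacement behavior can be observed with a small experiment. The following is a sketch only: it assumes the in-memory connector package (Microsoft.SemanticKernel.Connectors.InMemory) is referenced, uses a trimmed-down Glossary type with a 4-dimension placeholder vector, and reads the record back with GetAsync.

```csharp
using System;
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Connectors.InMemory;

var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, Glossary>("skglossary");
await collection.EnsureCollectionExistsAsync();

// Placeholder vector; a real pipeline would use embedding generator output.
var vector = new ReadOnlyMemory<float>(new float[] { 0.1f, 0.2f, 0.3f, 0.4f });

// First upsert: every property is populated.
await collection.UpsertAsync(new Glossary
{
    Key = "sk",
    Category = "AI",
    Term = "Semantic Kernel",
    DefinitionEmbedding = vector
});

// Second upsert with the same key: Category is deliberately left unset.
await collection.UpsertAsync(new Glossary
{
    Key = "sk",
    Term = "Semantic Kernel (Updated)",
    DefinitionEmbedding = vector
});

// Read the record back and inspect which values survived.
var stored = await collection.GetAsync("sk");
Console.WriteLine($"Term: {stored?.Term}, Category: {stored?.Category ?? "(null)"}");

// Trimmed-down illustrative record type (assumption, not the sample's exact class).
public sealed class Glossary
{
    [VectorStoreKey]
    public string Key { get; set; } = string.Empty;

    [VectorStoreData]
    public string? Category { get; set; }

    [VectorStoreData]
    public string? Term { get; set; }

    [VectorStoreVector(4)]
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}
```

If the full-replacement behavior described above holds for the connector in use, the stored Category is no longer "AI" after the second call. A read-modify-write step (GetAsync, change the property, UpsertAsync) is one way to preserve existing values.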
Error Scenarios
| Scenario | Behavior |
|---|---|
| Collection does not exist (missing EnsureCollectionExistsAsync) | Backend-specific error (typically "collection not found") |
| Record missing key value | ArgumentException or backend error |
| Vector property is null or empty | Backend-specific error during indexing |
| Vector dimensions mismatch | Backend-specific error (dimension mismatch) |
| Duplicate keys in batch | Last record with that key wins (upsert semantics) |
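Since most of these failures surface as backend-specific errors, it can help to check a batch before sending it. The helper below is a hypothetical pre-flight check, not part of the vector store API: IngestionChecks, PrepareBatch, and the trimmed-down Glossary type are all names invented for this sketch. It rejects missing keys and wrong-sized vectors, and collapses duplicate keys so the last occurrence wins, mirroring the upsert semantics in the table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for the record type; only the members the checks need.
public sealed class Glossary
{
    public string? Key { get; set; }
    public string? Term { get; set; }
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}

public static class IngestionChecks
{
    // Validates each record and keeps only the last record seen for each key.
    public static IReadOnlyList<Glossary> PrepareBatch(
        IEnumerable<Glossary> records, int expectedDimensions)
    {
        var latestByKey = new Dictionary<string, Glossary>();
        foreach (var record in records)
        {
            if (string.IsNullOrWhiteSpace(record.Key))
                throw new ArgumentException("Record is missing a key value.", nameof(records));

            int length = record.DefinitionEmbedding.Length;
            if (length == 0)
                throw new ArgumentException(
                    $"Record '{record.Key}' has an empty vector.", nameof(records));
            if (length != expectedDimensions)
                throw new ArgumentException(
                    $"Record '{record.Key}' has {length} dimensions; expected {expectedDimensions}.",
                    nameof(records));

            // A later duplicate overwrites the earlier record with the same key.
            latestByKey[record.Key] = record;
        }
        return latestByKey.Values.ToList();
    }
}
```

A caller would pass the result of PrepareBatch to the batch UpsertAsync overload, with expectedDimensions set to the value declared on the vector property.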
Performance Considerations
Batch vs Single Upsert
| Approach | Network Calls | Use When |
|---|---|---|
| UpsertAsync(singleRecord) | One per record | Ingesting records one at a time (e.g., streaming) |
| UpsertAsync(recordCollection) | One for all records | Ingesting a batch of pre-prepared records |
The batch overload is generally more efficient for bulk ingestion because it sends the records in far fewer network round trips than per-record calls.
Embedding Generation Bottleneck
In most ingestion pipelines, embedding generation (the API call to the AI service) is the performance bottleneck, not the upsert itself. Consider:
- Batching embedding requests to reduce API calls
- Using parallel processing for embedding generation
- Caching embeddings for data that does not change frequently
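The first suggestion, batching embedding requests, might look like the sketch below. It assumes the embedding generator is a Microsoft.Extensions.AI IEmbeddingGenerator<string, Embedding<float>> and that embeddings come back in the same order as the inputs. EmbeddingBatcher, EmbedInBatchesAsync, the default batch size of 16, and the trimmed-down Glossary type are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Trimmed-down stand-in for the record type; only the members this helper touches.
public sealed class Glossary
{
    public string Definition { get; set; } = string.Empty;
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}

public static class EmbeddingBatcher
{
    // Sends definitions to the generator in chunks rather than one request per entry.
    public static async Task EmbedInBatchesAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        IEnumerable<Glossary> entries,
        int batchSize = 16,
        CancellationToken cancellationToken = default)
    {
        // Enumerable.Chunk (available from .NET 6) splits the input into arrays of up to batchSize items.
        foreach (Glossary[] chunk in entries.Chunk(batchSize))
        {
            // One request per chunk.
            var embeddings = await generator.GenerateAsync(
                chunk.Select(e => e.Definition),
                cancellationToken: cancellationToken);

            // Assign each embedding to the entry at the same position in the chunk.
            for (int i = 0; i < chunk.Length; i++)
            {
                chunk[i].DefinitionEmbedding = embeddings[i].Vector;
            }
        }
    }
}
```

A suitable batch size depends on the embedding service's request limits, so the default here is only a starting point.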
Relationship to Principle
This implementation page corresponds to the Data Ingestion principle, which explains the motivation for upsert semantics and the two-step ingestion pattern.
Principle:Microsoft_Semantic_kernel_Data_Ingestion