Principle:Microsoft Semantic kernel Data Ingestion

From Leeroopedia
Revision as of 17:18, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Microsoft_Semantic_kernel_Data_Ingestion.md)

Overview

The Data Ingestion principle describes the process of persisting embedded records into a vector store collection. Ingestion is the step that transforms prepared, in-memory data model instances — complete with their text content and computed vector embeddings — into durable, searchable records in the vector store backend.

Semantic Kernel's ingestion API follows upsert semantics (update-or-insert): if a record with the given key already exists, it is updated; if not, it is created. As long as record keys are stable across runs, this makes ingestion idempotent and safe to repeat.

Motivation

A vector store is only useful if it contains data. The ingestion step bridges the gap between data preparation (embedding generation, content formatting) and data utilization (search, retrieval). Several design challenges arise:

  • Idempotency: Re-running an ingestion pipeline should not create duplicate records or corrupt existing data
  • Collection lifecycle: The target collection must exist before records can be inserted
  • Atomicity: Each record operation should succeed or fail on its own, without affecting other records
  • Efficiency: Bulk ingestion of many records should minimize round-trips to the backend

The Data Ingestion principle addresses these challenges through a simple two-step pattern: ensure the collection exists, then upsert the records.

Core Concepts

Upsert Semantics

The term upsert combines "update" and "insert":

  • If a record with the specified key does not exist in the collection, a new record is inserted
  • If a record with the specified key already exists, the existing record is updated with the new values

The vector store abstraction exposes this behavior uniformly across backends, which eliminates the need for separate "check if exists, then insert or update" logic.
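The upsert rule can be illustrated with a minimal in-memory sketch. The `InMemoryCollection` class and its `upsert` and `get` methods are hypothetical stand-ins written for this example; they are not the Semantic Kernel API, and a real backend would persist records remotely.

```python
import asyncio


class InMemoryCollection:
    """Toy stand-in for a vector store collection (hypothetical, not the Semantic Kernel API)."""

    def __init__(self):
        self._records = {}  # key -> record dict

    async def upsert(self, record):
        # Insert when the key is new, overwrite when it already exists.
        self._records[record["key"]] = dict(record)
        return record["key"]

    async def get(self, key):
        return self._records.get(key)

    def count(self):
        return len(self._records)


async def demo():
    collection = InMemoryCollection()
    await collection.upsert({"key": "doc-1", "text": "first draft", "vector": [0.1, 0.2]})
    # Same key again: the record is replaced, not duplicated.
    await collection.upsert({"key": "doc-1", "text": "second draft", "vector": [0.3, 0.4]})
    stored = await collection.get("doc-1")
    return collection.count(), stored["text"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

After both calls the collection holds one record carrying the second set of values, which is the property that makes repeated ingestion safe when keys are stable.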

Two-Step Ingestion Pattern

Every ingestion workflow follows the same two steps:

  1. EnsureCollectionExistsAsync() — Idempotently creates the collection and its indexes if they do not already exist
  2. UpsertAsync(records) — Persists one or more records into the collection using upsert semantics

This pattern is deliberately simple. More complex pipelines (chunking, incremental updates, deduplication) are built on top of these two primitives.
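As a rough illustration of the two steps, the sketch below uses a hypothetical `ToyVectorStore` whose method names merely echo `EnsureCollectionExistsAsync` and `UpsertAsync`. It is an assumption-laden model of the pattern, not the library's implementation.

```python
import asyncio


class ToyVectorStore:
    """Hypothetical in-memory store; mirrors the two primitives by name only."""

    def __init__(self):
        self.collections = {}  # collection name -> {key: record}

    async def ensure_collection_exists(self, name):
        # Idempotent: calling this for an existing collection is a no-op.
        self.collections.setdefault(name, {})

    async def upsert(self, name, records):
        if name not in self.collections:
            raise KeyError(f"collection {name!r} does not exist")
        for record in records:
            self.collections[name][record["key"]] = record


async def ingest(store, name, records):
    await store.ensure_collection_exists(name)  # step 1
    await store.upsert(name, records)           # step 2


async def demo():
    store = ToyVectorStore()
    records = [
        {"key": 1, "text": "alpha", "vector": [1.0, 0.0]},
        {"key": 2, "text": "beta", "vector": [0.0, 1.0]},
    ]
    await ingest(store, "docs", records)
    await ingest(store, "docs", records)  # re-running leaves two records, not four
    return len(store.collections["docs"])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Skipping step 1 in this toy model raises a `KeyError`, which mirrors the requirement that the collection exist before records are written.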

Pre-Ingestion Requirements

Before calling UpsertAsync, each record must be fully prepared:

  • The key property must have a value that is unique within the collection; reusing a key overwrites the existing record
  • All data properties should be populated with their content
  • The vector property must contain a valid embedding of the correct dimensionality

Attempting to upsert a record with a null or empty vector field will result in a runtime error on most backends.
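One way to catch unprepared records before they reach the backend is a small validation pass. The sketch below is illustrative only: the dictionary record shape, the field names `key`, `text`, and `vector`, and the dimension count of 4 are all assumptions made for this example.

```python
import math

EXPECTED_DIMENSIONS = 4  # assumed embedding size for this sketch


def validate_record(record, seen_keys, dimensions=EXPECTED_DIMENSIONS):
    """Return a list of problems; an empty list means the record is ready."""
    problems = []
    key = record.get("key")
    if key is None:
        problems.append("missing key")
    elif key in seen_keys:
        problems.append(f"duplicate key {key!r}")
    if not record.get("text"):
        problems.append("empty data property 'text'")
    vector = record.get("vector")
    if not vector:
        problems.append("missing or empty vector")
    elif len(vector) != dimensions:
        problems.append(f"vector has {len(vector)} dimensions, expected {dimensions}")
    elif not all(math.isfinite(x) for x in vector):
        problems.append("vector contains NaN or infinite values")
    return problems


def partition(records):
    """Split records into those ready for upsert and those rejected with reasons."""
    seen, ready, rejected = set(), [], []
    for record in records:
        problems = validate_record(record, seen)
        if problems:
            rejected.append((record.get("key"), problems))
        else:
            seen.add(record["key"])
            ready.append(record)
    return ready, rejected
```

Only the `ready` list would then be passed to the upsert call, while `rejected` can be logged for later repair.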

Design Principles

Simplicity Over Flexibility

The ingestion API intentionally provides a single UpsertAsync method rather than separate insert and update methods. This reduces cognitive load and eliminates an entire category of "does it exist?" conditional logic.

Collection-Level Operations

All ingestion operations happen at the collection level, not the store level. This means:

  • You must first obtain a typed collection reference via GetCollection
  • The collection enforces type safety — only records of the correct type can be upserted
  • Different collections can be ingested independently and concurrently

Async-First Design

All ingestion methods are asynchronous, reflecting the reality that vector store backends are typically remote services. The await-based API avoids blocking threads while waiting on I/O and integrates naturally with the .NET async ecosystem.
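The collection-level and async-first ideas can be sketched together. In the example below, `ToyCollection`, `Hotel`, and `Article` are hypothetical names invented for illustration, the type check is a runtime approximation of the compile-time safety a typed collection provides, and `asyncio.sleep` stands in for a network round-trip.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Hotel:
    key: int
    name: str
    vector: list


@dataclass
class Article:
    key: int
    title: str
    vector: list


class ToyCollection:
    """Hypothetical typed collection: accepts only one record type."""

    def __init__(self, name, record_type):
        self.name = name
        self.record_type = record_type
        self.records = {}

    async def upsert(self, record):
        if not isinstance(record, self.record_type):
            raise TypeError(f"{self.name} only accepts {self.record_type.__name__}")
        await asyncio.sleep(0.01)  # simulate a remote call without blocking the thread
        self.records[record.key] = record


async def ingest(collection, records):
    for record in records:
        await collection.upsert(record)
    return len(collection.records)


async def demo():
    hotels = ToyCollection("hotels", Hotel)
    articles = ToyCollection("articles", Article)
    # The two collections are independent, so their ingestion can overlap.
    return await asyncio.gather(
        ingest(hotels, [Hotel(1, "Seaside", [0.1, 0.9]), Hotel(2, "Alpine", [0.8, 0.2])]),
        ingest(articles, [Article(1, "Upserts explained", [0.5, 0.5])]),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

While one collection waits on its simulated round-trip, the event loop advances the other, which is the benefit the async design is meant to deliver.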

Ingestion Pipeline Overview

A typical ingestion pipeline consists of these stages:

  1. Data acquisition: Load raw data from files, databases, APIs, or other sources
  2. Data transformation: Clean, chunk, and format the data into record instances
  3. Embedding generation: Generate vector embeddings for text fields using IEmbeddingGenerator
  4. Collection setup: Ensure the target collection exists
  5. Record persistence: Upsert the prepared records into the collection

Steps 1-3 are application-specific. Steps 4-5 use the Semantic Kernel vector store API documented on the corresponding implementation page.
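The five stages can be sketched end to end as follows. Everything here is a simplification under stated assumptions: `toy_embed` is a hash-based placeholder rather than a real embedding model, the store is a plain dictionary, and keys are derived from a source identifier plus chunk index so that re-runs overwrite rather than duplicate.

```python
import hashlib

DIMENSIONS = 8  # assumed size for the toy embedding


def chunk(text, max_words=5):
    """Split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    # Placeholder for a real embedding model: hash bytes scaled to [0, 1].
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:DIMENSIONS]]


def run_pipeline(store, collection_name, documents):
    # Stage 1: data acquisition (documents is already loaded as {source_id: text}).
    records = []
    for source_id, text in documents.items():
        # Stage 2: transformation into record instances with stable keys.
        for index, piece in enumerate(chunk(text)):
            records.append({"key": f"{source_id}#{index}", "text": piece})
    # Stage 3: embedding generation.
    for record in records:
        record["vector"] = toy_embed(record["text"])
    # Stage 4: collection setup (no-op if the collection already exists).
    collection = store.setdefault(collection_name, {})
    # Stage 5: record persistence with upsert semantics.
    for record in records:
        collection[record["key"]] = record
    return len(records)


if __name__ == "__main__":
    store = {}
    docs = {"a": "one two three four five six seven", "b": "hello world"}
    run_pipeline(store, "docs", docs)
    run_pipeline(store, "docs", docs)  # second run overwrites the same keys
    print(sorted(store["docs"]))
```

Deriving keys deterministically from the source is a design choice of this sketch; it is what lets the second run leave the record count unchanged.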

Bulk Ingestion Considerations

When ingesting large datasets:

  • Batch embedding generation: Generate embeddings for multiple records in a single API call to reduce latency
  • Batch upsert: The overload of UpsertAsync that accepts a collection of records is more efficient than upserting records one at a time
  • Error handling: Wrap each upsert call in a try-catch block so that a failed record or batch can be logged and retried without aborting the entire ingestion
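A minimal sketch of batched upserts with per-batch error handling is shown below. `FlakyCollection` and `upsert_many` are hypothetical names for this example, the code is synchronous for brevity, and the all-or-nothing behavior of a failed batch is an assumption of the toy model rather than a guarantee of any real backend.

```python
def batched(items, size):
    """Yield consecutive slices of at most size items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class FlakyCollection:
    """Hypothetical collection that rejects any batch containing a record without a vector."""

    def __init__(self):
        self.records = {}
        self.calls = 0  # number of simulated round-trips

    def upsert_many(self, records):
        self.calls += 1
        # Validate the whole batch first so a failed batch writes nothing.
        for record in records:
            if not record.get("vector"):
                raise ValueError(f"record {record['key']!r} has no vector")
        for record in records:
            self.records[record["key"]] = record


def ingest_in_batches(collection, records, batch_size):
    """Upsert in batches; collect failures instead of aborting the run."""
    failed = []
    for batch in batched(records, batch_size):
        try:
            collection.upsert_many(batch)
        except ValueError as error:
            failed.append(([r["key"] for r in batch], str(error)))
    return failed


if __name__ == "__main__":
    records = [{"key": i, "vector": [0.5]} for i in range(5)]
    records[3]["vector"] = []  # one bad record
    collection = FlakyCollection()
    print(ingest_in_batches(collection, records, batch_size=2))
    print(sorted(collection.records), collection.calls)
```

Five records with a batch size of 2 need three calls instead of five, and the batch containing the bad record is reported while the others are still stored. Failed batches could then be retried record by record to isolate the offender.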

Relationship to Other Principles

Implementation:Microsoft_Semantic_kernel_Collection_UpsertAsync
