Principle:Microsoft Semantic kernel Vector Similarity Search

Overview

The Vector Similarity Search principle describes how semantically similar records are found by comparing a query embedding against stored vector embeddings in a collection. This is the core retrieval mechanism in the Vector Store RAG Pipeline — the step that transforms a user's natural language question into a ranked list of relevant records.

Unlike traditional keyword search, vector similarity search operates in a continuous geometric space where meaning is encoded as position. Two pieces of text with similar meaning produce vectors that are geometrically close, enabling retrieval based on semantic relevance rather than lexical overlap.

Motivation

The fundamental challenge of information retrieval is matching a user's intent with the most relevant content. Keyword-based approaches fail when:

The user phrases the question differently than the stored content
Relevant content uses synonyms, abbreviations, or domain-specific terminology
The relationship between query and content is conceptual rather than lexical

Vector similarity search overcomes these limitations by operating in embedding space, where semantic similarity maps to geometric proximity. A search for "how to deploy a machine learning model" will find content about "ML model deployment" and "putting AI systems into production" because these phrases occupy nearby regions in the vector space.

Core Concepts

Distance Metrics

Vector similarity is measured using mathematical distance functions. The most common metrics are:

Cosine similarity: Measures the angle between two vectors, ignoring magnitude. Values range from -1 (opposite) to 1 (identical direction). This is the most widely used metric for text embeddings.
Euclidean distance: Measures the straight-line distance between two vector endpoints. Smaller values indicate greater similarity.
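Dot product: Equals the product of the two vector magnitudes and the cosine of the angle between them, so it reflects both direction and length. Higher values indicate greater similarity. Useful when vector magnitude carries meaning; for unit-length vectors it is identical to cosine similarity.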

The choice of distance metric is typically configured at the collection level and must be consistent between index creation and search time.
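As a rough illustration of the three metrics above, the following Python sketch computes each one using only the standard library. The function names are illustrative assumptions and are not part of Semantic Kernel; a real vector store evaluates the configured metric inside the database.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; sensitive to both angle and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes, so only the angle matters."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
longer = [2.0, 0.0]     # same direction as the query, twice the magnitude
opposite = [-1.0, 0.0]  # points the other way

# Cosine ignores the magnitude difference; Euclidean and dot product do not.
print(cosine_similarity(query, longer))
print(euclidean_distance(query, longer))
print(dot_product(query, longer))
print(cosine_similarity(query, opposite))
```

The example vectors are chosen so that the difference between the metrics is visible: the query and the longer vector are identical in direction, yet they are a nonzero Euclidean distance apart.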

Top-K Retrieval

Vector search returns the top K most similar records, where K is specified by the caller. The search process:

Computes the distance between the query vector and every stored vector (or uses an approximate nearest neighbor index for efficiency)
Ranks results by similarity score
Returns the top K results

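The top parameter (K) controls the tradeoff between recall (finding all relevant records) and precision (returning only highly relevant records): a larger K is more likely to include every relevant record but also admits weaker matches, while a smaller K returns only the closest matches and may miss relevant ones.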
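The exhaustive variant of the process above can be sketched in a few lines of Python. This is a minimal in-memory illustration that assumes cosine similarity as the metric; the top_k function and the sample data are hypothetical, and production stores typically replace the full scan with an approximate nearest neighbor index.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Score every stored vector, then keep the k highest scores.

    Returns a list of (key, score) pairs ordered from most to least similar.
    """
    scored = ((key, cosine_similarity(query, vec)) for key, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


stored = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

for key, score in top_k([1.0, 0.0], stored, k=2):
    print(key, round(score, 3))
```

Asking for more results than the collection holds simply returns every record, ranked.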

Similarity Score

Each search result includes a similarity score that quantifies how close the result's vector is to the query vector. The interpretation of the score depends on the distance metric:

For cosine similarity: Higher scores indicate greater similarity (closer to 1.0)
For Euclidean distance: Lower scores indicate greater similarity (closer to 0.0)

The score can be used for:

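Thresholding: Discarding results that fall on the wrong side of a relevance cutoff (below a minimum score for cosine similarity, above a maximum distance for Euclidean distance)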
Ranking: Ordering results by relevance in a UI
Weighting: Giving more weight to highly similar results in downstream processing
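Because the direction of "better" depends on the metric, a threshold check has to know which way to compare. The sketch below shows one way to express that; filter_by_score and its higher_is_better flag are hypothetical helpers for illustration, not Semantic Kernel API.

```python
def filter_by_score(results, threshold, higher_is_better=True):
    """Keep only (record, score) pairs on the relevant side of the threshold.

    higher_is_better=True suits similarity scores such as cosine similarity;
    higher_is_better=False suits distances such as Euclidean distance.
    """
    if higher_is_better:
        return [(record, score) for record, score in results if score >= threshold]
    return [(record, score) for record, score in results if score <= threshold]


cosine_results = [("doc-1", 0.91), ("doc-2", 0.62), ("doc-3", 0.18)]
euclidean_results = [("doc-1", 0.20), ("doc-2", 0.95), ("doc-3", 1.70)]

# Keep cosine scores of at least 0.6, and Euclidean distances of at most 1.0.
print(filter_by_score(cosine_results, 0.6))
print(filter_by_score(euclidean_results, 1.0, higher_is_better=False))
```

The threshold values here are arbitrary; a suitable cutoff depends on the embedding model and the data, and is usually found by inspecting real query results.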

Streaming Results

Semantic Kernel's search API returns results as an IAsyncEnumerable, which enables streaming consumption. The caller receives results one at a time as they become available and does not have to buffer the entire result set in memory. This is useful for:

Large result sets where only the first few results may be needed
Pipelines that process results incrementally
Memory-constrained environments
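The same consumption pattern can be illustrated with a Python async generator, which plays a role similar to IAsyncEnumerable in .NET. This is an analogy only: search_stream and first_n are hypothetical stand-ins, and the sketch does not call Semantic Kernel.

```python
import asyncio

produced = []  # records which results the stream actually produced


async def search_stream(scored_results):
    """Hypothetical streaming search: yields one (key, score) pair at a time."""
    for result in scored_results:
        await asyncio.sleep(0)  # stands in for waiting on the vector store
        produced.append(result[0])
        yield result


async def first_n(stream, n):
    """Pull at most n results, then stop iterating the stream."""
    taken = []
    if n <= 0:
        return taken
    async for item in stream:
        taken.append(item)
        if len(taken) >= n:
            break
    return taken


results = [("doc-1", 0.93), ("doc-2", 0.88), ("doc-3", 0.41)]
top_two = asyncio.run(first_n(search_stream(results), 2))
print(top_two)
print(produced)
```

Because the consumer stops after two items, the third result is never produced by the generator, which is the behavior that makes streaming attractive when only the first few results are needed.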

Search Flow

A typical vector similarity search follows these steps:

Embed the query: Convert the user's natural language query into a vector using the same embedding model used during ingestion
Execute the search: Pass the query vector to SearchAsync with the desired top count
Consume results: Iterate over the returned IAsyncEnumerable of scored results
Use the results: Display to the user, pass to an LLM for augmented generation, or process further
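The four steps above can be walked through with a small in-memory sketch. Everything here is an assumption made for illustration: embed is a toy word-counting function standing in for a real embedding model (it captures no semantics), and search is a hypothetical stand-in for the store's SearchAsync call.

```python
import math

VOCABULARY = ["deploy", "model", "production", "recipe", "pasta"]


def embed(text):
    """Toy stand-in for an embedding model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def search(collection, query_vector, top):
    """Hypothetical stand-in for the store's search call."""
    scored = [(rec, cosine_similarity(query_vector, rec["vector"])) for rec in collection]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Ingestion (done ahead of time) uses the same embed function as the query.
texts = ["deploy the model to production", "pasta recipe with tomato"]
collection = [{"text": t, "vector": embed(t)} for t in texts]

# 1. Embed the query.
query_vector = embed("how to deploy a model")
# 2. Execute the search with the desired top count.
results = search(collection, query_vector, top=1)
# 3 and 4. Consume the scored results and use them.
for record, score in results:
    print(f"{score:.2f}  {record['text']}")
```

Note that the query and the stored texts pass through the same embed function, which mirrors the requirement that ingestion and search share one embedding model.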

Design Principles

Model Consistency

The query embedding must be generated by the same model used to create the stored embeddings. Different models produce different vector spaces, often with different dimensions, so a vector from Model A has no meaningful spatial relationship to vectors from Model B.

Separation of Embedding and Search

The search API accepts a pre-computed vector, not raw text. This separation means:

The caller controls when and how embeddings are generated
The same embedding can be reused for multiple searches (e.g., against different collections)
The search API is embedding-model-agnostic

Result Composition

Each search result contains both the full record and the similarity score. The record includes all stored data fields, making it immediately usable without a separate lookup. This design avoids the N+1 query problem common in systems where search returns only IDs.
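The shape of such a result can be sketched with two small Python dataclasses. The names Record, ScoredResult, and build_context are hypothetical and chosen for illustration; they are not the Semantic Kernel types.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    text: str
    category: str


@dataclass
class ScoredResult:
    record: Record  # the full stored record, not just its key
    score: float    # similarity score relative to the query vector


def build_context(results):
    """Format results for a prompt using only fields already on each result."""
    return "\n".join(f"[{r.score:.2f}] {r.record.text}" for r in results)


results = [
    ScoredResult(Record("1", "Deploying models with containers", "ops"), 0.90),
    ScoredResult(Record("2", "Monitoring models in production", "ops"), 0.75),
]
print(build_context(results))
```

Since each result already carries its record, build_context needs no second round trip to the store to fetch the text by key.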

Relationship to Other Principles

Embedding Generation produces the query vector passed to search
Data Ingestion populates the collection being searched
Metadata Filtering narrows search results using data field predicates
RAG Chat Augmentation consumes search results to augment LLM prompts

Implementation:Microsoft_Semantic_kernel_Collection_SearchAsync
