Principle:Microsoft Semantic kernel Vector Similarity Search
Overview
The Vector Similarity Search principle describes how semantically similar records are found by comparing a query embedding against stored vector embeddings in a collection. This is the core retrieval mechanism in the Vector Store RAG Pipeline — the step that transforms a user's natural language question into a ranked list of relevant records.
Unlike traditional keyword search, vector similarity search operates in a continuous geometric space where meaning is encoded as position. Two pieces of text with similar meaning produce vectors that are geometrically close, enabling retrieval based on semantic relevance rather than lexical overlap.
Motivation
The fundamental challenge of information retrieval is matching a user's intent with the most relevant content. Keyword-based approaches fail when:
- The user phrases the question differently than the stored content
- Relevant content uses synonyms, abbreviations, or domain-specific terminology
- The relationship between query and content is conceptual rather than lexical
Vector similarity search overcomes these limitations by operating in embedding space, where semantic similarity maps to geometric proximity. A search for "how to deploy a machine learning model" will find content about "ML model deployment" and "putting AI systems into production" because these phrases occupy nearby regions in the vector space.
Core Concepts
Distance Metrics
Vector similarity is measured using mathematical distance functions. The most common metrics are:
- Cosine similarity: Measures the angle between two vectors, ignoring magnitude. Values range from -1 (opposite) to 1 (identical direction). This is the most widely used metric for text embeddings.
- Euclidean distance: Measures the straight-line distance between two vector endpoints. Smaller values indicate greater similarity.
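- Dot product: Equals the product of the two vectors' magnitudes and the cosine of the angle between them, so larger values indicate greater similarity. Useful when vector magnitude carries meaning; for unit-length vectors it ranks results the same way as cosine similarity.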
The choice of distance metric is typically configured at the collection level and must be consistent between index creation and search time.
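The three metrics can be written out directly. The following is a minimal pure-Python sketch for illustration only: the function names are assumptions made for this example and are not part of Semantic Kernel, where the metric is typically configured on the collection rather than computed by the caller.

```python
import math


def _check_same_length(a, b):
    # Vectors produced by the same embedding model share one dimension count.
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def dot_product(a, b):
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
longer_same_direction = [3.0, 0.0]
orthogonal = [0.0, 2.0]

# Cosine ignores magnitude; Euclidean distance and dot product do not.
print(cosine_similarity(query, longer_same_direction))
print(euclidean_distance(query, longer_same_direction))
print(dot_product(query, longer_same_direction))
print(cosine_similarity(query, orthogonal))
```

The first two vectors point the same way but differ in length, which is why cosine similarity treats them as identical while Euclidean distance does not.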
Top-K Retrieval
Vector search returns the top K most similar records, where K is specified by the caller. The search process:
- Computes the distance between the query vector and every stored vector (or uses an approximate nearest neighbor index for efficiency)
- Ranks results by similarity score
- Returns the top K results
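The top parameter controls the tradeoff between recall (finding all relevant records) and precision (returning only highly relevant records): a larger value surfaces more of the relevant records but also admits weaker matches, while a smaller value keeps only the closest matches and may miss relevant ones.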
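The exhaustive variant of this process can be sketched in a few lines. This is an illustrative brute-force version that assumes cosine similarity and a small in-memory list of (key, vector) pairs; the names `top_k` and `stored_vectors` are hypothetical, and production vector stores usually rely on an approximate nearest neighbor index instead.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Exact search: score every stored vector, then keep the best k."""
    scored = ((key, cosine_similarity(query_vector, vector)) for key, vector in stored)
    # nlargest returns the k highest-scoring pairs, ordered best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


stored_vectors = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]

results = top_k([1.0, 0.0], stored_vectors, k=2)
result_keys = [key for key, _ in results]
print(result_keys)
```

With this data, "a" points in the same direction as the query and "c" lies at 45 degrees to it, so those two are returned and the orthogonal "b" is dropped.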
Similarity Score
Each search result includes a similarity score that quantifies how close the result's vector is to the query vector. The interpretation of the score depends on the distance metric:
- For cosine similarity: Higher scores indicate greater similarity (closer to 1.0)
- For Euclidean distance: Lower scores indicate greater similarity (closer to 0.0)
The score can be used for:
- Thresholding: Discarding results below a minimum relevance threshold
- Ranking: Ordering results by relevance in a UI
- Weighting: Giving more weight to highly similar results in downstream processing
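Because the direction of "better" depends on the metric, thresholding code has to know which metric produced the scores. A minimal sketch, assuming results arrive as (record, score) pairs; both helper names are hypothetical and the threshold values are arbitrary examples, not recommended defaults.

```python
def filter_by_threshold(scored_results, threshold, higher_is_better=True):
    """Keep only results that are at least as good as the threshold."""
    if higher_is_better:
        # Cosine similarity or dot product: larger scores are closer matches.
        return [(r, s) for r, s in scored_results if s >= threshold]
    # Euclidean distance: smaller scores are closer matches.
    return [(r, s) for r, s in scored_results if s <= threshold]


def relative_weights(scored_results):
    """Normalize positive, higher-is-better scores so that they sum to 1."""
    total = sum(score for _, score in scored_results)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [(record, score / total) for record, score in scored_results]


cosine_results = [("doc-1", 0.9), ("doc-2", 0.6), ("doc-3", 0.2)]
kept = filter_by_threshold(cosine_results, threshold=0.5)
weights = relative_weights(kept)

euclidean_results = [("doc-1", 0.3), ("doc-2", 1.8)]
near = filter_by_threshold(euclidean_results, threshold=1.0, higher_is_better=False)
```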
Streaming Results
Semantic Kernel's search API returns results as an IAsyncEnumerable, which enables streaming consumption. Results are yielded one at a time as they become available, rather than buffering the entire result set in memory. This is efficient for:
- Large result sets where only the first few results may be needed
- Pipelines that process results incrementally
- Memory-constrained environments
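The streaming behaviour can be illustrated with a Python async generator, which is a rough analogue of IAsyncEnumerable. This is only an analogy and not the Semantic Kernel API: `search_stream`, `first_n`, and the `produced` list are hypothetical names used to show that stopping early means later results are never produced.

```python
import asyncio

produced = []  # records which results the producer actually generated


async def search_stream(scored_results):
    for item in scored_results:
        await asyncio.sleep(0)  # stands in for waiting on the vector store
        produced.append(item)
        yield item


async def first_n(stream, n):
    collected = []
    if n <= 0:
        return collected
    async for item in stream:
        collected.append(item)
        if len(collected) >= n:
            break  # stop consuming; remaining results are never requested
    return collected


all_results = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
taken = asyncio.run(first_n(search_stream(all_results), 2))
print(taken)
print(len(produced))
```

Only two items are appended to `produced`, because the consumer stops iterating before the third result is requested.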
Search Flow
A typical vector similarity search follows these steps:
- Embed the query: Convert the user's natural language query into a vector using the same embedding model used during ingestion
- Execute the search: Pass the query vector to SearchAsync with the desired top count
- Consume results: Iterate over the returned IAsyncEnumerable of scored results
- Use the results: Display to the user, pass to an LLM for augmented generation, or process further
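The four steps can be traced end to end with a toy example. Everything here is a stand-in chosen for illustration: `toy_embed` merely counts words from a hypothetical four-term vocabulary (so it does not capture meaning the way a real embedding model does), and `search` is a plain in-memory function, not Semantic Kernel's SearchAsync.

```python
import math

# Hypothetical vocabulary; a real embedding model is learned, not hand-written.
VOCABULARY = ["deploy", "model", "train", "data"]


def toy_embed(text):
    """Stand-in embedding: counts vocabulary terms in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def search(query_vector, collection, top):
    scored = [(rec, cosine_similarity(query_vector, rec["vector"])) for rec in collection]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


texts = ["deploy the model to production", "train the model on data", "clean the data"]
# Ingestion: embed every record with the same function later used for queries.
collection = [
    {"key": i, "text": t, "vector": toy_embed(t)} for i, t in enumerate(texts, start=1)
]

query_vector = toy_embed("how to deploy a model")           # 1. embed the query
hits = search(query_vector, collection, top=2)              # 2. execute the search
hit_keys = [record["key"] for record, _ in hits]            # 3. consume results
context = "\n".join(record["text"] for record, _ in hits)   # 4. use the results
print(hit_keys)
print(context)
```

The `context` string is the kind of text that would be passed to an LLM in the augmented-generation case.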
Design Principles
Model Consistency
The query embedding must be generated by the same model used to create the stored embeddings. This is because different models produce different vector spaces — a vector from Model A has no meaningful spatial relationship to vectors from Model B.
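One way to guard against accidental model mismatches is to record which model produced a collection's vectors and check it before searching. The sketch below is an assumption about how an application might do this, not a Semantic Kernel feature; the `Collection` class and the model identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    embedding_model: str  # hypothetical identifier recorded at ingestion time
    dimensions: int

    def check_query(self, query_model, query_vector):
        if query_model != self.embedding_model:
            raise ValueError(
                f"query embedded with {query_model!r}, "
                f"collection uses {self.embedding_model!r}"
            )
        # Equal dimensions are necessary but not sufficient: two different
        # models can emit vectors of the same length in unrelated spaces.
        if len(query_vector) != self.dimensions:
            raise ValueError("query vector has the wrong number of dimensions")


collection = Collection(embedding_model="model-a", dimensions=3)
collection.check_query("model-a", [0.1, 0.2, 0.3])  # passes silently

try:
    collection.check_query("model-b", [0.1, 0.2, 0.3])
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```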
Separation of Embedding and Search
The search API accepts a pre-computed vector, not raw text. This separation means:
- The caller controls when and how embeddings are generated
- The same embedding can be reused for multiple searches (e.g., against different collections)
- The search API is embedding-model-agnostic
Result Composition
Each search result contains both the full record and the similarity score. The record includes all stored data fields, making it immediately usable without a separate lookup. This design avoids the N+1 query problem common in systems where search returns only IDs.
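The shape of such a result can be sketched as a small generic container. The class names below are hypothetical and do not mirror Semantic Kernel's actual result type; the point is only that the record travels with its score, so no follow-up lookup by ID is needed.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

TRecord = TypeVar("TRecord")


@dataclass(frozen=True)
class Article:
    key: str
    title: str
    body: str


@dataclass(frozen=True)
class ScoredResult(Generic[TRecord]):
    record: TRecord  # the full stored record, not just its key
    score: float     # similarity score for this record


results = [
    ScoredResult(Article("a1", "ML model deployment", "Steps for serving a model."), 0.91),
    ScoredResult(Article("a2", "Putting AI systems into production", "Rollout notes."), 0.84),
]

# Data fields are available directly on each result.
labels = [f"{r.record.title} ({r.score:.2f})" for r in results]
print(labels)
```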
Relationship to Other Principles
- Embedding Generation produces the query vector passed to search
- Data Ingestion populates the collection being searched
- Metadata Filtering narrows search results using data field predicates
- RAG Chat Augmentation consumes search results to augment LLM prompts
Implementation:Microsoft_Semantic_kernel_Collection_SearchAsync