Principle: Microsoft Semantic Kernel RAG Chat Augmentation
Overview
The RAG Chat Augmentation principle describes the final stage of the Vector Store RAG Pipeline: taking retrieved search results and using them to augment the context provided to a large language model (LLM) during chat completion. RAG (Retrieval-Augmented Generation) bridges the gap between an LLM's static training data and an organization's dynamic, proprietary knowledge.
Rather than relying solely on what the LLM "knows" from training, RAG injects relevant, up-to-date context retrieved from a vector store directly into the prompt, enabling the model to generate accurate, grounded responses.
Motivation
Large language models have two fundamental limitations that RAG addresses:
- Knowledge cutoff: LLMs are trained on data up to a specific date and have no awareness of events, documents, or information created after that cutoff
- Hallucination: When asked about topics outside their training data (or when confident but wrong), LLMs may generate plausible-sounding but incorrect information
RAG mitigates both issues by:
- Retrieving relevant information from a vector store at query time
- Augmenting the LLM prompt with this retrieved context
- Generating a response that is grounded in the provided context
The result is an LLM that can answer questions about private, proprietary, or recent data while maintaining the fluency and reasoning capabilities of the base model.
Core Concepts
The RAG Pipeline
The complete RAG pipeline consists of three phases that correspond to the other principles in this workflow:
- Retrieval: The user's question is embedded and used to search the vector store for relevant records (see Vector Similarity Search)
- Augmentation: The retrieved records are formatted and injected into the LLM prompt as contextual information
- Generation: The LLM generates a response that draws on both its training data and the injected context
The augmentation phase is the bridge between retrieval and generation.
Prompt Template Integration
Semantic Kernel integrates RAG into chat completion through its prompt template system. Retrieved search results are injected into a template that instructs the LLM to use the provided context when answering. The Handlebars template syntax provides the mechanism:
- The template defines a placeholder where search results are inserted
- At runtime, the search plugin retrieves relevant records based on the user's question
- The template engine renders the results into the prompt
- The complete prompt (with context) is sent to the LLM
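The render step can be illustrated with a minimal stand-in for the template engine. The `{{context}}` and `{{question}}` placeholder names and the `render_prompt` helper below are illustrative assumptions, not Semantic Kernel's actual Handlebars API:

```python
# Minimal stand-in for the prompt-template render step (illustrative only;
# not Semantic Kernel's actual Handlebars engine or placeholder names).
TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{{context}}\n"
    "Question: {{question}}"
)

def render_prompt(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with its value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template

# Retrieved records are joined and inserted where the placeholder sits.
results = ["Record 1 text.", "Record 2 text."]
prompt = render_prompt(TEMPLATE, {
    "context": "\n".join(results),
    "question": "What does record 1 say?",
})
```

The complete rendered string, context included, is what gets sent to the chat completion model.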
The Search Plugin Pattern
Rather than manually embedding queries and calling search APIs, Semantic Kernel encapsulates the entire search pipeline into a kernel plugin. This plugin:
- Accepts a text query (the user's question)
- Internally generates an embedding for the query
- Executes a vector similarity search against the configured collection
- Returns the text content of the matching records
Because search is wrapped as a plugin, it becomes usable within prompt templates, function calling, and other Semantic Kernel orchestration patterns.
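The four responsibilities above can be sketched as a single class. This is a toy: the bag-of-words embedding and in-memory collection stand in for a real embedding model and vector store, and `SearchPlugin` is a hypothetical name, not Semantic Kernel's VectorStoreTextSearch type:

```python
import math

class SearchPlugin:
    """Toy search plugin: accepts a text query, embeds it internally,
    runs a cosine-similarity search over an in-memory collection, and
    returns the text of the best-matching records."""

    def __init__(self, records: list[str]):
        self.records = records
        self.vectors = [self._embed(r) for r in records]

    def _embed(self, text: str) -> dict[str, float]:
        """Bag-of-words stand-in for a real embedding model."""
        vec: dict[str, float] = {}
        for word in text.lower().split():
            vec[word] = vec.get(word, 0.0) + 1.0
        return vec

    @staticmethod
    def _cosine(a: dict, b: dict) -> float:
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query: str, top: int = 2) -> list[str]:
        qvec = self._embed(query)
        scored = sorted(
            zip(self.records, self.vectors),
            key=lambda rv: self._cosine(qvec, rv[1]),
            reverse=True,
        )
        return [record for record, _ in scored[:top]]
```

The caller only ever passes text in and gets text out; the embedding and vector search stay internal, which is what makes the plugin drop-in usable from a prompt template.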
Context Window Management
The amount of context injected into the prompt must be carefully managed:
- Too little context: The LLM lacks sufficient information to answer accurately
- Too much context: The prompt exceeds the LLM's context window, or the model becomes confused by irrelevant information
- Optimal context: A focused set of the most relevant records that directly addresses the question
The top parameter in vector search (the maximum number of records returned) and the template design both influence how much context reaches the LLM.
Design Principles
Separation of Retrieval and Generation
The retrieval and generation steps are decoupled. The search plugin handles retrieval independently of the chat completion model. This means:
- Different embedding models and chat models can be used (as long as the embedding model matches what was used at ingestion)
- The retrieval logic can be tested independently of the LLM
- The same retrieved context can be used with different prompt templates or models
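The decoupling can be made concrete by treating retrieval and generation as independently injected functions. The `answer` helper and the stub callables here are hypothetical, sketching only the separation itself:

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], list[str]],
           generate: Callable[[str], str]) -> str:
    """Retrieval and generation are passed in separately, so either
    side can be swapped or tested on its own."""
    context = "\n".join(retrieve(question))
    return generate(f"Context:\n{context}\nQuestion: {question}")

# Retrieval can be exercised with no LLM in the loop at all:
fake_retrieve = lambda q: ["doc about " + q]
assert fake_retrieve("pricing") == ["doc about pricing"]

# And generation can be stubbed to test the wiring:
echo_generate = lambda prompt: prompt
result = answer("pricing", fake_retrieve, echo_generate)
```

Swapping `echo_generate` for a call to a different chat model changes nothing on the retrieval side, which is the point of the separation.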
Grounded Generation
The prompt template explicitly instructs the LLM to base its response on the provided context. This grounding instruction is critical for:
- Reducing hallucination by constraining the model to provided facts
- Enabling citation of sources when records include source metadata
- Allowing the model to say "I don't have enough information" when the context is insufficient
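A grounding instruction is ultimately just text prepended to the context and question. The exact wording below is an illustrative assumption, not Semantic Kernel's built-in template:

```python
# Hypothetical grounding instruction; real templates vary in wording.
GROUNDING_INSTRUCTION = (
    "Answer using ONLY the context below. "
    "If the context does not contain the answer, reply: "
    "I don't have enough information."
)

def build_grounded_prompt(context: str, question: str) -> str:
    """Assemble instruction, retrieved context, and question into one prompt."""
    return f"{GROUNDING_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("Widgets cost $5.", "What do widgets cost?")
```

Including an explicit fallback phrase gives the model a sanctioned way out when retrieval returns nothing useful, instead of inviting a guess.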
Plugin Composability
Because RAG search is exposed as a kernel plugin, it composes with other Semantic Kernel capabilities:
- Planners can decide when to invoke the search plugin based on the user's intent
- Function calling allows the LLM itself to decide when to retrieve context
- Multiple search plugins can be registered for different knowledge domains
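Registering multiple domain-specific plugins can be sketched as a simple lookup table; in practice a planner or the model's function-calling step would pick the plugin, and the names here are invented for illustration:

```python
# Hypothetical registry of per-domain search plugins. Each entry stands
# in for a full search pipeline over that domain's collection.
plugins = {
    "hr": lambda q: [f"HR policy matching '{q}'"],
    "engineering": lambda q: [f"Design doc matching '{q}'"],
}

def route(domain: str, question: str) -> list[str]:
    """Dispatch the question to the search plugin for the given domain."""
    return plugins[domain](question)

hits = route("hr", "vacation policy")
```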
The Complete RAG Flow
The end-to-end flow in a chat application:
- User sends a question
- The application creates a prompt from a template that includes a search plugin invocation
- The search plugin embeds the question and searches the vector store
- Retrieved records are injected into the prompt as context
- The augmented prompt is sent to the chat completion model
- The LLM generates a response grounded in the retrieved context
- The response is returned to the user
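The steps above can be wired together in a compact sketch. The word-overlap search and the `generate` stub are toy stand-ins for the real embedding search and chat completion call:

```python
# End-to-end sketch of the flow above, with a stubbed chat model.
def embed(text: str) -> set[str]:
    """Toy embedding: the set of lowercase words."""
    return set(text.lower().split())

def search(question: str, store: list[str], top: int = 2) -> list[str]:
    """Rank records by word overlap with the question (stand-in for
    vector similarity search)."""
    qv = embed(question)
    return sorted(store, key=lambda r: len(qv & embed(r)), reverse=True)[:top]

def generate(prompt: str) -> str:
    """Stub: a real chat completion call goes here."""
    return "Grounded answer based on:\n" + prompt

store = ["Our return window is 30 days.", "Shipping is free over $50."]
question = "How long is the return window?"

context = "\n".join(search(question, store))              # retrieval
prompt = f"Context:\n{context}\n\nQuestion: {question}"   # augmentation
reply = generate(prompt)                                  # generation
```

Replacing `generate` with an actual model call, and `search` with a real search plugin, turns this skeleton into the full pipeline without changing the flow.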
Relationship to Other Principles
RAG Chat Augmentation is the culmination of the entire Vector Store RAG Pipeline:
- Vector Store Data Model defines the records that are retrieved and used as context
- Vector Store Collection Setup establishes the collection being searched
- Embedding Generation converts the user's question into a search vector
- Data Ingestion populates the knowledge base that RAG draws from
- Vector Similarity Search retrieves the relevant records
- Metadata Filtering optionally narrows the search to specific domains
Implementation: Microsoft_Semantic_kernel_VectorStoreTextSearch_RAG