Principle: Microsoft Semantic Kernel RAG Chat Augmentation
Overview
The RAG Chat Augmentation principle describes the final stage of the Vector Store RAG Pipeline: taking retrieved search results and using them to augment the context provided to a large language model (LLM) during chat completion. RAG (Retrieval-Augmented Generation) bridges the gap between an LLM's static training data and an organization's dynamic, proprietary knowledge.
Rather than relying solely on what the LLM "knows" from training, RAG injects relevant, up-to-date context retrieved from a vector store directly into the prompt, enabling the model to generate accurate, grounded responses.
Motivation
Large language models have two fundamental limitations that RAG addresses:
- Knowledge cutoff: LLMs are trained on data up to a specific date and have no awareness of events, documents, or information created after that cutoff
- Hallucination: When asked about topics outside their training data (or when confident but wrong), LLMs may generate plausible-sounding but incorrect information
RAG mitigates both issues by:
- Retrieving relevant information from a vector store at query time
- Augmenting the LLM prompt with this retrieved context
- Generating a response that is grounded in the provided context
The result is an LLM that can answer questions about private, proprietary, or recent data while maintaining the fluency and reasoning capabilities of the base model.
Core Concepts
The RAG Pipeline
The complete RAG pipeline consists of three phases that correspond to the other principles in this workflow:
- Retrieval: The user's question is embedded and used to search the vector store for relevant records (see Vector Similarity Search)
- Augmentation: The retrieved records are formatted and injected into the LLM prompt as contextual information
- Generation: The LLM generates a response that draws on both its training data and the injected context
The augmentation phase is the bridge between retrieval and generation.
Prompt Template Integration
Semantic Kernel integrates RAG into chat completion through its prompt template system. Retrieved search results are injected into a template that instructs the LLM to use the provided context when answering. The Handlebars template syntax provides the mechanism:
- The template defines a placeholder where search results are inserted
- At runtime, the search plugin retrieves relevant records based on the user's question
- The template engine renders the results into the prompt
- The complete prompt (with context) is sent to the LLM
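The render step can be illustrated with a minimal stand-in for the template engine. The `{{context}}` and `{{question}}` placeholder names and the `render_prompt` helper below are illustrative assumptions, not Semantic Kernel's actual Handlebars API:

```python
# Minimal stand-in for the prompt-template render step (illustrative only;
# not Semantic Kernel's actual Handlebars engine or placeholder names).
TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{{context}}\n"
    "Question: {{question}}"
)

def render_prompt(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with its value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template

# Retrieved records are joined and inserted where the placeholder sits.
results = ["Record 1 text.", "Record 2 text."]
prompt = render_prompt(TEMPLATE, {
    "context": "\n".join(results),
    "question": "What does record 1 say?",
})
```

The complete rendered string, context included, is what gets sent to the chat completion model.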
The Search Plugin Pattern
Rather than manually embedding queries and calling search APIs, Semantic Kernel encapsulates the entire search pipeline into a kernel plugin. This plugin:
- Accepts a text query (the user's question)
- Internally generates an embedding for the query
- Executes a vector similarity search against the configured collection
- Returns the text content of the matching records
Because search is wrapped as a plugin, it becomes usable within prompt templates, function calling, and other Semantic Kernel orchestration patterns.
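The four responsibilities above can be sketched as a single class. This is a toy: the bag-of-words embedding and in-memory collection stand in for a real embedding model and vector store, and `SearchPlugin` is a hypothetical name, not Semantic Kernel's VectorStoreTextSearch type:

```python
import math

class SearchPlugin:
    """Toy search plugin: accepts a text query, embeds it internally,
    runs a cosine-similarity search over an in-memory collection, and
    returns the text of the best-matching records."""

    def __init__(self, records: list[str]):
        self.records = records
        self.vectors = [self._embed(r) for r in records]

    def _embed(self, text: str) -> dict[str, float]:
        """Bag-of-words stand-in for a real embedding model."""
        vec: dict[str, float] = {}
        for word in text.lower().split():
            vec[word] = vec.get(word, 0.0) + 1.0
        return vec

    @staticmethod
    def _cosine(a: dict, b: dict) -> float:
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query: str, top: int = 2) -> list[str]:
        qvec = self._embed(query)
        scored = sorted(
            zip(self.records, self.vectors),
            key=lambda rv: self._cosine(qvec, rv[1]),
            reverse=True,
        )
        return [record for record, _ in scored[:top]]
```

The caller only ever passes text in and gets text out; the embedding and vector search stay internal, which is what makes the plugin drop-in usable from a prompt template.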
Context Window Management
The amount of context injected into the prompt must be carefully managed:
- Too little context: The LLM lacks sufficient information to answer accurately
- Too much context: The prompt exceeds the LLM's context window, or the model becomes confused by irrelevant information
- Optimal context: A focused set of the most relevant records that directly addresses the question
The top parameter in vector search (the maximum number of records returned) and the template design both influence how much context reaches the LLM.
Design Principles
Separation of Retrieval and Generation
The retrieval and generation steps are decoupled. The search plugin handles retrieval independently of the chat completion model. This means:
- Different embedding models and chat models can be used (as long as the embedding model matches what was used at ingestion)
- The retrieval logic can be tested independently of the LLM
- The same retrieved context can be used with different prompt templates or models
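The decoupling can be made concrete by treating retrieval and generation as independently injected functions. The `answer` helper and the stub callables here are hypothetical, sketching only the separation itself:

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], list[str]],
           generate: Callable[[str], str]) -> str:
    """Retrieval and generation are passed in separately, so either
    side can be swapped or tested on its own."""
    context = "\n".join(retrieve(question))
    return generate(f"Context:\n{context}\nQuestion: {question}")

# Retrieval can be exercised with no LLM in the loop at all:
fake_retrieve = lambda q: ["doc about " + q]
assert fake_retrieve("pricing") == ["doc about pricing"]

# And generation can be stubbed to test the wiring:
echo_generate = lambda prompt: prompt
result = answer("pricing", fake_retrieve, echo_generate)
```

Swapping `echo_generate` for a call to a different chat model changes nothing on the retrieval side, which is the point of the separation.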
Grounded Generation
The prompt template explicitly instructs the LLM to base its response on the provided context. This grounding instruction is critical for:
- Reducing hallucination by constraining the model to provided facts
- Enabling citation of sources when records include source metadata
- Allowing the model to say "I don't have enough information" when the context is insufficient
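A grounding instruction is ultimately just text prepended to the context and question. The exact wording below is an illustrative assumption, not Semantic Kernel's built-in template:

```python
# Hypothetical grounding instruction; real templates vary in wording.
GROUNDING_INSTRUCTION = (
    "Answer using ONLY the context below. "
    "If the context does not contain the answer, reply: "
    "I don't have enough information."
)

def build_grounded_prompt(context: str, question: str) -> str:
    """Assemble instruction, retrieved context, and question into one prompt."""
    return f"{GROUNDING_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("Widgets cost $5.", "What do widgets cost?")
```

Including an explicit fallback phrase gives the model a sanctioned way out when retrieval returns nothing useful, instead of inviting a guess.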
Plugin Composability
Because RAG search is exposed as a kernel plugin, it composes with other Semantic Kernel capabilities:
- Planners can decide when to invoke the search plugin based on the user's intent
- Function calling allows the LLM itself to decide when to retrieve context
- Multiple search plugins can be registered for different knowledge domains
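Registering multiple domain-specific plugins can be sketched as a simple lookup table; in practice a planner or the model's function-calling step would pick the plugin, and the names here are invented for illustration:

```python
# Hypothetical registry of per-domain search plugins. Each entry stands
# in for a full search pipeline over that domain's collection.
plugins = {
    "hr": lambda q: [f"HR policy matching '{q}'"],
    "engineering": lambda q: [f"Design doc matching '{q}'"],
}

def route(domain: str, question: str) -> list[str]:
    """Dispatch the question to the search plugin for the given domain."""
    return plugins[domain](question)

hits = route("hr", "vacation policy")
```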
The Complete RAG Flow
The end-to-end flow in a chat application:
- User sends a question
- The application creates a prompt from a template that includes a search plugin invocation
- The search plugin embeds the question and searches the vector store
- Retrieved records are injected into the prompt as context
- The augmented prompt is sent to the chat completion model
- The LLM generates a response grounded in the retrieved context
- The response is returned to the user
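The steps above can be wired together in a compact sketch. The word-overlap search and the `generate` stub are toy stand-ins for the real embedding search and chat completion call:

```python
# End-to-end sketch of the flow above, with a stubbed chat model.
def embed(text: str) -> set[str]:
    """Toy embedding: the set of lowercase words."""
    return set(text.lower().split())

def search(question: str, store: list[str], top: int = 2) -> list[str]:
    """Rank records by word overlap with the question (stand-in for
    vector similarity search)."""
    qv = embed(question)
    return sorted(store, key=lambda r: len(qv & embed(r)), reverse=True)[:top]

def generate(prompt: str) -> str:
    """Stub: a real chat completion call goes here."""
    return "Grounded answer based on:\n" + prompt

store = ["Our return window is 30 days.", "Shipping is free over $50."]
question = "How long is the return window?"

context = "\n".join(search(question, store))              # retrieval
prompt = f"Context:\n{context}\n\nQuestion: {question}"   # augmentation
reply = generate(prompt)                                  # generation
```

Replacing `generate` with an actual model call, and `search` with a real search plugin, turns this skeleton into the full pipeline without changing the flow.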
Relationship to Other Principles
RAG Chat Augmentation is the culmination of the entire Vector Store RAG Pipeline:
- Vector Store Data Model defines the records that are retrieved and used as context
- Vector Store Collection Setup establishes the collection being searched
- Embedding Generation converts the user's question into a search vector
- Data Ingestion populates the knowledge base that RAG draws from
- Vector Similarity Search retrieves the relevant records
- Metadata Filtering optionally narrows the search to specific domains
Implementation: Microsoft_Semantic_kernel_VectorStoreTextSearch_RAG