Principle: NeuML txtai RAG Query Execution
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, RAG |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
RAG query execution is the end-to-end process of taking a user question, retrieving relevant context from the knowledge base, assembling a prompt, generating an answer via a language model, and formatting the output.
Description
Once a RAG pipeline has been configured with a retrieval backend and a generative model, the execution phase handles the actual question-answering flow. This involves a multi-step orchestration: the question is used as a search query against the embeddings index, the top-scoring passages are assembled into a context string, the context and question are merged into a prompt using the configured template, the prompt is sent to the language model, and the raw model output is formatted into the desired output structure.
The execution phase supports several input formats. A single string question is the simplest case. A list of questions enables batch processing. Structured inputs (tuples or dictionaries with name, query, question, and snippet fields) allow advanced use cases where the search query differs from the displayed question, or where a specific identifier needs to be propagated through to the output.
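The input handling described above can be sketched as a small normalization step. The helper name `normalize_inputs` and the exact fallback rules are illustrative assumptions, not the library's actual implementation:

```python
def normalize_inputs(inputs):
    """Normalize RAG inputs to (name, query, question, snippet) tuples.

    Accepts a single question string, a list of strings, or structured
    dicts with name/query/question/snippet fields (hypothetical helper).
    """
    if isinstance(inputs, (str, dict)):
        inputs = [inputs]

    normalized = []
    for position, item in enumerate(inputs):
        if isinstance(item, str):
            # Plain question: name defaults to position, query mirrors question
            normalized.append((str(position), item, item, False))
        else:
            query = item.get("query")
            normalized.append((
                item.get("name", str(position)),
                query,
                item.get("question", query),  # fall back to the search query
                item.get("snippet", False),
            ))
    return normalized
```

This shape lets downstream stages treat every input uniformly, whether the caller passed one question or a structured batch.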
Output formatting is equally flexible. The default format returns (name, answer) tuples. A flattened format returns plain answer strings. A reference format returns (name, answer, reference) tuples where the reference identifies which context passage most directly supports the answer. This flexibility allows RAG execution to serve different application needs -- from simple chatbots that need only answer text, to audit-oriented systems that require full provenance tracking.
Usage
Execute RAG queries when you need to:
- Answer one or more questions using knowledge from an embeddings index.
- Perform batch question-answering over a list of queries.
- Retrieve answers with provenance references back to source documents.
- Override retrieved context with explicit text passages for controlled generation.
Theoretical Basis
RAG query execution can be decomposed into four sequential stages:
FUNCTION execute_rag(question, index, model, template, top_k, min_score, separator, output_mode):
    # Stage 1: Retrieve
    candidates = index.search(question, top_k)
    context_passages = [c.text FOR c IN candidates IF c.score >= min_score]
    # Stage 2: Assemble context
    context = separator.join(context_passages)
    # Stage 3: Generate
    prompt = template.format(question=question, context=context)
    raw_answer = model.generate(prompt)
    # Stage 4: Format output
    RETURN format_output(raw_answer, output_mode)
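The pseudocode above can be made concrete with a minimal runnable sketch. The stub index and stub generator below are stand-ins invented for illustration; a real deployment would use an embeddings index and a language model:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    uid: int
    text: str
    score: float

class StubIndex:
    """Toy index: scores passages by word overlap with the query."""
    def __init__(self, passages):
        self.passages = passages

    def search(self, query, limit):
        qwords = set(query.lower().split())
        hits = [Hit(i, p, len(qwords & set(p.lower().split())) / len(qwords))
                for i, p in enumerate(self.passages)]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:limit]

def stub_model(prompt):
    """Stand-in generator: echoes the first context line as the answer."""
    context = prompt.split("context:\n", 1)[1]
    return context.split("\n")[0]

def execute_rag(question, index, model, template, topk=3, minscore=0.1,
                separator="\n"):
    # Stage 1: retrieve candidates, drop low-confidence matches
    candidates = index.search(question, topk)
    passages = [c.text for c in candidates if c.score >= minscore]
    # Stage 2: assemble context with the configured separator
    context = separator.join(passages)
    # Stage 3: merge question and context into a prompt, then generate
    prompt = template.format(question=question, context=context)
    answer = model(prompt)
    # Stage 4: default output mode, a (name, answer) tuple
    return (question, answer)
```

Swapping `StubIndex` for a vector index and `stub_model` for an LLM call leaves the four-stage control flow unchanged.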
Stage 1 -- Retrieval: The question (or a derived query) is used to search the embeddings index. The search returns a ranked list of (id, text, score) triples. A minimum score threshold filters out low-confidence matches, and a minimum token count filter removes trivially short passages.
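The two retrieval filters can be sketched as one pass over the candidate triples. The function name and thresholds here are illustrative defaults, not the library's:

```python
def filter_candidates(candidates, min_score=0.3, min_tokens=5):
    """Keep (id, text, score) triples above the score threshold
    that are also long enough to carry useful context."""
    return [
        (uid, text, score)
        for uid, text, score in candidates
        if score >= min_score and len(text.split()) >= min_tokens
    ]
```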
Stage 2 -- Context Assembly: The top-k passages are concatenated into a single context string using a configurable separator. When explicit texts are provided (bypassing the index), they are scored for relevance, the best passages are selected, and the selection is then restored to its original document sequence before joining; index search results, by contrast, are kept in score-ranked order.
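The explicit-text path can be sketched in a few lines: select by score, then restore document order. This is an assumed reading of the behavior described above, with hypothetical names:

```python
def assemble_context(texts, scores, topk=3, separator="\n"):
    """Pick the topk passages by relevance score, then join them
    in their original document order (explicit-text mode)."""
    ranked = sorted(range(len(texts)),
                    key=lambda i: scores[i], reverse=True)[:topk]
    return separator.join(texts[i] for i in sorted(ranked))
```

Preserving document order matters because passages often read as a narrative; score order alone can scramble that flow.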
Stage 3 -- Generation: The template merges the question and context into a prompt. If a system prompt is configured, the input is formatted as a multi-turn message structure with system and user roles. The prompt is then passed to the language model for generation.
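The prompt-versus-messages branch can be sketched as follows; the exact message schema is an assumption based on the common system/user role convention:

```python
def build_input(question, context, template, system=None):
    """Merge question and context via the template; wrap in a chat
    message structure when a system prompt is configured."""
    prompt = template.format(question=question, context=context)
    if system:
        # Multi-turn chat format with system and user roles (assumed shape)
        return [{"role": "system", "content": system},
                {"role": "user", "content": prompt}]
    return prompt
```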
Stage 4 -- Output Formatting: The raw model output is wrapped into the configured format. The three supported modes each serve different needs:
- Default: returns (name, answer) -- suitable for most applications.
- Flatten: returns answer strings only -- convenient for programmatic consumption.
- Reference: returns (name, answer, reference_id) -- enables source attribution by identifying which context passage best matches the generated answer.
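The three output modes can be sketched in one dispatcher. The word-overlap heuristic for reference selection is a deliberate simplification; a real system would use vector similarity to match the answer back to a passage:

```python
def format_output(name, answer, context_passages, mode="default"):
    """Wrap a raw answer in one of three output modes (illustrative)."""
    if mode == "flatten":
        return answer
    if mode == "reference":
        # Crude provenance: pick the passage sharing the most words
        # with the answer (assumption; not the library's scoring)
        answer_words = set(answer.lower().split())
        ref = max(
            range(len(context_passages)),
            key=lambda i: len(answer_words
                             & set(context_passages[i].lower().split())),
        )
        return (name, answer, ref)
    return (name, answer)
```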
For batch execution, all four stages operate over lists, enabling efficient batched search and batched generation across multiple questions simultaneously.
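A batch sketch, assuming hypothetical batched `search` and `generate` interfaces that each accept and return lists:

```python
def execute_rag_batch(questions, search, generate, template, separator="\n"):
    """Run all four stages over lists.

    search:   list[str] -> list[list[str]]  (batched retrieval, assumed)
    generate: list[str] -> list[str]        (batched generation, assumed)
    """
    contexts = [separator.join(passages) for passages in search(questions)]
    prompts = [template.format(question=q, context=c)
               for q, c in zip(questions, contexts)]
    return list(zip(questions, generate(prompts)))
```

Keeping every stage list-shaped is what allows a single index query and a single model call to serve many questions at once.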