Principle:Run llama Llama index Query Execution
| Knowledge Sources | |
|---|---|
| Domains | RAG, LLM_Integration |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
The execution step that runs a natural language query through the Retrieval-Augmented Generation (RAG) pipeline, retrieving relevant context from the index and synthesizing a grounded response via an LLM.
Description
Query execution is the culmination of the RAG pipeline. When a user submits a question, the query engine orchestrates a two-phase process: retrieve and synthesize. In the retrieval phase, the engine's retriever searches the index for the most relevant nodes (text chunks with metadata and embeddings). In the synthesis phase, the retrieved nodes are passed as context to an LLM along with the original query, producing a response that is grounded in the indexed knowledge.
The principle follows the template method pattern: BaseQueryEngine defines the overall query workflow (preprocess, retrieve, postprocess, synthesize), while concrete implementations such as RetrieverQueryEngine fill in each step. The result is a Response object that bundles the generated text with provenance information, namely the source_nodes that contributed to the answer, which enables transparency and auditability.
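The template method structure described above can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: ToyBaseQueryEngine, KeywordQueryEngine, ToyNode, and ToyResponse are hypothetical names, and keyword overlap replaces embedding-based retrieval and the LLM call so the example runs without dependencies.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    score: float


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


def tokenize(text):
    # Lowercase word tokens with punctuation stripped.
    return set(re.findall(r"\w+", text.lower()))


class ToyBaseQueryEngine(ABC):
    """The base class fixes the order of the steps; subclasses supply them."""

    def query(self, query_str):
        query_str = query_str.strip()             # preprocess
        nodes = self.retrieve(query_str)          # retrieve
        nodes = self.postprocess(nodes)           # postprocess
        return self.synthesize(query_str, nodes)  # synthesize

    @abstractmethod
    def retrieve(self, query_str): ...

    def postprocess(self, nodes):
        # Default hook: pass nodes through unchanged.
        return nodes

    @abstractmethod
    def synthesize(self, query_str, nodes): ...


class KeywordQueryEngine(ToyBaseQueryEngine):
    """Hypothetical concrete engine that scores chunks by keyword overlap."""

    def __init__(self, chunks):
        self._chunks = chunks

    def retrieve(self, query_str):
        terms = tokenize(query_str)
        scored = []
        for chunk in self._chunks:
            overlap = len(terms & tokenize(chunk))
            if overlap:
                scored.append(ToyNode(chunk, float(overlap)))
        return sorted(scored, key=lambda n: n.score, reverse=True)

    def synthesize(self, query_str, nodes):
        # A real engine would prompt an LLM here; this just joins the context.
        answer = " ".join(n.text for n in nodes) or "No relevant context found."
        return ToyResponse(answer, nodes)


engine = KeywordQueryEngine(
    ["Paris is the capital of France.", "Berlin is the capital of Germany."]
)
result = engine.query("  capital of France  ")
```

Because the workflow lives in the base class, a subclass can override only postprocess (for example, to drop low-scoring nodes) without touching retrieval or synthesis.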
Usage
Use this principle after creating a query engine (via as_query_engine() or manual construction). Key considerations:
- Synchronous vs. asynchronous: Use query() for synchronous execution and aquery() for async contexts (e.g., web servers, concurrent batch processing)
- Granular control: Call retrieve() and synthesize() separately when you need to inspect, filter, or augment nodes between retrieval and synthesis
- Source attribution: Access response.source_nodes to trace which document chunks informed the answer, enabling citation and fact-checking
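The three considerations above can be illustrated with a dependency-free stand-in. Only the method names (query, aquery, retrieve, synthesize) and the source_nodes attribute come from the text; ToyQueryEngine, its precomputed scores, and the 0.5 threshold are assumptions made for illustration, and a real engine would perform retrieval and LLM calls where this one returns canned values.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToyNodeWithScore:
    text: str
    score: float


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class ToyQueryEngine:
    """Stand-in engine; only the method names follow the text above."""

    def __init__(self, scored_chunks):
        self._nodes = [ToyNodeWithScore(t, s) for t, s in scored_chunks]

    def retrieve(self, query_str):
        # Toy retrieval: scores are precomputed, so return every chunk, best first.
        return sorted(self._nodes, key=lambda n: n.score, reverse=True)

    def synthesize(self, query_str, nodes):
        # A real engine would prompt an LLM with the query and the node text.
        return ToyResponse(f"{query_str} -> {len(nodes)} chunk(s) used", list(nodes))

    def query(self, query_str):
        return self.synthesize(query_str, self.retrieve(query_str))

    async def aquery(self, query_str):
        await asyncio.sleep(0)  # placeholder for awaiting network I/O
        return self.query(query_str)


engine = ToyQueryEngine([("chunk A", 0.91), ("chunk B", 0.42), ("chunk C", 0.78)])

# Synchronous execution.
sync_response = engine.query("q1")


# Asynchronous execution: run several queries concurrently.
async def run_batch(questions):
    return await asyncio.gather(*(engine.aquery(q) for q in questions))


batch = asyncio.run(run_batch(["q1", "q2", "q3"]))

# Granular control: inspect and filter nodes between the two phases.
nodes = engine.retrieve("q1")
kept = [n for n in nodes if n.score >= 0.5]
filtered_response = engine.synthesize("q1", kept)

# Source attribution: list the chunks that informed the answer.
sources = [(n.text, n.score) for n in filtered_response.source_nodes]
```

Splitting the call into retrieve and synthesize costs a few extra lines but makes the intermediate node list available for logging, filtering, or augmentation.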
Theoretical Basis
The query execution follows the two-phase RAG pipeline:
# Abstract algorithm (not real code)
# Phase 1: Retrieve relevant context
query_bundle = QueryBundle(query_str=user_question)
nodes = retriever.retrieve(query_bundle)
nodes = apply_postprocessors(nodes)
# Phase 2: Synthesize response using LLM
response = synthesizer.synthesize(
    query=query_bundle,
    nodes=nodes,
)
# response.response = generated text
# response.source_nodes = list of NodeWithScore used as context
The separation of retrieval and synthesis is fundamental to the RAG architecture. It enables each phase, along with the optional postprocessing step between them, to be independently configured, tested, and optimized:
| Phase | Responsibility | Tuning Levers |
|---|---|---|
| Retrieve | Find the most relevant nodes from the index for a given query | similarity_top_k, retriever type, embedding model, hybrid search weights |
| Postprocess | Filter, re-rank, or transform retrieved nodes before synthesis | similarity cutoff, re-rankers, metadata filters, keyword exclusions |
| Synthesize | Generate a coherent, grounded answer using the LLM and retrieved context | response_mode, LLM selection, temperature, prompt templates |
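To make two of the tuning levers concrete, the sketch below shows how a top-k limit and a similarity cutoff interact. It is a simplified model under stated assumptions: the two-dimensional vectors are made-up embeddings, and retrieve_top_k and apply_similarity_cutoff are hypothetical helper names, not library functions. Only the parameter name similarity_top_k is taken from the table.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, indexed_chunks, similarity_top_k=2):
    # Retrieve phase: score every chunk, keep the k highest-scoring ones.
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:similarity_top_k]


def apply_similarity_cutoff(scored_nodes, cutoff=0.7):
    # Postprocess phase: drop nodes whose score falls below the cutoff.
    return [(text, score) for text, score in scored_nodes if score >= cutoff]


# Made-up embeddings for illustration only.
indexed_chunks = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("return window", [0.8, 0.6]),
]
query_vec = [1.0, 0.0]

top_two = retrieve_top_k(query_vec, indexed_chunks, similarity_top_k=2)
kept = apply_similarity_cutoff(top_two, cutoff=0.9)
```

Here similarity_top_k bounds how many nodes reach the postprocessor, while the cutoff removes weak matches regardless of rank, so the two levers can be tuned independently.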
The object returned by query execution (typed as RESPONSE_TYPE) contains not just the answer text but also the full list of source_nodes, each a NodeWithScore carrying text, metadata, and a relevance score. This provenance chain is essential for building trustworthy RAG systems that can cite their sources.
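As a rough illustration of how source_nodes can drive citations, the sketch below renders one reference line per node. The data classes are simplified stand-ins for the response and node objects, format_citations is a hypothetical helper, and the "file_name" metadata key is an assumption about what the indexed documents carry.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeWithScore:
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


def format_citations(response, max_chars=40):
    # Build one numbered citation line per source node.
    lines = []
    for i, node in enumerate(response.source_nodes, start=1):
        # "file_name" is an assumed metadata key; fall back when it is absent.
        origin = node.metadata.get("file_name", "unknown source")
        snippet = node.text[:max_chars]
        lines.append(f"[{i}] {origin} (score={node.score:.2f}): {snippet}")
    return lines


response = ToyResponse(
    response="Refunds are issued within 14 days.",
    source_nodes=[
        ToyNodeWithScore(
            "Refunds are issued within 14 days of return.",
            0.88,
            {"file_name": "policy.md"},
        ),
        ToyNodeWithScore("Contact support to start a return.", 0.61),
    ],
)
citations = format_citations(response)
```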