Principle:Run llama Llama index Query Execution

Knowledge Sources	LlamaIndex Query Engine LlamaIndex
Domains	RAG, LLM_Integration
Last Updated	2026-02-11 00:00 GMT

Overview

The execution step that runs a natural language query through the Retrieval-Augmented Generation (RAG) pipeline, retrieving relevant context from the index and synthesizing a grounded response via an LLM.

Description

Query execution is the culmination of the RAG pipeline. When a user submits a question, the query engine orchestrates a two-phase process: retrieve and synthesize. In the retrieval phase, the engine's retriever searches the index for the most relevant nodes (text chunks with metadata and embeddings). In the synthesis phase, the retrieved nodes are passed as context to an LLM along with the original query, producing a response that is grounded in the indexed knowledge.

The principle follows the template method pattern: BaseQueryEngine defines the overall query workflow (preprocess, retrieve, postprocess, synthesize), while concrete implementations such as RetrieverQueryEngine fill in each step. The result is a Response object that bundles the generated text with provenance information (the source_nodes that contributed to the answer), which makes the answer transparent and auditable.
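The template method structure can be illustrated with a small stand-alone sketch. The classes below (ToyBaseQueryEngine, ToyKeywordQueryEngine, ToyNode, ToyResponse) are hypothetical stand-ins written for this page, not LlamaIndex classes. Retrieval is a simple word-overlap count and synthesis is string concatenation rather than an LLM call.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    score: float = 0.0


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class ToyBaseQueryEngine(ABC):
    """Hypothetical base class: fixes the order of the steps."""

    def query(self, query_str: str) -> ToyResponse:
        # The template method: subclasses leave this untouched.
        nodes = self.retrieve(query_str)
        nodes = self.postprocess(nodes)
        return self.synthesize(query_str, nodes)

    @abstractmethod
    def retrieve(self, query_str: str) -> list: ...

    def postprocess(self, nodes: list) -> list:
        # Default hook: pass nodes through unchanged.
        return nodes

    @abstractmethod
    def synthesize(self, query_str: str, nodes: list) -> ToyResponse: ...


class ToyKeywordQueryEngine(ToyBaseQueryEngine):
    def __init__(self, corpus: list):
        self.corpus = corpus

    def retrieve(self, query_str: str) -> list:
        # Score each text by the number of words shared with the query.
        terms = set(query_str.lower().split())
        scored = [
            ToyNode(text, float(len(terms & set(text.lower().split()))))
            for text in self.corpus
        ]
        return [n for n in scored if n.score > 0]

    def synthesize(self, query_str: str, nodes: list) -> ToyResponse:
        # Stand-in for an LLM call: join the retrieved context.
        answer = " ".join(n.text for n in nodes) or "No relevant context found."
        return ToyResponse(response=answer, source_nodes=nodes)


engine = ToyKeywordQueryEngine(
    ["Paris is the capital of France.", "Rust is a systems language."]
)
result = engine.query("capital of France")
```

The caller only ever invokes query(); swapping the retrieval or synthesis strategy means writing another subclass, with the step order left intact.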

Usage

Use this principle after creating a query engine (via as_query_engine() or manual construction). Key considerations:

Synchronous vs. asynchronous: Use query() for synchronous execution and aquery() for async contexts (e.g., web servers, concurrent batch processing)
Granular control: Call retrieve() and synthesize() separately when you need to inspect, filter, or augment nodes between retrieval and synthesis
Source attribution: Access response.source_nodes to trace which document chunks informed the answer, enabling citation and fact-checking
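The first two considerations can be sketched with a toy engine. ToyEngine and its data classes are hypothetical and only mirror the method names mentioned above; the real engine works with query bundles and an LLM, whereas this sketch uses plain strings and a formatted message. The 0.5 score threshold is an arbitrary illustrative value.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    score: float


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class ToyEngine:
    """Hypothetical engine exposing query/aquery/retrieve/synthesize."""

    def __init__(self, nodes):
        self._nodes = nodes

    def retrieve(self, query_str):
        # Stand-in retrieval: return every node, best score first.
        return sorted(self._nodes, key=lambda n: n.score, reverse=True)

    def synthesize(self, query_str, nodes):
        # Stand-in for an LLM call.
        text = f"{query_str} -> {len(nodes)} node(s)"
        return ToyResponse(response=text, source_nodes=list(nodes))

    def query(self, query_str):
        return self.synthesize(query_str, self.retrieve(query_str))

    async def aquery(self, query_str):
        await asyncio.sleep(0)  # yield control, as real I/O would
        return self.query(query_str)


async def run_batch(engine, questions):
    # Run all queries concurrently; results keep the input order.
    return await asyncio.gather(*(engine.aquery(q) for q in questions))


engine = ToyEngine([ToyNode("a", 0.9), ToyNode("b", 0.4), ToyNode("c", 0.7)])

# Asynchronous batch execution.
batch = asyncio.run(run_batch(engine, ["q1", "q2"]))

# Granular control: filter nodes between retrieval and synthesis.
retrieved = engine.retrieve("q3")
kept = [n for n in retrieved if n.score >= 0.5]
manual = engine.synthesize("q3", kept)
```

Splitting the call this way trades convenience for control: the filtering step sits in application code, so it can be logged or unit-tested independently of the engine.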

Theoretical Basis

Query execution follows the two-phase RAG pipeline, with an optional postprocessing step between the phases:

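# Abstract algorithm (not real code): user_question, retriever, synthesizer,
# and apply_postprocessors are placeholders for already-configured components.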
# Phase 1: Retrieve relevant context
query_bundle = QueryBundle(query_str=user_question)
nodes = retriever.retrieve(query_bundle)
nodes = apply_postprocessors(nodes)

# Phase 2: Synthesize response using LLM
response = synthesizer.synthesize(
    query=query_bundle,
    nodes=nodes,
)
# response.response = generated text
# response.source_nodes = list of NodeWithScore used as context

The separation of retrieval and synthesis is fundamental to the RAG architecture. It enables each phase to be independently configured, tested, and optimized:

Phase	Responsibility	Tuning Levers
Retrieve	Find the most relevant nodes from the index for a given query	similarity_top_k, retriever type, embedding model, hybrid search weights
Postprocess	Filter, re-rank, or transform retrieved nodes before synthesis	similarity cutoff, re-rankers, metadata filters, keyword exclusions
Synthesize	Generate a coherent, grounded answer using the LLM and retrieved context	response_mode, LLM selection, temperature, prompt templates
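Two of the levers in the table, a top-k limit at retrieval time and a similarity cutoff at postprocessing time, can be sketched in isolation. The function names retrieve_top_k and apply_similarity_cutoff, the two-dimensional vectors, and the 0.7 cutoff are all illustrative assumptions rather than library APIs or recommended values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, index, top_k):
    """Return the top_k (text, score) pairs by cosine similarity."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def apply_similarity_cutoff(scored_nodes, cutoff):
    """Drop nodes whose score is below the cutoff."""
    return [(text, score) for text, score in scored_nodes if score >= cutoff]


# Hand-written 2-d "embeddings", purely illustrative.
index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

retrieved = retrieve_top_k(query_vec, index, top_k=2)
filtered = apply_similarity_cutoff(retrieved, cutoff=0.7)
```

Here the top-k limit bounds how much context reaches the next stage, while the cutoff removes weak matches that made the top-k only because nothing better was available.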

The object returned by query execution (annotated as RESPONSE_TYPE, most commonly a Response) contains not just the answer text but also the full list of source_nodes (each a NodeWithScore with text, metadata, and relevance score). This provenance chain is essential for building trustworthy RAG systems that can cite their sources.
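As a rough illustration of how this provenance can be surfaced to users, the sketch below turns source nodes into numbered citation lines. ToySourceNode and ToyResponse are hypothetical stand-ins that only mirror the attributes described above (text, metadata, score), and the file_name metadata key is an assumption about what the ingestion step stored.

```python
from dataclasses import dataclass, field


@dataclass
class ToySourceNode:
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyResponse:
    response: str
    source_nodes: list = field(default_factory=list)


def format_citations(response, max_chars=40):
    """Build one citation line per source node, highest score first."""
    ranked = sorted(response.source_nodes, key=lambda n: n.score, reverse=True)
    lines = []
    for i, node in enumerate(ranked, start=1):
        source = node.metadata.get("file_name", "unknown")  # assumed key
        snippet = node.text[:max_chars]
        lines.append(f"[{i}] {source} (score={node.score:.2f}): {snippet}")
    return lines


resp = ToyResponse(
    response="The warranty lasts two years.",
    source_nodes=[
        ToySourceNode(
            "Returns are accepted within 30 days.", 0.61, {"file_name": "returns.md"}
        ),
        ToySourceNode(
            "The warranty period is two years.", 0.88, {"file_name": "warranty.md"}
        ),
    ],
)
citations = format_citations(resp)
```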

Related Pages

Implemented By

Implementation:Run_llama_Llama_index_Query_Engine_Query
