Principle:Run llama Llama index Query Execution

From Leeroopedia
Knowledge Sources
Domains RAG, LLM_Integration
Last Updated 2026-02-11 00:00 GMT

Overview

Query execution is the step that runs a natural-language query through the Retrieval-Augmented Generation (RAG) pipeline: it retrieves relevant context from the index and has an LLM synthesize a response grounded in that context.

Description

Query execution is the culmination of the RAG pipeline. When a user submits a question, the query engine orchestrates a two-phase process: retrieve and synthesize. In the retrieval phase, the engine's retriever searches the index for the most relevant nodes (text chunks with metadata and embeddings). In the synthesis phase, the retrieved nodes are passed as context to an LLM along with the original query, producing a response that is grounded in the indexed knowledge.

The principle follows the template method pattern: the BaseQueryEngine defines the overall query workflow (preprocess, retrieve, postprocess, synthesize), while concrete implementations like RetrieverQueryEngine fill in each step. The result is a Response object that bundles the generated text with provenance information (the source_nodes that contributed to the answer), which makes the answer transparent and auditable.
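The template method idea described above can be sketched without any library. The classes below (ToyBaseQueryEngine, KeywordQueryEngine, ToyNode, ToyResponse) are hypothetical stand-ins written for illustration, not the library's actual classes: the base class fixes the order of the steps, and the subclass supplies retrieval and synthesis. The "LLM" here is a plain string formatter.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a retrieved chunk with a relevance score."""
    text: str
    score: float


@dataclass
class ToyResponse:
    """Hypothetical stand-in bundling the answer text with its sources."""
    response: str
    source_nodes: list = field(default_factory=list)


class ToyBaseQueryEngine(ABC):
    """Fixes the workflow order; subclasses fill in the individual steps."""

    def query(self, query_str: str) -> ToyResponse:
        query_str = query_str.strip()              # preprocess
        nodes = self.retrieve(query_str)           # retrieve
        nodes = self.postprocess(nodes)            # postprocess
        return self.synthesize(query_str, nodes)   # synthesize

    @abstractmethod
    def retrieve(self, query_str: str) -> list: ...

    def postprocess(self, nodes: list) -> list:
        return nodes  # default: pass nodes through unchanged

    @abstractmethod
    def synthesize(self, query_str: str, nodes: list) -> ToyResponse: ...


class KeywordQueryEngine(ToyBaseQueryEngine):
    def __init__(self, chunks):
        self.chunks = chunks

    def retrieve(self, query_str):
        # Score each chunk by how many query words it contains.
        words = set(query_str.lower().split())
        scored = [ToyNode(c, float(len(words & set(c.lower().split()))))
                  for c in self.chunks]
        hits = [n for n in scored if n.score > 0]
        return sorted(hits, key=lambda n: n.score, reverse=True)

    def synthesize(self, query_str, nodes):
        # Stand-in for the LLM call: echo the question and its context.
        context = " ".join(n.text for n in nodes)
        return ToyResponse(f"Q: {query_str} | context: {context}", nodes)


engine = KeywordQueryEngine(["paris is the capital of france",
                             "bananas are yellow"])
result = engine.query("  capital of france  ")
print(result.response)
print(result.source_nodes)
```

Because query() is defined once in the base class, swapping in a different retrieval or synthesis strategy never changes the order in which the steps run.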

Usage

Use this principle after creating a query engine (via as_query_engine() or manual construction). Key considerations:

  • Synchronous vs. asynchronous: Use query() for synchronous execution and aquery() for async contexts (e.g., web servers, concurrent batch processing)
  • Granular control: Call retrieve() and synthesize() separately when you need to inspect, filter, or augment nodes between retrieval and synthesis
  • Source attribution: Access response.source_nodes to trace which document chunks informed the answer, enabling citation and fact-checking
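The three considerations above can be illustrated with a dependency-free sketch. ToyQueryEngine is a hypothetical stand-in that only mirrors the method names mentioned in this section (query, aquery, retrieve, synthesize); its scoring is simple word overlap and its "synthesis" is string joining, so it shows the call patterns rather than real retrieval or LLM behavior.

```python
import asyncio


class ToyQueryEngine:
    """Hypothetical stand-in exposing query/aquery/retrieve/synthesize."""

    def __init__(self, chunks):
        self.chunks = chunks

    def retrieve(self, query_str):
        # Score each chunk by the number of words it shares with the query.
        words = set(query_str.lower().split())
        scored = [(chunk, len(words & set(chunk.lower().split())))
                  for chunk in self.chunks]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    def synthesize(self, query_str, nodes):
        # A real engine would prompt an LLM here; this just joins the context.
        return {"response": " / ".join(text for text, _ in nodes),
                "source_nodes": nodes}

    def query(self, query_str):
        return self.synthesize(query_str, self.retrieve(query_str))

    async def aquery(self, query_str):
        await asyncio.sleep(0)  # placeholder for awaiting network I/O
        return self.query(query_str)


engine = ToyQueryEngine(["cats purr when content",
                         "dogs bark at strangers",
                         "cats and dogs are pets"])

# Synchronous execution: one blocking call.
sync_answer = engine.query("why do cats purr")


# Asynchronous execution: several questions scheduled concurrently.
async def run_batch(questions):
    return await asyncio.gather(*(engine.aquery(q) for q in questions))

batch_answers = asyncio.run(run_batch(["why do cats purr",
                                       "why do dogs bark"]))

# Granular control: inspect and filter nodes between the two phases.
question = "why do cats purr"
nodes = engine.retrieve(question)
relevant = [(text, score) for text, score in nodes if score >= 2]
filtered_answer = engine.synthesize(question, relevant)

# Source attribution: the sources travel with the answer.
print(filtered_answer["response"], filtered_answer["source_nodes"])
```

Splitting retrieve() from synthesize() costs a few more lines than a single query() call, but it is the only point at which weak nodes can be dropped before they reach the synthesis step.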

Theoretical Basis

The query execution follows the two-phase RAG pipeline:

# Abstract algorithm (not real code)
# Phase 1: Retrieve relevant context
query_bundle = QueryBundle(query_str=user_question)
nodes = retriever.retrieve(query_bundle)
nodes = apply_postprocessors(nodes)

# Phase 2: Synthesize response using LLM
response = synthesizer.synthesize(
    query=query_bundle,
    nodes=nodes,
)
# response.response = generated text
# response.source_nodes = list of NodeWithScore used as context

The separation of retrieval and synthesis is fundamental to the RAG architecture. It enables each phase to be independently configured, tested, and optimized:

Phase | Responsibility | Tuning Levers
Retrieve | Find the most relevant nodes from the index for a given query | similarity_top_k, retriever type, embedding model, hybrid search weights
Postprocess | Filter, re-rank, or transform retrieved nodes before synthesis | similarity cutoff, re-rankers, metadata filters, keyword exclusions
Synthesize | Generate a coherent, grounded answer using the LLM and retrieved context | response_mode, LLM selection, temperature, prompt templates
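To make two of these levers concrete, the sketch below ranks a tiny in-memory index by cosine similarity, keeps the top k candidates, and then applies a similarity cutoff as a postprocessing step. The helper functions, the two-dimensional vectors, and the sample texts are all made up for illustration; only the parameter names similarity_top_k and similarity_cutoff echo the levers named above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, similarity_top_k=2):
    """Retrieve phase: rank every entry, keep the k best."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:similarity_top_k]


def postprocess(nodes, similarity_cutoff=0.5):
    """Postprocess phase: drop nodes scoring below the cutoff."""
    return [(text, score) for text, score in nodes if score >= similarity_cutoff]


# Toy index of (text, embedding) pairs; the embeddings are invented.
index = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.8, 0.6]),
]
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, index, similarity_top_k=3)
kept = postprocess(candidates, similarity_cutoff=0.5)
print([text for text, _ in kept])
```

Raising similarity_top_k widens the candidate pool, while the cutoff trims it again by quality, so the two levers can be tuned independently of the synthesis step.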

The response returned by query execution (typed as RESPONSE_TYPE) contains not just the answer text but also the full list of source_nodes, each a NodeWithScore carrying text, metadata, and a relevance score. This provenance chain is essential for building trustworthy RAG systems that can cite their sources.
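As a rough illustration of how that provenance might be surfaced to a reader, the sketch below turns a list of source nodes into numbered citation lines. SourceNode and Answer are hypothetical dataclasses that only imitate the fields described above (text, metadata, score), and the file_name metadata key is an assumption made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceNode:
    """Hypothetical stand-in for a scored source node."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class Answer:
    """Hypothetical stand-in for a response with attached sources."""
    response: str
    source_nodes: list


def format_citations(answer):
    """Build one citation line per source node, in ranked order."""
    lines = []
    for i, node in enumerate(answer.source_nodes, start=1):
        # "file_name" is an assumed metadata key for this example.
        origin = node.metadata.get("file_name", "unknown source")
        lines.append(f"[{i}] {origin} (score={node.score:.2f}): {node.text[:40]}")
    return lines


answer = Answer(
    response="Refunds take 5 business days.",
    source_nodes=[
        SourceNode("Refunds are processed within 5 business days.", 0.91,
                   {"file_name": "policy.md"}),
        SourceNode("Contact support for help.", 0.42),
    ],
)
citations = format_citations(answer)
print(answer.response)
print("\n".join(citations))
```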

Related Pages

Implemented By
