Principle:Run llama Llama index Query Engine Creation

Knowledge Sources	LlamaIndex Query Engine LlamaIndex
Domains	RAG, LLM_Integration
Last Updated	2026-02-11 00:00 GMT

Overview

A composition step that creates a unified query interface over indexed data by combining a retriever with a response synthesizer into a single callable object.

Description

Query engine creation is the pivotal step that transforms an index from a passive data structure into an active question-answering system. The principle follows the facade pattern: a single as_query_engine() call on any index composes two distinct subsystems -- retrieval (finding relevant nodes) and response synthesis (using an LLM to generate an answer from those nodes) -- into one cohesive interface. Under the hood, the index constructs a RetrieverQueryEngine that orchestrates the full retrieve-then-synthesize pipeline.
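The composition described above can be sketched in plain Python. This is a toy model, not LlamaIndex's implementation: the class names ToyIndex, ToyRetriever, ToySynthesizer, and ToyQueryEngine, the word-overlap ranking, and the stand-in "LLM" function are all assumptions made for illustration. Only the shape (one factory call that wires a retriever and a synthesizer into a single object with a query method) mirrors the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyNode:
    text: str


class ToyRetriever:
    """Ranks nodes by how many words they share with the query (toy scoring)."""

    def __init__(self, nodes: List[ToyNode], top_k: int) -> None:
        self.nodes = nodes
        self.top_k = top_k

    def retrieve(self, query: str) -> List[ToyNode]:
        words = set(query.lower().split())
        ranked = sorted(
            self.nodes,
            key=lambda n: len(words & set(n.text.lower().split())),
            reverse=True,
        )
        return ranked[: self.top_k]


class ToySynthesizer:
    """Builds a prompt from the retrieved nodes and hands it to an injected 'LLM'."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm

    def synthesize(self, query: str, nodes: List[ToyNode]) -> str:
        context = "\n".join(n.text for n in nodes)
        return self.llm(f"Context:\n{context}\n\nQuestion: {query}")


class ToyQueryEngine:
    """Facade: callers see only query(); retrieval and synthesis stay hidden."""

    def __init__(self, retriever: ToyRetriever, synthesizer: ToySynthesizer) -> None:
        self.retriever = retriever
        self.synthesizer = synthesizer

    def query(self, query: str) -> str:
        nodes = self.retriever.retrieve(query)
        return self.synthesizer.synthesize(query, nodes)


class ToyIndex:
    def __init__(self, texts: List[str]) -> None:
        self.nodes = [ToyNode(t) for t in texts]

    def as_query_engine(
        self, llm: Callable[[str], str], similarity_top_k: int = 2
    ) -> ToyQueryEngine:
        # All configuration is captured here, at creation time.
        retriever = ToyRetriever(self.nodes, similarity_top_k)
        return ToyQueryEngine(retriever, ToySynthesizer(llm))


def first_context_line_llm(prompt: str) -> str:
    # Stand-in for a real model: answers with the first line of context.
    return prompt.splitlines()[1]


index = ToyIndex(
    [
        "A retriever finds relevant nodes",
        "A synthesizer writes the answer",
        "An index stores nodes",
    ]
)
engine = index.as_query_engine(llm=first_context_line_llm, similarity_top_k=1)
print(engine.query("what is a retriever"))
```

The engine can be queried repeatedly after this point without passing the retriever, the model, or top_k again, which is the property the facade is meant to provide.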

The key design decision is that the query engine encapsulates all configuration at creation time. The response mode determines how the LLM processes retrieved context, the node postprocessors filter or re-rank nodes before synthesis, and optional templates control the prompts sent to the LLM. Once created, the query engine is ready for repeated use without reconfiguration.

Usage

Use this principle after building or loading an index and before executing queries. Select the appropriate configuration based on:

Response mode: Choose compact (default) for efficiency, refine for iterative multi-pass synthesis, or tree_summarize for hierarchical summarization of large result sets
Node postprocessors: Add re-rankers, keyword filters, or similarity cutoffs to improve retrieval quality before synthesis
Custom LLM: Override the default LLM at query engine creation time for different generation characteristics
Streaming: Enable streaming mode for real-time token-by-token output to the user
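To make the node postprocessor option more concrete, the following sketch models a postprocessing chain as a list of functions applied in order to scored nodes before synthesis. LlamaIndex provides its own postprocessor classes for this purpose; the names ScoredNode, similarity_cutoff, require_keyword, and run_postprocessors below are hypothetical stand-ins, and the scores are made-up sample values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredNode:
    text: str
    score: float


# A postprocessor takes the retrieved nodes and returns a filtered or reordered list.
Postprocessor = Callable[[List[ScoredNode]], List[ScoredNode]]


def similarity_cutoff(threshold: float) -> Postprocessor:
    """Drop nodes whose retrieval score is below the threshold."""

    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [n for n in nodes if n.score >= threshold]

    return apply


def require_keyword(keyword: str) -> Postprocessor:
    """Keep only nodes whose text mentions the keyword (case-insensitive)."""

    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [n for n in nodes if keyword.lower() in n.text.lower()]

    return apply


def run_postprocessors(
    nodes: List[ScoredNode], postprocessors: List[Postprocessor]
) -> List[ScoredNode]:
    # Each stage sees the output of the previous one, so order matters.
    for postprocess in postprocessors:
        nodes = postprocess(nodes)
    return nodes


retrieved = [
    ScoredNode("Query engines combine retrieval and synthesis.", 0.91),
    ScoredNode("Synthesis modes include compact and refine.", 0.78),
    ScoredNode("Unrelated release notes.", 0.42),
]

kept = run_postprocessors(
    retrieved, [similarity_cutoff(0.5), require_keyword("synthesis")]
)
for node in kept:
    print(f"{node.score:.2f}  {node.text}")
```

Placing the cheap score cutoff first means later, potentially more expensive stages (for example a re-ranker) operate on fewer nodes.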

Theoretical Basis

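Query engine creation applies builder-style configuration to a RAG pipeline: every option is supplied as a keyword argument in one call, and the index assembles the engine behind the facade described above.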

# Abstract algorithm (not real code)
query_engine = index.as_query_engine(
    llm=language_model,
    response_mode=synthesis_strategy,       # compact | refine | tree_summarize
    node_postprocessors=[reranker, keyword_filter],  # optional processing pipeline (placeholder names)
    similarity_top_k=retrieval_count,        # number of nodes to retrieve
    streaming=enable_streaming,              # token-by-token output
)
# query_engine is now a BaseQueryEngine ready for .query() calls

The three primary response modes represent different strategies for synthesizing answers from multiple retrieved context chunks:

Response Mode	Strategy	Best For
compact	Packs as many retrieved chunks as fit into each LLM call's context window, reducing the number of API calls	General-purpose queries with moderate context
refine	Iterates through each node sequentially, refining the answer with each additional piece of context	Detailed answers requiring thorough consideration of all evidence
tree_summarize	Recursively summarizes nodes in a bottom-up tree structure, then synthesizes from summaries	Large result sets where hierarchical summarization reduces noise

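Response mode selection affects answer quality, latency, and cost. compact minimizes the number of LLM calls, so it is usually the cheapest and fastest option. refine makes one call per retrieved chunk, so each chunk is considered individually at the price of more calls and higher latency. tree_summarize balances breadth with coherence for large retrieval sets by summarizing groups of chunks before producing the final answer.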
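The cost difference between the modes can be approximated with a small call-count model. This is a simplified sketch under stated assumptions, not a measurement of LlamaIndex: it assumes a fixed number of chunks fits into one LLM call (the hypothetical parameter chunks_per_call), that each intermediate summary occupies the same space as one chunk, and that every call costs the same. Real behavior depends on token counts and prompt sizes.

```python
import math


def llm_calls_refine(num_chunks: int) -> int:
    """refine: one LLM call per retrieved chunk."""
    return max(num_chunks, 0)


def llm_calls_compact(num_chunks: int, chunks_per_call: int) -> int:
    """compact: chunks are packed together, one call per packed group."""
    if chunks_per_call < 1:
        raise ValueError("chunks_per_call must be at least 1")
    if num_chunks <= 0:
        return 0
    return math.ceil(num_chunks / chunks_per_call)


def llm_calls_tree_summarize(num_chunks: int, chunks_per_call: int) -> int:
    """tree_summarize: summarize groups level by level until one answer remains."""
    if chunks_per_call < 2:
        raise ValueError("chunks_per_call must be at least 2 for the tree to shrink")
    if num_chunks <= 0:
        return 0
    calls = 0
    remaining = num_chunks
    while True:
        groups = math.ceil(remaining / chunks_per_call)
        calls += groups  # one summarization call per group at this level
        if groups == 1:
            return calls  # the single remaining summary is the final answer
        remaining = groups  # the summaries become the next level's inputs


for n in (3, 10, 40):
    print(
        n,
        llm_calls_compact(n, chunks_per_call=4),
        llm_calls_refine(n),
        llm_calls_tree_summarize(n, chunks_per_call=4),
    )
```

Under these assumptions, compact and tree_summarize need the same number of calls when everything fits into one prompt, and the gap to refine widens as the number of retrieved chunks grows.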

Related Pages

Implemented By

Implementation:Run_llama_Llama_index_Index_As_Query_Engine
