Principle:Run llama Llama index Query Engine Creation
| Knowledge Sources | |
|---|---|
| Domains | RAG, LLM_Integration |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
A composition step that creates a unified query interface over indexed data by combining a retriever with a response synthesizer into a single callable object.
Description
Query engine creation is the pivotal step that transforms an index from a passive data structure into an active question-answering system. The principle follows the facade pattern: a single as_query_engine() call on any index composes two distinct subsystems -- retrieval (finding relevant nodes) and response synthesis (using an LLM to generate an answer from those nodes) -- into one cohesive interface. Under the hood, the index constructs a RetrieverQueryEngine that orchestrates the full retrieve-then-synthesize pipeline.
The key design decision is that the query engine encapsulates all configuration at creation time. The response mode determines how the LLM processes retrieved context, the node postprocessors filter or re-rank nodes before synthesis, and optional templates control the prompts sent to the LLM. Once created, the query engine is ready for repeated use without reconfiguration.
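The composition described above can be sketched in plain Python. This is a toy model, not LlamaIndex's implementation: `Node`, `KeywordRetriever`, `Synthesizer`, `ToyQueryEngine`, and `echo_llm` are hypothetical names introduced only for illustration, and the "LLM" is a stub that reports how many context chunks it received.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    text: str
    score: float = 0.0


class KeywordRetriever:
    """Toy retriever: scores nodes by how many query words they contain."""

    def __init__(self, nodes: List[Node], top_k: int = 2):
        self.nodes = nodes
        self.top_k = top_k

    def retrieve(self, query: str) -> List[Node]:
        words = set(query.lower().split())
        scored = [
            Node(n.text, float(len(words & set(n.text.lower().split()))))
            for n in self.nodes
        ]
        scored.sort(key=lambda n: n.score, reverse=True)
        return scored[: self.top_k]


class Synthesizer:
    """Toy synthesizer: hands the query and retrieved context to an LLM callable."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm

    def synthesize(self, query: str, nodes: List[Node]) -> str:
        context = "\n".join(n.text for n in nodes)
        return self.llm(f"Context:\n{context}\n\nQuestion: {query}")


class ToyQueryEngine:
    """Facade: configuration is fixed at creation; query() runs the pipeline."""

    def __init__(self, retriever: KeywordRetriever, synthesizer: Synthesizer):
        self._retriever = retriever
        self._synthesizer = synthesizer

    def query(self, question: str) -> str:
        nodes = self._retriever.retrieve(question)          # step 1: retrieve
        return self._synthesizer.synthesize(question, nodes)  # step 2: synthesize


def echo_llm(prompt: str) -> str:
    # Stand-in for a real LLM: reports how many context lines it received.
    context = prompt.split("\n\nQuestion:")[0]
    return f"answer from {len(context.splitlines()) - 1} chunk(s)"


nodes = [Node("cats purr"), Node("dogs bark"), Node("cats chase mice")]
engine = ToyQueryEngine(KeywordRetriever(nodes, top_k=2), Synthesizer(echo_llm))
print(engine.query("what do cats do"))
```

The caller only ever touches `query()`. The retrieval depth and the LLM were chosen when the engine was built, which mirrors the "configure once, query repeatedly" design described in this section.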
Usage
Use this principle after building or loading an index and before executing queries. Select the appropriate configuration based on:
- Response mode: Choose compact (default) for efficiency, refine for iterative multi-pass synthesis, or tree_summarize for hierarchical summarization of large result sets
- Node postprocessors: Add re-rankers, keyword filters, or similarity cutoffs to improve retrieval quality before synthesis
- Custom LLM: Override the default LLM at query engine creation time for different generation characteristics
- Streaming: Enable streaming mode for real-time token-by-token output to the user
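To make the node postprocessor option concrete, the following is a minimal sketch of a postprocessing chain that runs between retrieval and synthesis. It does not use the library's postprocessor classes: `similarity_cutoff`, `require_keyword`, and `run_postprocessors` are hypothetical helpers, and nodes are modeled as plain `(text, score)` tuples.

```python
from typing import Callable, List, Tuple

ScoredNode = Tuple[str, float]  # (text, similarity score)
Postprocessor = Callable[[List[ScoredNode]], List[ScoredNode]]


def similarity_cutoff(threshold: float) -> Postprocessor:
    """Build a postprocessor that drops nodes scoring below the threshold."""
    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [(text, score) for text, score in nodes if score >= threshold]
    return apply


def require_keyword(keyword: str) -> Postprocessor:
    """Build a postprocessor that keeps only nodes mentioning the keyword."""
    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [(text, score) for text, score in nodes
                if keyword.lower() in text.lower()]
    return apply


def run_postprocessors(nodes: List[ScoredNode],
                       postprocessors: List[Postprocessor]) -> List[ScoredNode]:
    """Apply each postprocessor in order; each sees the previous one's output."""
    for postprocess in postprocessors:
        nodes = postprocess(nodes)
    return nodes


retrieved = [
    ("Query engines combine retrieval and synthesis.", 0.91),
    ("Unrelated release notes.", 0.42),
    ("Retrieval returns scored nodes.", 0.78),
]
kept = run_postprocessors(
    retrieved, [similarity_cutoff(0.5), require_keyword("synthesis")]
)
print(kept)
```

Order matters in such a chain: putting a cheap filter such as the score cutoff first reduces the work left for any later, more expensive step such as a re-ranker.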
Theoretical Basis
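Query engine creation is a single configuration call applied to a RAG pipeline. Every option is passed to as_query_engine() as a keyword argument, and the returned object hides the retriever and synthesizer behind one interface, which is the facade described above rather than a step-by-step builder: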
# Abstract algorithm (not real code)
query_engine = index.as_query_engine(
    llm=language_model,
    response_mode=synthesis_strategy,        # compact | refine | tree_summarize
    node_postprocessors=[reranker, filter],  # optional processing pipeline
    similarity_top_k=retrieval_count,        # number of nodes to retrieve
    streaming=enable_streaming,              # token-by-token output
)
# query_engine is now a BaseQueryEngine ready for .query() calls
The three primary response modes represent different strategies for synthesizing answers from multiple retrieved context chunks:
| Response Mode | Strategy | Best For |
|---|---|---|
| compact | Packs retrieved chunks into as few prompts as fit the LLM context window, refining across prompts only if more than one is needed, which reduces the number of LLM calls | General-purpose queries with moderate context |
| refine | Iterates through each node sequentially, refining the answer with each additional piece of context | Detailed answers requiring thorough consideration of all evidence |
| tree_summarize | Recursively summarizes nodes in a bottom-up tree structure, then synthesizes from summaries | Large result sets where hierarchical summarization reduces noise |
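The response mode selection affects both answer quality and latency. compact typically needs the fewest LLM calls, so it is usually the cheapest and fastest option. refine makes one LLM call per retrieved chunk, so its cost and latency grow with the number of chunks in exchange for considering each chunk individually. tree_summarize balances breadth with coherence for large retrieval sets.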
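The difference in LLM call counts between the three modes can be illustrated with a toy model. This sketch is not the library's synthesizer code: `CountingLLM`, `pack`, and the three mode functions are hypothetical, and packing is done by character count here, whereas a real synthesizer would budget against the model's token limit.

```python
from typing import List


class CountingLLM:
    """Stand-in LLM that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"summary#{self.calls}"


def pack(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily merge chunks into as few blocks as fit within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def refine(llm, query: str, chunks: List[str]) -> str:
    """One call per chunk; each call revises the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Q: {query}\nExisting answer: {answer}\nNew context: {chunk}")
    return answer


def compact(llm, query: str, chunks: List[str], max_chars: int) -> str:
    """Pack chunks first, then refine over the (fewer) packed blocks."""
    return refine(llm, query, pack(chunks, max_chars))


def tree_summarize(llm, query: str, chunks: List[str], fan_in: int = 2) -> str:
    """Summarize groups of fan_in items level by level until one remains."""
    if not chunks or fan_in < 2:
        raise ValueError("need at least one chunk and fan_in >= 2")
    level = list(chunks)
    while True:
        level = [
            llm(f"Q: {query}\nSummarize:\n" + "\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def calls_used(mode, *args) -> int:
    """Run a mode with a fresh counting LLM and return the number of calls."""
    llm = CountingLLM()
    mode(llm, "q", *args)
    return llm.calls


chunks = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
print("compact:", calls_used(compact, chunks, 25))
print("refine:", calls_used(refine, chunks))
print("tree_summarize:", calls_used(tree_summarize, chunks))
```

With four 10-character chunks and a 25-character budget, two chunks fit per packed block, so this toy `compact` needs two calls where `refine` needs four. The exact numbers depend entirely on the assumed chunk sizes, budget, and fan-in.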