Principle:Run llama Llama index Query Engine Creation

From Leeroopedia
Knowledge Sources
Domains RAG, LLM_Integration
Last Updated 2026-02-11 00:00 GMT

Overview

A composition step that creates a unified query interface over indexed data by combining a retriever with a response synthesizer into a single callable object.

Description

Query engine creation is the pivotal step that transforms an index from a passive data structure into an active question-answering system. The principle follows the facade pattern: a single as_query_engine() call on any index composes two distinct subsystems -- retrieval (finding relevant nodes) and response synthesis (using an LLM to generate an answer from those nodes) -- into one cohesive interface. Under the hood, the index constructs a RetrieverQueryEngine that orchestrates the full retrieve-then-synthesize pipeline.
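The composition described above can be sketched with a toy model. This is a minimal illustration, not the library's implementation: ToyRetriever, ToySynthesizer, ToyQueryEngine, and fake_llm are hypothetical names invented here, and word overlap stands in for real similarity search.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    text: str
    score: float = 0.0


class ToyRetriever:
    """Hypothetical retriever: ranks nodes by word overlap with the query."""

    def __init__(self, nodes: List[Node], top_k: int = 2) -> None:
        self.nodes = nodes
        self.top_k = top_k

    def retrieve(self, query: str) -> List[Node]:
        query_words = set(query.lower().split())
        scored = [
            Node(n.text, float(len(query_words & set(n.text.lower().split()))))
            for n in self.nodes
        ]
        scored.sort(key=lambda n: n.score, reverse=True)
        return scored[: self.top_k]


class ToySynthesizer:
    """Hypothetical synthesizer: hands the query and context to an LLM callable."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm

    def synthesize(self, query: str, nodes: List[Node]) -> str:
        context = "\n".join(n.text for n in nodes)
        return self.llm(f"Context:\n{context}\n\nQuestion: {query}")


class ToyQueryEngine:
    """Facade: one query() call hides the retrieve-then-synthesize pipeline."""

    def __init__(self, retriever: ToyRetriever, synthesizer: ToySynthesizer) -> None:
        self.retriever = retriever
        self.synthesizer = synthesizer

    def query(self, query: str) -> str:
        return self.synthesizer.synthesize(query, self.retriever.retrieve(query))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: reports how many context lines it received.
    context = prompt.split("\n\nQuestion:")[0].splitlines()[1:]
    return f"answer based on {len(context)} chunk(s)"


nodes = [Node("cats purr when content"), Node("dogs bark loudly"), Node("cats sleep a lot")]
engine = ToyQueryEngine(ToyRetriever(nodes, top_k=2), ToySynthesizer(fake_llm))
print(engine.query("why do cats purr"))
```

The caller only ever touches engine.query(); swapping the retriever or the synthesizer does not change the calling code, which is the point of the facade.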

The key design decision is that the query engine encapsulates all configuration at creation time. The response mode determines how the LLM processes retrieved context, the node postprocessors filter or re-rank nodes before synthesis, and optional templates control the prompts sent to the LLM. Once created, the query engine is ready for repeated use without reconfiguration.
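To make the role of node postprocessors concrete, the following sketch chains two filters between retrieval and synthesis. It is an illustration under assumptions: similarity_cutoff, require_keyword, and run_postprocessors are hypothetical helpers written for this example, and nodes are modeled as plain (text, score) tuples rather than the library's node objects.

```python
from typing import Callable, List, Tuple

ScoredNode = Tuple[str, float]  # (text, similarity score)
Postprocessor = Callable[[List[ScoredNode]], List[ScoredNode]]


def similarity_cutoff(threshold: float) -> Postprocessor:
    """Drop nodes whose score falls below the threshold."""
    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [(text, score) for text, score in nodes if score >= threshold]
    return apply


def require_keyword(keyword: str) -> Postprocessor:
    """Keep only nodes whose text mentions the keyword (case-insensitive)."""
    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        return [(text, score) for text, score in nodes if keyword.lower() in text.lower()]
    return apply


def run_postprocessors(
    nodes: List[ScoredNode], postprocessors: List[Postprocessor]
) -> List[ScoredNode]:
    """Apply each postprocessor in order, feeding the output of one into the next."""
    for postprocess in postprocessors:
        nodes = postprocess(nodes)
    return nodes


retrieved = [
    ("Query engines combine retrieval and synthesis.", 0.91),
    ("Unrelated release notes.", 0.42),
    ("A retrieval step returns scored nodes.", 0.78),
    ("Synthesis prompts can be customized.", 0.80),
]
# The pipeline is fixed once, mirroring configuration at creation time.
pipeline = [similarity_cutoff(0.5), require_keyword("retrieval")]
kept = run_postprocessors(retrieved, pipeline)
print([text for text, _ in kept])
```

Because the pipeline list is built once and reused, every later query passes through the same filters in the same order without further setup.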

Usage

Use this principle after building or loading an index and before executing queries. Select the appropriate configuration based on:

  • Response mode: Choose compact (default) for efficiency, refine for iterative multi-pass synthesis, or tree_summarize for hierarchical summarization of large result sets
  • Node postprocessors: Add re-rankers, keyword filters, or similarity cutoffs to improve retrieval quality before synthesis
  • Custom LLM: Override the default LLM at query engine creation time for different generation characteristics
  • Streaming: Enable streaming mode for real-time token-by-token output to the user
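The difference the streaming option makes for the caller can be sketched with plain generators. This is a toy model only: fake_token_stream, blocking_response, and streaming_response are hypothetical names, and splitting on spaces stands in for real LLM tokenization.

```python
from typing import Iterator, List


def fake_token_stream(answer: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM that emits one token at a time."""
    for token in answer.split(" "):
        yield token + " "


def blocking_response(answer: str) -> str:
    """Non-streaming: the caller sees nothing until the full answer is assembled."""
    return "".join(fake_token_stream(answer)).strip()


def streaming_response(answer: str) -> Iterator[str]:
    """Streaming: the caller receives each token as soon as it is produced."""
    yield from fake_token_stream(answer)


shown: List[str] = []
for token in streaming_response("retrieval finds nodes first"):
    shown.append(token)  # a UI would render the token here
    print(token, end="", flush=True)
print()
```

Both paths produce the same final text; streaming only changes when the caller gets to see it, which matters mostly for interactive interfaces.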

Theoretical Basis

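Query engine creation applies the facade idea to RAG pipelines through one factory-style call: every configuration option is passed as a keyword argument, and the call returns the composed engine: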

# Abstract sketch (not runnable as-is): the right-hand names are placeholders
query_engine = index.as_query_engine(
    llm=language_model,
    response_mode=synthesis_strategy,       # compact | refine | tree_summarize
    node_postprocessors=[reranker, filter],  # optional processing pipeline
    similarity_top_k=retrieval_count,        # number of nodes to retrieve
    streaming=enable_streaming,              # token-by-token output
)
# query_engine is now a BaseQueryEngine ready for .query() calls

The three primary response modes represent different strategies for synthesizing answers from multiple retrieved context chunks:

| Response Mode | Strategy | Best For |
|---|---|---|
| compact | Stuffs as many nodes as possible into a single LLM call, reducing the number of API calls | General-purpose queries with moderate context |
| refine | Iterates through each node sequentially, refining the answer with each additional piece of context | Detailed answers requiring thorough consideration of all evidence |
| tree_summarize | Recursively summarizes nodes in a bottom-up tree structure, then synthesizes from summaries | Large result sets where hierarchical summarization reduces noise |

The response mode selection directly affects answer quality, latency, and cost. compact minimizes the number of LLM calls, which makes it the cheapest option. refine makes roughly one call per retrieved node, so every chunk is considered in turn, at the price of latency that grows with the number of nodes. tree_summarize balances breadth with coherence for large retrieval sets.
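The cost trade-off can be illustrated by counting LLM calls in a toy model of the three strategies. This is a sketch under stated assumptions, not the library's code: CountingLLM, batches, and the per_call parameter (how many chunks fit into one prompt) are hypothetical, and compact is modeled as packing chunks first and then refining over the packed prompts.

```python
from typing import List


class CountingLLM:
    """Hypothetical LLM stub that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer#{self.calls}"


def batches(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def refine(chunks: List[str], llm: CountingLLM) -> str:
    """One LLM call per chunk; each call revises the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"previous: {answer}\ncontext: {chunk}")
    return answer


def compact(chunks: List[str], llm: CountingLLM, per_call: int) -> str:
    """Pack chunks into as few prompts as fit, then refine over the packed prompts."""
    packed = ["\n".join(group) for group in batches(chunks, per_call)]
    return refine(packed, llm)


def tree_summarize(chunks: List[str], llm: CountingLLM, per_call: int) -> str:
    """Summarize groups of chunks bottom-up until a single root answer remains."""
    if per_call < 2:
        raise ValueError("per_call must be at least 2 for the tree to shrink")
    level = chunks
    while True:
        level = [llm("summarize:\n" + "\n".join(group)) for group in batches(level, per_call)]
        if len(level) <= 1:
            return level[0] if level else ""


chunks = [f"chunk {i}" for i in range(8)]
call_counts = {}
for name, run in [
    ("refine", lambda llm: refine(chunks, llm)),
    ("compact", lambda llm: compact(chunks, llm, per_call=4)),
    ("tree_summarize", lambda llm: tree_summarize(chunks, llm, per_call=4)),
]:
    llm = CountingLLM()
    run(llm)
    call_counts[name] = llm.calls
print(call_counts)  # {'refine': 8, 'compact': 2, 'tree_summarize': 3}
```

With eight chunks and four chunks per prompt, the toy refine makes eight calls, compact two, and tree_summarize three (two leaf summaries plus one root). Real call counts depend on chunk sizes and the model's context window.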

Related Pages

Implemented By
