

Heuristic:PacktPublishing LLM Engineers Handbook RAG Retrieval Parameters

From Leeroopedia




Knowledge Sources
Domains: RAG, Information_Retrieval, Optimization
Last Updated: 2026-02-08 08:00 GMT

Overview

The RAG retrieval pipeline defaults to k=3 retrieved documents and 3-way query expansion, splits k evenly across the data categories, and uses temperature 0 so that query reformulation is deterministic.

Description

This heuristic captures the retrieval parameter choices for the RAG (Retrieval-Augmented Generation) pipeline. The pipeline uses a four-stage architecture: self-query metadata extraction, query expansion (3 variants), vector similarity search across 3 data categories, and cross-encoder reranking. Key design decisions include enforcing a minimum k of 3 (one per category), evenly splitting k across categories, and using temperature 0 for deterministic query expansion.

Usage

Use this heuristic when configuring or tuning the RAG pipeline for the RAG Inference workflow. The parameters balance recall (via query expansion) with precision (via reranking) and ensure balanced representation across content types (posts, articles, repositories).

The Insight (Rule of Thumb)

  • Action: Set k=3 minimum, expand queries to 3 variants, use temperature=0 for query processing, and split k evenly across data categories.
  • Value:
    • `k` = 3 (default, minimum enforced by assertion)
    • `expand_to_n_queries` = 3
    • Query expansion temperature = 0.0 (deterministic)
    • Self-query temperature = 0.0 (deterministic)
    • Search limit per category = `k // 3`
    • Production inference temperature = 0.01 (nearly deterministic)
    • Production `top_p` = 0.9
    • Production `max_new_tokens` = 150
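  • Trade-off: With k=3 and 3 categories, each category search is limited to exactly 1 document. Because `k // 3` is integer division, a k that is not a multiple of 3 is rounded down per category (k=4 still gives a limit of 1). For broader results, use k=9 (as in the RAG testing tool). Expanding each query to 3 variants roughly triples the search cost in exchange for better recall on ambiguous queries.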
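
The arithmetic behind these values can be checked with a short sketch. The helper names `per_category_limit` and `max_candidates` are hypothetical, and the candidate count assumes that every expanded query is searched against every category before reranking.

```python
N_CATEGORIES = 3  # posts, articles, repositories


def per_category_limit(k: int) -> int:
    """Search limit per category, mirroring the `k // 3` split."""
    assert k >= 3, "k should be >= 3"
    return k // N_CATEGORIES


def max_candidates(k: int, n_queries: int = 3) -> int:
    """Upper bound on candidates gathered before reranking (assumed flow)."""
    return n_queries * N_CATEGORIES * per_category_limit(k)


for k in (3, 4, 9):
    print(k, per_category_limit(k), max_candidates(k))
# 3 1 9
# 4 1 9
# 9 3 27
```

Note that k=4 retrieves no more per category than k=3, so multiples of 3 are the values where raising k actually widens the search.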

Reasoning

The k >= 3 assertion ensures that the per-category limit `k // 3` is at least 1, so every data category (posts, articles, repositories) can contribute a candidate and the LLM receives a diverse context. Temperature=0 for query expansion and self-query makes retrieval results reproducible, which is critical for debugging and evaluation. The production inference temperature of 0.01 (not exactly 0) allows minimal variation while remaining mostly deterministic. The even `k // 3` split across categories prevents any single content type from dominating the context window.

Minimum k assertion from `llm_engineering/application/rag/retriever.py:64`:

assert k >= 3, "k should be >= 3"

Category-balanced search from `llm_engineering/application/rag/retriever.py:85`:

limit=k // 3,

Search defaults from `llm_engineering/application/rag/retriever.py:29-34`:

def search(
    self,
    query: str,
    k: int = 3,
    expand_to_n_queries: int = 3,
):

Deterministic query expansion from `llm_engineering/application/rag/query_expanison.py:14-22`:

model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY, temperature=0)

Production inference settings from `llm_engineering/settings.py:58-60`:

TEMPERATURE_INFERENCE: float = 0.01
TOP_P_INFERENCE: float = 0.9
MAX_NEW_TOKENS_INFERENCE: int = 150
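
As a rough illustration of how these three values might travel together, the sketch below bundles them into one immutable object and attaches them to a request. The `InferenceSampling` class, the `build_payload` helper, and the `inputs`/`parameters` payload shape are assumptions for this example, not the project's actual request format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceSampling:
    """Hypothetical container mirroring the three settings above."""

    temperature: float = 0.01   # nearly deterministic
    top_p: float = 0.9
    max_new_tokens: int = 150


def build_payload(prompt: str, sampling: InferenceSampling = InferenceSampling()) -> dict:
    """Assumed payload shape; adapt it to the serving endpoint in use."""
    return {"inputs": prompt, "parameters": asdict(sampling)}


payload = build_payload("Summarize the retrieved context.")
print(payload["parameters"])
```

Keeping the sampling values in a single frozen object makes it harder for one call site to drift from the documented production defaults.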
