Heuristic:PacktPublishing LLM Engineers Handbook RAG Retrieval Parameters

Knowledge Sources	LLM Engineers Handbook
Domains	RAG, Information_Retrieval, Optimization
Last Updated	2026-02-08 08:00 GMT

Overview

A RAG retrieval pipeline tuned with a default of k=3 documents, 3-way query expansion, an even split of k across data categories, and temperature 0 for deterministic query reformulation.

Description

This heuristic captures the retrieval parameter choices for the RAG (Retrieval-Augmented Generation) pipeline. The pipeline uses a four-stage architecture: self-query metadata extraction, query expansion (3 variants), vector similarity search across 3 data categories, and cross-encoder reranking. Key design decisions include enforcing a minimum k of 3 (one per category), evenly splitting k across categories, and using temperature 0 for deterministic query expansion.
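
The stage order described above can be sketched roughly as follows. This is a minimal illustration, not the handbook's implementation: every helper here (`self_query`, `expand_query`, `vector_search`, `rerank`, and the `Chunk` type) is a hypothetical stub, the dummy scores stand in for real similarity and cross-encoder scores, and both the "original query plus n - 1 rewrites" reading of `expand_to_n_queries` and the de-duplication of candidates before reranking are assumptions.

```python
from dataclasses import dataclass

CATEGORIES = ("posts", "articles", "repositories")


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved chunk; frozen so it can be de-duplicated in a set."""
    category: str
    text: str
    score: float


def self_query(query: str) -> dict:
    # Stage 1 (stub): a real pipeline would ask an LLM to extract metadata.
    return {"query": query, "author": None}


def expand_query(query: str, n: int) -> list[str]:
    # Stage 2 (stub): the original query plus n - 1 placeholder rewrites (assumption).
    return [query] + [f"{query} (variant {i})" for i in range(1, n)]


def vector_search(query: str, category: str, limit: int) -> list[Chunk]:
    # Stage 3 (stub): fabricate `limit` chunks for one category with dummy scores.
    return [
        Chunk(category, f"hit {i} for '{query}' in {category}", 1.0 / (i + 1))
        for i in range(limit)
    ]


def rerank(query: str, chunks: list[Chunk], keep_top_k: int) -> list[Chunk]:
    # Stage 4 (stub): order by the dummy score (ties broken by text)
    # instead of calling a cross-encoder.
    return sorted(chunks, key=lambda c: (-c.score, c.text))[:keep_top_k]


def search(query: str, k: int = 3, expand_to_n_queries: int = 3) -> list[Chunk]:
    assert k >= 3, "k should be >= 3"
    metadata = self_query(query)
    queries = expand_query(metadata["query"], expand_to_n_queries)
    # One search per (expanded query, category) pair, each capped at k // 3.
    candidates = {
        chunk
        for q in queries
        for category in CATEGORIES
        for chunk in vector_search(q, category, limit=k // 3)
    }
    return rerank(query, list(candidates), keep_top_k=k)
```

Calling `search("what is RAG?")` with the defaults runs 9 stub searches and returns 3 chunks; passing `k=2` trips the assertion.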

Usage

Use this heuristic when configuring or tuning the RAG pipeline for the RAG Inference workflow. The parameters balance recall (via query expansion) with precision (via reranking) and ensure balanced representation across content types (posts, articles, repositories).

The Insight (Rule of Thumb)

Action: Set k=3 minimum, expand queries to 3 variants, use temperature=0 for query processing, and split k evenly across data categories.
Value:
- `k` = 3 (default, minimum enforced by assertion)
- `expand_to_n_queries` = 3
- Query expansion temperature = 0.0 (deterministic)
- Self-query temperature = 0.0 (deterministic)
- Search limit per category = `k // 3`
- Production inference temperature = 0.01 (nearly deterministic)
- Production `top_p` = 0.9
- Production `max_new_tokens` = 150
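Trade-off: With k=3 and 3 categories, each category search returns exactly 1 document, the minimum the assertion allows. Because the per-category limit is `k // 3`, values of k that are not multiples of 3 round down (k=10 still yields 3 per category). For broader results, use k=9 (as in the RAG testing tool). Query expansion triples the number of vector searches but improves recall for ambiguous queries.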
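
The arithmetic behind these values can be checked with a small helper. `retrieval_budget` is a hypothetical function written for this page, and it assumes that every expanded query is searched once in every category, which is consistent with the "triples the search cost" statement but is not quoted from the source code.

```python
def retrieval_budget(k: int, expand_to_n_queries: int = 3, n_categories: int = 3) -> dict:
    """Estimate search counts for given retrieval parameters (illustrative only)."""
    assert k >= 3, "k should be >= 3"
    limit_per_category = k // n_categories  # floor division drops any remainder
    searches = expand_to_n_queries * n_categories
    return {
        "limit_per_category": limit_per_category,
        "results_per_query": limit_per_category * n_categories,
        "vector_searches": searches,
        "max_candidates": searches * limit_per_category,
    }


for k in (3, 9, 10):
    print(k, retrieval_budget(k))
```

For k=3 this gives a limit of 1 per category and 9 vector searches; for k=9 the limit is 3 and up to 27 candidates are collected; for k=10 the limit is still 3, so one requested slot is lost to floor division at the search stage.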

Reasoning

The k >= 3 assertion keeps the per-category limit `k // 3` at 1 or more, so each data category (posts, articles, repositories) can contribute at least one result and the LLM receives a diverse context; with k < 3 the limit would be 0 and every search would return nothing. Temperature=0 for query expansion and self-query makes retrieval results reproducible, which is critical for debugging and evaluation. The production inference temperature of 0.01 (not exactly 0) allows minimal variation while remaining mostly deterministic. The even `k // 3` split across categories prevents any single content type from dominating the context window.

Minimum k assertion from `llm_engineering/application/rag/retriever.py:64`:

assert k >= 3, "k should be >= 3"

Category-balanced search from `llm_engineering/application/rag/retriever.py:85`:

limit=k // 3,

Search defaults from `llm_engineering/application/rag/retriever.py:29-34`:

def search(
    self,
    query: str,
    k: int = 3,
    expand_to_n_queries: int = 3,
):

Deterministic query expansion from `llm_engineering/application/rag/query_expanison.py:14-22`:

model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY, temperature=0)

Production inference settings from `llm_engineering/settings.py:58-60`:

TEMPERATURE_INFERENCE: float = 0.01
TOP_P_INFERENCE: float = 0.9
MAX_NEW_TOKENS_INFERENCE: int = 150
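
As a rough sketch of how these three settings might be passed to a text-generation endpoint, the snippet below mirrors them in a plain dataclass and builds a request body. `InferenceSettings` and `build_generation_payload` are hypothetical names, and the `inputs`/`parameters` payload layout is an assumption modeled on the shape commonly used by Hugging Face text-generation endpoints, not something quoted from `settings.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical stand-in for the production values listed above."""
    TEMPERATURE_INFERENCE: float = 0.01
    TOP_P_INFERENCE: float = 0.9
    MAX_NEW_TOKENS_INFERENCE: int = 150


def build_generation_payload(prompt: str, settings: Optional[InferenceSettings] = None) -> dict:
    # Assumed payload layout: prompt under "inputs", sampling options under "parameters".
    settings = settings or InferenceSettings()
    return {
        "inputs": prompt,
        "parameters": {
            "temperature": settings.TEMPERATURE_INFERENCE,
            "top_p": settings.TOP_P_INFERENCE,
            "max_new_tokens": settings.MAX_NEW_TOKENS_INFERENCE,
        },
    }


payload = build_generation_payload("Summarize the retrieved context.")
```

Keeping the values in one settings object means the retrieval side (temperature 0) and the generation side (temperature 0.01) can be tuned independently.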
