Heuristic:PacktPublishing LLM Engineers Handbook RAG Retrieval Parameters
| Knowledge Sources | |
|---|---|
| Domains | RAG, Information_Retrieval, Optimization |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
The RAG retrieval pipeline defaults to k=3 retrieved documents, expands each query into 3 variants, splits k evenly across data categories, and uses temperature 0 for deterministic query reformulation.
Description
This heuristic captures the retrieval parameter choices for the RAG (Retrieval-Augmented Generation) pipeline. The pipeline uses a four-stage architecture: self-query metadata extraction, query expansion (3 variants), vector similarity search across 3 data categories, and cross-encoder reranking. Key design decisions include enforcing a minimum k of 3 (one per category), evenly splitting k across categories, and using temperature 0 for deterministic query expansion.
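As a rough illustration of how the four stages fit together, the sketch below wires them up with injected callables. Apart from `k` and `expand_to_n_queries`, every function and parameter name here is hypothetical. The toy stubs stand in for the real LLM, vector store, and cross-encoder, and de-duplicating candidates before reranking is an assumption rather than something this page confirms.

```python
from typing import Callable

CATEGORIES = ("posts", "articles", "repositories")


def retrieve(
    query: str,
    extract_metadata: Callable[[str], dict],
    expand: Callable[[str, int], list[str]],
    search: Callable[[str, str, int, dict], list[str]],
    rerank: Callable[[str, list[str], int], list[str]],
    k: int = 3,
    expand_to_n_queries: int = 3,
) -> list[str]:
    assert k >= 3, "k should be >= 3"

    # Stage 1: self-query metadata extraction.
    metadata = extract_metadata(query)
    # Stage 2: query expansion into n variants.
    queries = expand(query, expand_to_n_queries)

    # Stage 3: vector search, one call per (query variant, category) pair.
    candidates: list[str] = []
    for variant in queries:
        for category in CATEGORIES:
            candidates.extend(search(variant, category, k // 3, metadata))

    # Assumption: drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(candidates))
    # Stage 4: reranking, keeping the top k candidates.
    return rerank(query, unique, k)


# Toy stand-ins (hypothetical) so the sketch runs without external services.
def toy_extract_metadata(query: str) -> dict:
    return {}


def toy_expand(query: str, n: int) -> list[str]:
    return [f"{query} (variant {i})" for i in range(n)]


def toy_search(query: str, category: str, limit: int, metadata: dict) -> list[str]:
    return [f"{category}-{i}" for i in range(limit)]


def toy_rerank(query: str, docs: list[str], keep_top_k: int) -> list[str]:
    return sorted(docs)[:keep_top_k]


result = retrieve("what is RAG?", toy_extract_metadata, toy_expand, toy_search, toy_rerank)
print(result)
```

With the default k=3, each category search is asked for a single document, so the toy run ends up with one entry per category.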
Usage
Use this heuristic when configuring or tuning the RAG pipeline for the RAG Inference workflow. The parameters balance recall (via query expansion) with precision (via reranking) and ensure balanced representation across content types (posts, articles, repositories).
The Insight (Rule of Thumb)
- Action: Set k=3 minimum, expand queries to 3 variants, use temperature=0 for query processing, and split k evenly across data categories.
- Value:
  - `k` = 3 (default, minimum enforced by assertion)
  - `expand_to_n_queries` = 3
  - Query expansion temperature = 0.0 (deterministic)
  - Self-query temperature = 0.0 (deterministic)
  - Search limit per category = `k // 3`
  - Production inference temperature = 0.01 (nearly deterministic)
  - Production `top_p` = 0.9
  - Production `max_new_tokens` = 150
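- Trade-off: With k=3 and 3 categories, each category search is limited to exactly 1 document. Because the limit is `k // 3`, values of k that are not multiples of 3 round down (k=4 or k=5 still gives 1 per category). For broader results, use k=9 (as in the RAG testing tool), which gives 3 per category. Query expansion triples the search cost but can improve recall for ambiguous queries.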
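The arithmetic behind this trade-off can be sketched as follows. The function name and the returned keys are hypothetical, and the sketch assumes that every expanded query is searched against every category and that the reranker later trims the pooled candidates.

```python
N_CATEGORIES = 3  # posts, articles, repositories


def retrieval_budget(k: int = 3, expand_to_n_queries: int = 3) -> dict:
    assert k >= 3, "k should be >= 3"
    per_category_limit = k // N_CATEGORIES
    # Assumption: one vector search per (query variant, category) pair.
    vector_searches = expand_to_n_queries * N_CATEGORIES
    return {
        "per_category_limit": per_category_limit,
        "vector_searches": vector_searches,
        # Upper bound on candidates pooled before reranking.
        "max_candidates": vector_searches * per_category_limit,
    }


for k in (3, 5, 9):
    print(k, retrieval_budget(k))
```

Under these assumptions, raising k from 3 to 9 leaves the number of vector searches unchanged and only increases how many documents each search may return.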
Reasoning
The k >= 3 assertion ensures that each data category (posts, articles, repositories) is searched with a limit of at least one result, giving the LLM a diverse context; with k < 3, the per-category limit `k // 3` would be 0 and the searches would return nothing. Temperature=0 for query expansion and self-query keeps retrieval results reproducible, which is critical for debugging and evaluation. The production inference temperature of 0.01 (not exactly 0) allows minimal variation while remaining mostly deterministic. The even `k // 3` split across categories prevents any single content type from dominating the context window.
Minimum k assertion from `llm_engineering/application/rag/retriever.py:64`:
assert k >= 3, "k should be >= 3"
Category-balanced search from `llm_engineering/application/rag/retriever.py:85`:
limit=k // 3,
Search defaults from `llm_engineering/application/rag/retriever.py:29-34`:
def search(
    self,
    query: str,
    k: int = 3,
    expand_to_n_queries: int = 3,
):
Deterministic query expansion from `llm_engineering/application/rag/query_expanison.py:14-22`:
model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY, temperature=0)
Production inference settings from `llm_engineering/settings.py:58-60`:
TEMPERATURE_INFERENCE: float = 0.01
TOP_P_INFERENCE: float = 0.9
MAX_NEW_TOKENS_INFERENCE: int = 150
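To illustrate why a temperature of 0.01 behaves as nearly deterministic, the sketch below applies temperature scaling to a toy set of logits. The logits and the function name are made up for illustration; this is not code from the repository.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 0.5]  # hypothetical next-token scores
probs_low = softmax_with_temperature(toy_logits, 0.01)
probs_one = softmax_with_temperature(toy_logits, 1.0)
print(probs_low)
print(probs_one)
```

At 0.01 nearly all probability mass lands on the top-scoring token, while the division by the temperature shows why an exact 0 cannot be plugged into this formula directly.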
Related Pages
- Implementation:PacktPublishing_LLM_Engineers_Handbook_QueryExpansion_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_SelfQuery_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Search
- Implementation:PacktPublishing_LLM_Engineers_Handbook_Reranker_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_InferenceExecutor_Execute
- Principle:PacktPublishing_LLM_Engineers_Handbook_Query_Expansion
- Principle:PacktPublishing_LLM_Engineers_Handbook_Vector_Similarity_Search
- Principle:PacktPublishing_LLM_Engineers_Handbook_Cross_Encoder_Reranking