Heuristic:PacktPublishing LLM Engineers Handbook Temperature Selection By Task
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Prompt_Engineering, Optimization |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Task-specific temperature selection: 0.0 for metadata extraction and query expansion, 0.01 for production inference, 0.7 for dataset generation, 0.8 for evaluation, and 0.9 for LLM-as-judge scoring.
Description
This heuristic documents the intentional temperature choices across different LLM call sites in the project. Each temperature value reflects the task's tolerance for variation: structured extraction tasks use zero temperature for deterministic outputs, production inference uses near-zero for consistency, and creative generation tasks use higher temperatures to encourage diversity. The LLM-as-judge uses the highest temperature (0.9) to avoid systematic bias in evaluations.
Usage
Use this heuristic when adding new LLM API calls to the project or when debugging unexpected LLM behavior. Temperature is among the most impactful sampling parameters for balancing output consistency against diversity, and choosing the wrong value for a task is a common source of subtle bugs.
The Insight (Rule of Thumb)
- Action: Select temperature based on the task type:
- Value:
| Task | Temperature | Rationale |
|---|---|---|
| Self-query metadata extraction | 0.0 | Must produce consistent, parseable JSON |
| Query expansion | 0.0 | Must produce deterministic search reformulations |
| Production RAG inference | 0.01 | Minimal variation for user-facing responses |
| Dataset generation (instruction) | 0.7 | Needs creative but grounded responses |
| vLLM batch evaluation | 0.8 | Allows diverse model outputs for fair comparison |
| LLM-as-judge scoring | 0.9 | High variation prevents systematic scoring bias |
- Trade-off: Lower temperatures reduce output diversity (bad for generation tasks), while higher temperatures increase randomness (bad for extraction tasks). The 0.01 production setting is a compromise: deterministic enough for consistency but not exactly 0, which can cause degenerate repetition in some models.
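The mechanism behind this trade-off can be sketched with a plain softmax: logits are divided by the temperature before normalization, so the distribution sharpens toward the top token as T approaches 0 and flattens as T grows. A minimal illustration (not code from the repository):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before softmax; low T sharpens, high T flattens."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.01)  # near-greedy: almost all mass on the top token
warm = softmax_with_temperature(logits, 0.9)   # diverse: mass spread across several tokens
```

At T=0.01 the top token's probability is effectively 1.0, which is why near-zero temperatures behave deterministically while still allowing an occasional tie-break, whereas at T=0.9 lower-ranked tokens retain meaningful probability.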
Reasoning
Metadata extraction and query expansion produce structured outputs (JSON, keyword lists) that downstream components parse programmatically. Any randomness here causes parsing failures or inconsistent retrieval results. Dataset generation needs diversity to avoid repetitive training data, so 0.7 provides good variety while staying grounded in the source context. The LLM-as-judge temperature of 0.9 is deliberately high: at low temperatures, judges tend to converge on the same scores for similar-quality outputs, reducing the signal-to-noise ratio. The 0.01 production inference temperature avoids the "temperature 0 repetition trap" where some models get stuck in loops.
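The per-task values above could be centralized in a single lookup so that new call sites pick them up consistently. The helper below is a hypothetical sketch for illustration, not code from the repository:

```python
# Hypothetical helper (not part of the repo): maps each documented task
# to its intended temperature so call sites cannot drift apart.
TASK_TEMPERATURES = {
    "self_query_metadata": 0.0,   # deterministic, parseable JSON
    "query_expansion": 0.0,       # deterministic search reformulations
    "rag_inference": 0.01,        # near-zero avoids the T=0 repetition trap
    "dataset_generation": 0.7,    # creative but grounded
    "vllm_evaluation": 0.8,       # diverse outputs for fair comparison
    "llm_as_judge": 0.9,          # variation reduces systematic scoring bias
}

def temperature_for(task: str) -> float:
    """Fail loudly on unknown tasks rather than silently defaulting."""
    try:
        return TASK_TEMPERATURES[task]
    except KeyError:
        raise ValueError(f"No temperature policy for task {task!r}") from None
```

Failing on unknown task names surfaces missing policy decisions at development time instead of shipping an arbitrary default temperature.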
Self-query temperature from `llm_engineering/application/rag/self_query.py:21`:
```python
model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY, temperature=0)
```
Query expansion temperature from `llm_engineering/application/rag/query_expanison.py:14-22`:
```python
model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY, temperature=0)
```
Production inference temperature from `llm_engineering/settings.py:58`:
```python
TEMPERATURE_INFERENCE: float = 0.01
```
Dataset generation temperature from `llm_engineering/application/dataset/generation.py:117-122`:
```python
llm = ChatOpenAI(
    model=settings.OPENAI_MODEL_ID,
    api_key=settings.OPENAI_API_KEY,
    max_tokens=2000 if cls.dataset_type == DatasetType.PREFERENCE else 1200,
    temperature=0.7,
)
```
vLLM evaluation temperature from `llm_engineering/model/evaluation/evaluate.py:42`:
```python
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=2048)
```
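Alongside temperature, the vLLM call above combines `top_p` (nucleus) and `min_p` filtering. A rough pure-Python sketch of how they interact, assuming vLLM's semantics where `min_p` is a threshold relative to the most likely token's probability:

```python
def filter_top_p_min_p(probs, top_p=0.95, min_p=0.05):
    """Return indices of tokens that survive nucleus (top_p) filtering
    followed by a min_p cutoff relative to the max token probability."""
    # Nucleus step: keep the smallest prefix of tokens (sorted by probability)
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # min_p step: drop tokens below min_p * (probability of the top token).
    floor = min_p * probs[order[0]]
    return [i for i in kept if probs[i] >= floor]

kept = filter_top_p_min_p([0.5, 0.3, 0.15, 0.04, 0.01])
```

At temperature 0.8 the distribution stays fairly flat, so both filters matter: `top_p=0.95` trims the long tail while `min_p=0.05` removes tokens that are negligible next to the top candidate.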
LLM-as-judge temperature from `llm_engineering/model/evaluation/evaluate.py:90-102`:
```python
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    response_format={"type": "json_object"},
    max_tokens=1000,
    temperature=0.9,
)
```
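A common way to exploit a high-temperature judge is to sample it several times and average the scores, letting the per-call variation cancel out. The sketch below is illustrative only: it stubs the real GPT-4o-mini call with random jitter rather than hitting the API.

```python
import random

def judge_once(answer: str) -> int:
    """Stand-in for the real temperature-0.9 judge call; the score varies
    between runs, simulated here with random jitter on a 1-5 scale."""
    base = 4 if "relevant" in answer else 2
    return max(1, min(5, base + random.choice([-1, 0, 0, 1])))

def judge_mean(answer: str, n: int = 5) -> float:
    """Average several high-temperature judgments to get a stabler score."""
    return sum(judge_once(answer) for _ in range(n)) / n

random.seed(0)
score = judge_mean("a relevant, grounded answer")
```

Averaging recovers a stable estimate while preserving the spread between similar-quality outputs that a low-temperature judge would collapse.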
Related Pages
- Implementation:PacktPublishing_LLM_Engineers_Handbook_SelfQuery_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_QueryExpansion_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_InferenceExecutor_Execute
- Implementation:PacktPublishing_LLM_Engineers_Handbook_DatasetGenerator_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_VLLM_LLM_Generate
- Implementation:PacktPublishing_LLM_Engineers_Handbook_OpenAI_Chat_Completions
- Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_Dataset_Generation
- Principle:PacktPublishing_LLM_Engineers_Handbook_Batch_Inference_Generation
- Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation