Heuristic: Greedy Decoding with Temperature Zero (LLMBook-zh)
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Set temperature=0 for deterministic (greedy) decoding when reproducibility and factual accuracy are the priority.
Description
The vLLM inference example uses `temperature=0`, which produces greedy decoding: at each step, the token with the highest probability is selected. This eliminates randomness in generation, making output deterministic and reproducible. The example also sets `presence_penalty=0` and `frequency_penalty=0`, so no penalty is applied to repeated tokens.
Usage
Use temperature=0 for evaluation, benchmarking, and factual question answering where reproducibility is critical. Switch to temperature > 0 (e.g., 0.7-1.0) for creative text generation, brainstorming, or diversity in responses. Use top-p sampling as an alternative when you want controlled randomness.
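As an illustration of the top-p alternative mentioned above, here is a minimal, self-contained sketch of nucleus (top-p) sampling in plain Python. This is not vLLM's implementation; the function name and structure are illustrative only.

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=1.0, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability exceeds p (nucleus sampling)."""
    # Temperature-scaled softmax (subtract max for numerical stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort tokens by probability and keep the smallest prefix whose
    # cumulative mass reaches p (the "nucleus")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize over the nucleus and draw one token from it
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a small p the nucleus can shrink to a single dominant token, at which point top-p behaves like greedy decoding; larger p admits more tokens and hence more diversity.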
The Insight (Rule of Thumb)
- Action: Set `temperature=0` in `SamplingParams` for deterministic output.
- Value: `temperature=0`, `presence_penalty=0`, `frequency_penalty=0`, `max_tokens=2048`.
- Trade-off: Greedy decoding is deterministic but may produce repetitive or generic text for open-ended tasks. For creative tasks, use `temperature=0.7`-`1.0` with `top_p=0.9`.
- Max Tokens: `max_tokens=2048` caps the number of newly generated tokens, keeping prompt plus generation within the model's trained context window.
Reasoning
Temperature scales the logits before softmax: `softmax(logits / T)`. In the limit T → 0 the distribution collapses onto the argmax, so implementations treat `temperature=0` as greedy decoding. At T = 1.0, the standard softmax probabilities are used. At T > 1.0, the distribution flattens (more random); at 0 < T < 1.0, it sharpens toward the most likely token. For LLM evaluation and Q&A tasks, greedy decoding produces the most likely answer, which is typically the most accurate. The example uses the LLaMA-2-Chat prompt format (`[INST] ... [/INST]`), which this chat-tuned model variant expects.
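The scaling formula above can be sketched in a few lines of plain Python; the function names are illustrative, not from the source code.

```python
import math

def softmax_with_temperature(logits, T):
    """softmax(logits / T); as T -> 0 this approaches a one-hot
    distribution on the argmax (greedy decoding)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """The T = 0 special case: deterministically pick the
    highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]
# Lowering T sharpens the distribution toward the argmax;
# raising T flattens it toward uniform.
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 10.0)
```

Comparing `sharp` and `flat` shows why low temperatures favor the most likely (often most accurate) answer while high temperatures increase diversity.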
Code Evidence:
Sampling parameters from `code/9.1 vLLM实践.py:14-19`:
```python
sampling_params = vllm.SamplingParams(
    temperature=0,        # temperature 0 means greedy search
    max_tokens=2048,      # cap on the number of newly generated tokens
    presence_penalty=0,   # presence penalty coefficient
    frequency_penalty=0,  # frequency penalty coefficient
)
```
LLaMA-2-Chat prompt format from `code/9.1 vLLM实践.py:4-8`:
```python
# Three prompts in the LLaMA-2-Chat format
prompts = [
    '[INST] How are you? [/INST]',
    '[INST] 1 + 1 = ? [/INST]',
    '[INST] Can you tell me a joke? [/INST]',
]
```