Heuristic: Greedy Decoding with Temperature Zero (LLMBook-zh)
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Set temperature=0 for deterministic (greedy) decoding when reproducibility and factual accuracy are the priority.
Description
The vLLM inference example uses `temperature=0`, which produces greedy decoding: at each step, the token with the highest probability is selected. This eliminates randomness in generation, making output deterministic and reproducible. The example also sets `presence_penalty=0` and `frequency_penalty=0`, so no penalty is applied to repeated tokens.
Usage
Use temperature=0 for evaluation, benchmarking, and factual question answering where reproducibility is critical. Switch to temperature > 0 (e.g., 0.7-1.0) for creative text generation, brainstorming, or diversity in responses. Use top-p sampling as an alternative when you want controlled randomness.
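As an illustration of the top-p alternative mentioned above, here is a minimal, self-contained sketch of nucleus (top-p) sampling in plain Python. This is not vLLM's implementation; the function name and structure are illustrative only.

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=1.0, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability exceeds p (nucleus sampling)."""
    # Temperature-scaled softmax (subtract max for numerical stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort tokens by probability and keep the smallest prefix whose
    # cumulative mass reaches p (the "nucleus")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize over the nucleus and draw one token from it
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a small p the nucleus can shrink to a single dominant token, at which point top-p behaves like greedy decoding; larger p admits more tokens and hence more diversity.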
The Insight (Rule of Thumb)
- Action: Set `temperature=0` in `SamplingParams` for deterministic output.
- Value: `temperature=0`, `presence_penalty=0`, `frequency_penalty=0`, `max_tokens=2048`.
- Trade-off: Greedy decoding is deterministic but may produce repetitive or generic text for open-ended tasks. For creative tasks, use `temperature=0.7`-`1.0` with `top_p=0.9`.
- Max Tokens: `max_tokens=2048` caps the number of newly generated tokens, keeping prompt plus generation within the model's trained context window.
Reasoning
Temperature scales the logits before softmax: `softmax(logits / T)`. In the limit T → 0 the distribution collapses onto the argmax, so implementations treat `temperature=0` as greedy decoding. At T = 1.0, the standard softmax probabilities are used. At T > 1.0, the distribution flattens (more random); at 0 < T < 1.0, it sharpens toward the most likely token. For LLM evaluation and Q&A tasks, greedy decoding produces the most likely answer, which is typically the most accurate. The example uses the LLaMA-2-Chat prompt format (`[INST] ... [/INST]`), which this chat-tuned model variant expects.
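The scaling formula above can be sketched in a few lines of plain Python; the function names are illustrative, not from the source code.

```python
import math

def softmax_with_temperature(logits, T):
    """softmax(logits / T); as T -> 0 this approaches a one-hot
    distribution on the argmax (greedy decoding)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """The T = 0 special case: deterministically pick the
    highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]
# Lowering T sharpens the distribution toward the argmax;
# raising T flattens it toward uniform.
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 10.0)
```

Comparing `sharp` and `flat` shows why low temperatures favor the most likely (often most accurate) answer while high temperatures increase diversity.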
Code Evidence:
Sampling parameters from `code/9.1 vLLM实践.py:14-19`:
```python
sampling_params = vllm.SamplingParams(
    temperature=0,        # temperature 0 means greedy search
    max_tokens=2048,      # cap on the number of newly generated tokens
    presence_penalty=0,   # presence penalty coefficient
    frequency_penalty=0,  # frequency penalty coefficient
)
```
LLaMA-2-Chat prompt format from `code/9.1 vLLM实践.py:4-8`:
```python
# Three prompts in the LLaMA-2-Chat format
prompts = [
    '[INST] How are you? [/INST]',
    '[INST] 1 + 1 = ? [/INST]',
    '[INST] Can you tell me a joke? [/INST]',
]
```