# Heuristic: LMSYS FastChat Greedy Decoding Temperature Threshold
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
Inference sampling heuristic that uses `temperature < 1e-5` or `top_p < 1e-8` as the threshold for greedy decoding, samples top-2 tokens for sentence completion recovery, and applies a T5-specific repetition penalty default of 1.2.
## Description
FastChat's inference engine uses several numeric thresholds and fallback mechanisms for robust text generation. Temperature values below 1e-5 are treated as greedy (argmax) decoding because `TemperatureLogitsWarper` doesn't accept 0.0. The engine always samples top-2 tokens, even in greedy mode, to enable a sentence completion recovery mechanism: if generation stops mid-sentence (EOS token produced), it can substitute the second-best token to continue the sentence. MPS devices require an additional workaround where logits are moved to CPU for sampling to avoid MPS backend bugs.
## Usage
Use this heuristic when configuring inference parameters or debugging unexpected generation behavior (truncated outputs, repetitive text). The thresholds and recovery mechanisms handle edge cases in production serving.
## The Insight (Rule of Thumb)
- Temperature threshold: `temperature < 1e-5` triggers greedy decoding; temperatures at or above `1e-5` (other than exactly `1.0`) apply temperature scaling; `temperature == 1.0` is a no-op and is skipped.
- Top-p threshold: `top_p < 1e-8` also triggers greedy decoding; `1e-8 <= top_p < 1.0` applies nucleus sampling.
- Top-2 sampling: Always sample 2 tokens (greedy: `torch.topk(2)`; sampling: `torch.multinomial(num_samples=2)`) to enable sentence completion fallback.
- T5 repetition penalty: Hardcoded default of 1.2 for T5 models when no penalty is specified.
- MPS workaround: Move logits to CPU as float32 before sampling to avoid MPS backend bugs.
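The threshold logic above can be condensed into a small pure-Python sketch. The function names are illustrative, not FastChat's API; the numeric comparisons mirror the rules listed above.

```python
def decoding_mode(temperature: float, top_p: float) -> str:
    """Classify the decoding mode using FastChat's numeric thresholds:
    temperature < 1e-5 or top_p < 1e-8 -> greedy (argmax); otherwise sampling."""
    if temperature < 1e-5 or top_p < 1e-8:
        return "greedy"
    return "sample"


def active_warpers(temperature: float, top_p: float, top_k: int,
                   repetition_penalty: float) -> list:
    """List which logits processors would be installed (illustrative labels)."""
    processors = []
    # temperature == 1.0 is a no-op, and 0.0 is rejected by the HF warper,
    # so both cases are skipped.
    if temperature >= 1e-5 and temperature != 1.0:
        processors.append("temperature")
    if repetition_penalty > 1.0:
        processors.append("repetition_penalty")
    if 1e-8 <= top_p < 1.0:
        processors.append("top_p")
    if top_k > 0:
        processors.append("top_k")
    return processors
```

Note that `temperature = 0.7, top_p = 1e-9` still selects greedy decoding: either threshold alone is enough.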
## Reasoning
Temperature thresholds: HuggingFace's `TemperatureLogitsWarper` raises an error on `temperature=0.0`. Using `1e-5` as the cutoff provides a practical boundary between "deterministic" and "stochastic" generation while avoiding the API limitation.
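A quick worked example shows why temperatures near the cutoff are effectively deterministic anyway: temperature-scaled softmax collapses to a one-hot (argmax) distribution as the temperature approaches zero. This is a pure-Python illustration, not FastChat code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; smaller temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
# Near the 1e-5 cutoff the distribution is effectively one-hot (argmax):
sharp = softmax_with_temperature(logits, 1e-5)
# At temperature 1.0 the distribution is the plain softmax:
plain = softmax_with_temperature(logits, 1.0)
```

So treating sub-`1e-5` temperatures as pure argmax changes essentially nothing in the output distribution while sidestepping the `temperature=0.0` API error.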
Top-2 fallback: The sentence completion hack addresses a common LLM failure mode where the model emits EOS prematurely, cutting off a sentence. By always having a backup token, the system can substitute it and continue generating. The `judge_sent_end` parameter controls this behavior, and it uses the `is_sentence_complete()` utility to detect incomplete sentences.
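A minimal sketch of that recovery path, using a hypothetical `is_sentence_complete` that only checks for terminal punctuation (FastChat's real utility handles more cases):

```python
def is_sentence_complete(output: str) -> bool:
    # Hypothetical stand-in for FastChat's utility: terminal punctuation only.
    return output.rstrip().endswith((".", "!", "?", '"'))


def recover_sentence(output, output_ids, tokens, stopped, judge_sent_end=True):
    """If generation stopped mid-sentence, swap in the second-best token
    and clear the stop flag so decoding continues."""
    sent_interrupt = False
    if judge_sent_end and stopped and not is_sentence_complete(output):
        if len(tokens) > 1:
            output_ids[-1] = tokens[1]   # substitute the runner-up token
        else:
            output_ids.pop()             # no backup token: drop the EOS and retry
        stopped = False
        sent_interrupt = True
    return output_ids, stopped, sent_interrupt
```

When the output already ends a sentence, the function leaves `stopped` set and generation terminates normally.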
T5 default penalty: T5 (encoder-decoder) models tend to be more repetitive than decoder-only models at the default `repetition_penalty=1.0`, so FastChat overrides it to 1.2.
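For context, the CTRL-style repetition penalty that `RepetitionPenaltyLogitsProcessor` implements divides positive logits of previously generated tokens by the penalty and multiplies negative ones, pushing repeated tokens down either way. A pure-Python sketch of that semantics (not the HF implementation):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize previously generated tokens, CTRL-style: positive logits
    are divided by the penalty, negative ones multiplied, so both moves
    lower the repeated token's score."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

With `penalty=1.2`, a repeated token's logit of `2.0` drops to about `1.67`, which is usually enough to break T5's repetition loops without distorting the distribution much.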
MPS to CPU: Apple's MPS backend has known bugs with certain tensor operations. Moving logits to CPU for the final sampling step avoids these issues at minimal performance cost.
## Code Evidence
Temperature and top_p thresholds from `fastchat/serve/inference.py:49-57`:
```python
# TemperatureLogitsWarper doesn't accept 0.0, 1.0 makes it a no-op so we skip two cases.
if temperature >= 1e-5 and temperature != 1.0:
    processor_list.append(TemperatureLogitsWarper(temperature))
if repetition_penalty > 1.0:
    processor_list.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
if 1e-8 <= top_p < 1.0:
    processor_list.append(TopPLogitsWarper(top_p))
if top_k > 0:
    processor_list.append(TopKLogitsWarper(top_k))
```
Greedy vs sampling with top-2 from `fastchat/serve/inference.py:185-191`:
```python
if temperature < 1e-5 or top_p < 1e-8:  # greedy
    _, indices = torch.topk(last_token_logits, 2)
    tokens = [int(index) for index in indices.tolist()]
else:
    probs = torch.softmax(last_token_logits, dim=-1)
    indices = torch.multinomial(probs, num_samples=2)
    tokens = [int(token) for token in indices.tolist()]
```
Sentence completion recovery from `fastchat/serve/inference.py:242-250`:
```python
# TODO: For the issue of incomplete sentences interrupting output, apply a patch
if judge_sent_end and stopped and not is_sentence_complete(output):
    if len(tokens) > 1:
        token = tokens[1]
        output_ids[-1] = token
    else:
        output_ids.pop()
    stopped = False
    sent_interrupt = True
```
MPS workaround from `fastchat/serve/inference.py:181-183`:
```python
if device == "mps":
    # Switch to CPU by avoiding some bugs in mps backend.
    last_token_logits = last_token_logits.float().to("cpu")
```
T5 repetition penalty default from `fastchat/serve/inference.py:383-385`:
```python
# Hardcode T5's default repetition penalty to be 1.2
if is_t5 and repetition_penalty == 1.0:
    repetition_penalty = 1.2
```