# Heuristic: LMSYS FastChat Greedy Decoding Temperature Threshold
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
Inference sampling heuristic that uses `temperature < 1e-5` or `top_p < 1e-8` as the threshold for greedy decoding, samples top-2 tokens for sentence completion recovery, and applies a T5-specific repetition penalty default of 1.2.
## Description
FastChat's inference engine uses several numeric thresholds and fallback mechanisms for robust text generation. Temperature values below 1e-5 are treated as greedy (argmax) decoding because `TemperatureLogitsWarper` doesn't accept 0.0. The engine always samples top-2 tokens, even in greedy mode, to enable a sentence completion recovery mechanism: if generation stops mid-sentence (EOS token produced), it can substitute the second-best token to continue the sentence. MPS devices require an additional workaround where logits are moved to CPU for sampling to avoid MPS backend bugs.
## Usage
Use this heuristic when configuring inference parameters or debugging unexpected generation behavior (truncated outputs, repetitive text). The thresholds and recovery mechanisms handle edge cases in production serving.
## The Insight (Rule of Thumb)
- Temperature threshold: `temperature < 1e-5` triggers greedy decoding; temperatures at or above `1e-5` (other than exactly `1.0`) apply temperature scaling; `temperature == 1.0` is a no-op and is skipped.
- Top-p threshold: `top_p < 1e-8` also triggers greedy decoding; `1e-8 <= top_p < 1.0` applies nucleus sampling.
- Top-2 sampling: Always sample 2 tokens (greedy: `torch.topk(2)`; sampling: `torch.multinomial(num_samples=2)`) to enable sentence completion fallback.
- T5 repetition penalty: Hardcoded default of 1.2 for T5 models when no penalty is specified.
- MPS workaround: Move logits to CPU as float32 before sampling to avoid MPS backend bugs.
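The threshold logic above can be condensed into a small pure-Python sketch. The function names are illustrative, not FastChat's API; the numeric comparisons mirror the rules listed above.

```python
def decoding_mode(temperature: float, top_p: float) -> str:
    """Classify the decoding mode using FastChat's numeric thresholds:
    temperature < 1e-5 or top_p < 1e-8 -> greedy (argmax); otherwise sampling."""
    if temperature < 1e-5 or top_p < 1e-8:
        return "greedy"
    return "sample"


def active_warpers(temperature: float, top_p: float, top_k: int,
                   repetition_penalty: float) -> list:
    """List which logits processors would be installed (illustrative labels)."""
    processors = []
    # temperature == 1.0 is a no-op, and 0.0 is rejected by the HF warper,
    # so both cases are skipped.
    if temperature >= 1e-5 and temperature != 1.0:
        processors.append("temperature")
    if repetition_penalty > 1.0:
        processors.append("repetition_penalty")
    if 1e-8 <= top_p < 1.0:
        processors.append("top_p")
    if top_k > 0:
        processors.append("top_k")
    return processors
```

Note that `temperature = 0.7, top_p = 1e-9` still selects greedy decoding: either threshold alone is enough.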
## Reasoning
Temperature thresholds: HuggingFace's `TemperatureLogitsWarper` raises an error on `temperature=0.0`. Using `1e-5` as the cutoff provides a practical boundary between "deterministic" and "stochastic" generation while avoiding the API limitation.
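A quick worked example shows why temperatures near the cutoff are effectively deterministic anyway: temperature-scaled softmax collapses to a one-hot (argmax) distribution as the temperature approaches zero. This is a pure-Python illustration, not FastChat code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; smaller temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
# Near the 1e-5 cutoff the distribution is effectively one-hot (argmax):
sharp = softmax_with_temperature(logits, 1e-5)
# At temperature 1.0 the distribution is the plain softmax:
plain = softmax_with_temperature(logits, 1.0)
```

So treating sub-`1e-5` temperatures as pure argmax changes essentially nothing in the output distribution while sidestepping the `temperature=0.0` API error.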
Top-2 fallback: The sentence completion hack addresses a common LLM failure mode where the model emits EOS prematurely, cutting off a sentence. By always having a backup token, the system can substitute it and continue generating. The `judge_sent_end` parameter controls this behavior, and it uses the `is_sentence_complete()` utility to detect incomplete sentences.
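A minimal sketch of that recovery path, using a hypothetical `is_sentence_complete` that only checks for terminal punctuation (FastChat's real utility handles more cases):

```python
def is_sentence_complete(output: str) -> bool:
    # Hypothetical stand-in for FastChat's utility: terminal punctuation only.
    return output.rstrip().endswith((".", "!", "?", '"'))


def recover_sentence(output, output_ids, tokens, stopped, judge_sent_end=True):
    """If generation stopped mid-sentence, swap in the second-best token
    and clear the stop flag so decoding continues."""
    sent_interrupt = False
    if judge_sent_end and stopped and not is_sentence_complete(output):
        if len(tokens) > 1:
            output_ids[-1] = tokens[1]   # substitute the runner-up token
        else:
            output_ids.pop()             # no backup token: drop the EOS and retry
        stopped = False
        sent_interrupt = True
    return output_ids, stopped, sent_interrupt
```

When the output already ends a sentence, the function leaves `stopped` set and generation terminates normally.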
T5 default penalty: T5 (encoder-decoder) models tend to be more repetitive than decoder-only models at the default `repetition_penalty=1.0`, so FastChat overrides it to 1.2.
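For context, the CTRL-style repetition penalty that `RepetitionPenaltyLogitsProcessor` implements divides positive logits of previously generated tokens by the penalty and multiplies negative ones, pushing repeated tokens down either way. A pure-Python sketch of that semantics (not the HF implementation):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize previously generated tokens, CTRL-style: positive logits
    are divided by the penalty, negative ones multiplied, so both moves
    lower the repeated token's score."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

With `penalty=1.2`, a repeated token's logit of `2.0` drops to about `1.67`, which is usually enough to break T5's repetition loops without distorting the distribution much.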
MPS to CPU: Apple's MPS backend has known bugs with certain tensor operations. Moving logits to CPU for the final sampling step avoids these issues at minimal performance cost.
## Code Evidence
Temperature and top_p thresholds from `fastchat/serve/inference.py:49-57`:
```python
# TemperatureLogitsWarper doesn't accept 0.0, 1.0 makes it a no-op so we skip two cases.
if temperature >= 1e-5 and temperature != 1.0:
    processor_list.append(TemperatureLogitsWarper(temperature))
if repetition_penalty > 1.0:
    processor_list.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
if 1e-8 <= top_p < 1.0:
    processor_list.append(TopPLogitsWarper(top_p))
if top_k > 0:
    processor_list.append(TopKLogitsWarper(top_k))
```
Greedy vs sampling with top-2 from `fastchat/serve/inference.py:185-191`:
```python
if temperature < 1e-5 or top_p < 1e-8:  # greedy
    _, indices = torch.topk(last_token_logits, 2)
    tokens = [int(index) for index in indices.tolist()]
else:
    probs = torch.softmax(last_token_logits, dim=-1)
    indices = torch.multinomial(probs, num_samples=2)
    tokens = [int(token) for token in indices.tolist()]
```
Sentence completion recovery from `fastchat/serve/inference.py:242-250`:
```python
# TODO: For the issue of incomplete sentences interrupting output, apply a patch
if judge_sent_end and stopped and not is_sentence_complete(output):
    if len(tokens) > 1:
        token = tokens[1]
        output_ids[-1] = token
    else:
        output_ids.pop()
    stopped = False
    sent_interrupt = True
```
MPS workaround from `fastchat/serve/inference.py:181-183`:
```python
if device == "mps":
    # Switch to CPU by avoiding some bugs in mps backend.
    last_token_logits = last_token_logits.float().to("cpu")
```
T5 repetition penalty default from `fastchat/serve/inference.py:383-385`:
```python
# Hardcode T5's default repetition penalty to be 1.2
if is_t5 and repetition_penalty == 1.0:
    repetition_penalty = 1.2
```