Principle:Romsto Speculative Decoding Logits Processing
| Knowledge Sources | |
|---|---|
| Domains | NLP, Sampling, Probability_Theory |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
A family of token sampling strategies that transform raw model logits into probability distributions and select tokens, including greedy, multinomial, top-k, nucleus (top-p), and combined top-k/nucleus methods.
Description
Logits Processing encompasses the techniques used to convert a language model's raw output logits into a probability distribution and then sample a token from that distribution. The choice of sampling strategy profoundly affects the quality, diversity, and coherence of generated text.
The key strategies are:
- Greedy decoding: Always selects the highest-probability token. Deterministic but can lead to repetitive, degenerate text.
- Multinomial sampling: Draws a token from the full temperature-scaled softmax distribution, so each token is selected in proportion to its probability. Higher temperature flattens the distribution and increases diversity.
- Top-k sampling: Restricts the candidate set to the k highest-probability tokens before sampling. Prevents sampling from the long tail of unlikely tokens.
- Nucleus (top-p) sampling: Dynamically selects the smallest set of tokens whose cumulative probability reaches threshold p. The candidate set adapts to the shape of the distribution: small when the model is confident, larger when probability mass is spread over many tokens.
- Top-k + Nucleus: Applies top-k filtering first, then nucleus filtering, combining both truncation methods.
All strategies share a common interface: they accept logits, optionally filter low-probability tokens, apply a temperature-scaled softmax, and then select a token from the resulting distribution.
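To make the effect of temperature concrete, here is a minimal sketch using only the Python standard library. The function names `softmax` and `greedy` and the logit values are illustrative assumptions, not part of any referenced implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the largest logit; temperature does not change the argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.0]                 # illustrative values, not real model output
sharp = softmax(logits, temperature=0.5)  # low T concentrates mass on the top token
flat = softmax(logits, temperature=2.0)   # high T spreads mass more evenly
```

With these inputs, the top token receives more probability under the low temperature than under the high one, while `greedy(logits)` returns the same index in both cases.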
Usage
Use this principle when you generate text from a language model and need to control the trade-off between output quality and diversity. Greedy decoding is appropriate for tasks requiring deterministic output (e.g., factual Q&A). Nucleus sampling is preferred for creative text generation where diversity is valued. The choice of strategy also affects speculative decoding: both the drafter and target models must use the same sampling strategy for correct rejection sampling.
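The dependence on a shared sampling strategy can be illustrated with the standard speculative-sampling acceptance test, which compares the processed probabilities of the target and the drafter for the drafted token. The sketch below is an assumption about how such a test is usually written, not a copy of any particular implementation; `accept_or_resample` and its parameters are hypothetical names.

```python
import random

def accept_or_resample(token, target_probs, draft_probs, rng=random):
    """Accept the drafted token with probability min(1, p/q); otherwise
    resample from the normalized residual max(0, p - q).

    Both probability lists are assumed to come from the SAME logits
    processing (temperature, top-k, top-p) applied to each model.
    """
    p, q = target_probs[token], draft_probs[token]
    # q > 0 is assumed because the drafter actually sampled this token
    if rng.random() < min(1.0, p / q):
        return token, True
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    total = sum(residual)  # positive whenever the token was rejected
    weights = [r / total for r in residual]
    return rng.choices(range(len(weights)), weights=weights)[0], False
```

If the two models filtered their logits differently, `p` and `q` would describe different truncated distributions and the ratio would no longer correct the drafter toward the target distribution.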
Theoretical Basis
All logits processors follow a two-stage pipeline:
- Process: Transform raw logits (optionally filtering low-probability tokens)
- Sample: Convert processed logits to probabilities via temperature-scaled softmax, then select a token
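Given processed logits $z_i$, the probability of token $i$ is

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ is the temperature parameter.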
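The two-stage pipeline can be sketched as a small class with a filtering hook and a sampling step. The class and method names below (`LogitsProcessor`, `process`, `probs`, `sample`, `GreedyProcessor`) are hypothetical and chosen for illustration only.

```python
import math
import random

class LogitsProcessor:
    """Hypothetical base class: process() filters, sample() selects a token."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def process(self, logits):
        # Stage 1: default is no filtering; subclasses may mask tokens with -inf
        return list(logits)

    def probs(self, logits):
        # Stage 2a: temperature-scaled softmax over the processed logits
        scaled = [z / self.temperature for z in self.process(logits)]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, logits, rng=random):
        # Stage 2b: multinomial draw from the resulting distribution
        p = self.probs(logits)
        return rng.choices(range(len(p)), weights=p)[0]

class GreedyProcessor(LogitsProcessor):
    def sample(self, logits, rng=random):
        # Greedy selection replaces the random draw with an argmax
        p = self.probs(logits)
        return max(range(len(p)), key=lambda i: p[i])
```

A filtering strategy such as top-k or nucleus would, under this design, only need to override `process`.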
Top-k filtering sets all logits below the k-th highest value to -inf:
# Abstract top-k filtering
threshold = sorted(logits, descending=True)[k - 1]  # k-th highest value (0-based index k - 1)
logits[logits < threshold] = -inf
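A runnable version of the top-k step, written as a minimal standard-library sketch. The function name `top_k_filter` and the choice to return the input unchanged when `k` is out of range are assumptions made for this example.

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf.

    Logits tied with the k-th highest value are all kept.
    """
    if k <= 0 or k >= len(logits):
        return list(logits)  # assumption: an out-of-range k means no filtering
    threshold = sorted(logits, reverse=True)[k - 1]  # k-th highest value
    return [z if z >= threshold else -math.inf for z in logits]

filtered = top_k_filter([1.0, 3.0, 0.5, 2.0], k=2)
```

Here only the logits 3.0 and 2.0 survive; the masked entries receive zero probability after the softmax.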
Nucleus filtering finds the smallest set of tokens with cumulative probability >= p:
# Abstract nucleus filtering
sorted_probs = sort(softmax(logits), descending=True)
cumulative = cumsum(sorted_probs)
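# Remove a token only if the mass of the tokens ranked before it already reaches top_p,
# so the token that crosses the threshold is kept and the candidate set is never empty
mask = (cumulative - sorted_probs) >= top_p
logits[mask] = -inf  # after mapping the mask back to the original token order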
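The nucleus step, and its composition with top-k, can be sketched in plain Python as follows. The names `nucleus_filter` and `top_k_then_nucleus` are hypothetical, and the example assumes `1 <= k <= len(logits)` and `top_p > 0`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(logits, top_p):
    """Keep the smallest set of top-ranked tokens whose mass reaches top_p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    filtered = [-math.inf] * len(logits)
    mass_before = 0.0  # probability mass of the tokens already kept
    for i in order:
        if mass_before >= top_p:
            break  # the kept set already reaches the threshold
        filtered[i] = logits[i]
        mass_before += probs[i]
    return filtered

def top_k_then_nucleus(logits, k, top_p):
    """Apply top-k first, then nucleus filtering on the surviving tokens."""
    threshold = sorted(logits, reverse=True)[k - 1]
    kept = [z if z >= threshold else -math.inf for z in logits]
    return nucleus_filter(kept, top_p)

# Illustrative logits whose softmax is approximately [0.5, 0.3, 0.2]
logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
nucleus = nucleus_filter(logits, top_p=0.7)
combined = top_k_then_nucleus(logits, k=2, top_p=0.5)
```

In the combined variant the nucleus threshold is measured against the distribution renormalized over the top-k survivors, which is one consequence of applying top-k first.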