Principle:Romsto Speculative Decoding Logits Processing

Knowledge Sources	The Curious Case of Neural Text Degeneration; Hierarchical Neural Story Generation
Domains	NLP, Sampling, Probability_Theory
Last Updated	2026-02-14 04:30 GMT

Overview

A family of token sampling strategies that transform raw model logits into probability distributions and select tokens, including greedy, multinomial, top-k, nucleus (top-p), and combined top-k/nucleus methods.

Description

Logits Processing encompasses the techniques used to convert a language model's raw output logits into a probability distribution and then sample a token from that distribution. The choice of sampling strategy profoundly affects the quality, diversity, and coherence of generated text.

The key strategies are:

Greedy decoding: Always selects the highest-probability token. Deterministic but can lead to repetitive, degenerate text.
Multinomial sampling: Samples from the full distribution in proportion to each token's probability, after scaling the logits by a temperature parameter. Higher temperature flattens the distribution and increases diversity.
Top-k sampling: Restricts the candidate set to the k highest-probability tokens before sampling. Prevents sampling from the long tail of unlikely tokens.
Nucleus (top-p) sampling: Dynamically selects the smallest set of tokens whose cumulative probability reaches or exceeds the threshold p. The candidate set adapts to the shape of the distribution: it is small when the model is confident and large when the distribution is flat.
Top-k + Nucleus: Applies top-k filtering first, then nucleus filtering, combining both truncation methods.
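
The difference between deterministic and stochastic selection can be illustrated with a small sketch in plain Python. The function names `greedy` and `multinomial` and the toy distribution are illustrative assumptions, not part of the referenced implementation.

```python
import random
from collections import Counter

# Toy next-token distribution over a 4-token vocabulary (assumed values).
probs = [0.6, 0.25, 0.1, 0.05]

def greedy(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)

def multinomial(probs, rng):
    """Draw one token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)

# Greedy decoding returns the same token on every call.
greedy_picks = {greedy(probs) for _ in range(5)}

# Multinomial sampling visits every token, roughly in proportion to probs.
counts = Counter(multinomial(probs, rng) for _ in range(10_000))
```

Greedy selection never leaves index 0 here, while the multinomial counts spread across all four tokens, which is the diversity the truncation methods above then constrain.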

All strategies share a common interface: they accept logits, optionally filter out low-probability tokens, apply a temperature-scaled softmax, and then select a token from the resulting distribution.

Usage

Use this principle when you generate text from a language model and need to control the trade-off between output quality and diversity. Greedy decoding is appropriate for tasks requiring deterministic output (e.g., factual Q&A). Nucleus sampling is preferred for creative text generation where diversity is valued. The choice of strategy also affects speculative decoding: the drafter and target models must use the same sampling strategy, so that rejection sampling compares two distributions produced by the same processing.
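
To show why the two distributions must be comparable, here is a minimal sketch of the acceptance rule commonly used in speculative sampling: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). The function name `accept_or_resample` and the toy distributions are hypothetical, and the sketch assumes `q` (drafter) and `p` (target) were both obtained with the same logits processing.

```python
import random

def accept_or_resample(draft_token, q, p, rng):
    """Return (token, accepted) for one drafted token.

    q: drafter probabilities, p: target probabilities, both lists of floats
    produced by the same processing and temperature (assumption).
    """
    # Accept the drafted token with probability min(1, p/q).
    ratio = p[draft_token] / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False

q = [0.5, 0.3, 0.2]  # toy drafter distribution
p = [0.2, 0.5, 0.3]  # toy target distribution
rng = random.Random(0)

# Token 1 is more likely under the target than the drafter: always accepted.
always_accepted = accept_or_resample(1, q, p, rng)
# Token 0 is over-proposed by the drafter: accepted only some of the time.
results = [accept_or_resample(0, q, p, rng) for _ in range(200)]
```

If the drafter sampled from a filtered distribution while the target used the unfiltered one, the ratio p/q would no longer describe the tokens actually proposed, which is the mismatch the paragraph above warns about.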

Theoretical Basis

All logits processors follow a two-stage pipeline:

Process: Transform raw logits (optionally filtering low-probability tokens)
Sample: Convert processed logits to probabilities via temperature-scaled softmax, then select a token

$\text{probs} = \operatorname{softmax}\left(\frac{\operatorname{process}(\text{logits})}{T}\right)$

Where T is the temperature parameter.
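
The two-stage pipeline can be sketched as a small class hierarchy. The class names `LogitsProcessor` and `GreedyProcessor`, their method signatures, and the toy logits are hypothetical and chosen only for illustration; the linked implementation may be organized differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

class LogitsProcessor:
    """Hypothetical base class: stage 1 processes logits, stage 2 samples."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def process(self, logits):
        # Stage 1: no filtering by default; subclasses may mask tokens with -inf.
        return list(logits)

    def probs(self, logits):
        # probs = softmax(process(logits) / T)
        return softmax(self.process(logits), self.temperature)

    def sample(self, logits, rng):
        # Stage 2: multinomial draw from the processed distribution.
        p = self.probs(logits)
        return rng.choices(range(len(p)), weights=p, k=1)[0]

class GreedyProcessor(LogitsProcessor):
    def sample(self, logits, rng=None):
        # Stage 2 override: pick the most probable token deterministically.
        p = self.probs(logits)
        return max(range(len(p)), key=p.__getitem__)

logits = [2.0, 1.0, 0.0]  # toy values
sharp = LogitsProcessor(temperature=0.5).probs(logits)  # low T: peaked
flat = LogitsProcessor(temperature=2.0).probs(logits)   # high T: flatter
greedy_token = GreedyProcessor().sample(logits)
sampled_token = LogitsProcessor().sample(logits, random.Random(0))
```

Lowering T concentrates probability on the top token, while raising T moves mass toward the tail; filtering strategies only need to override `process`.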

Top-k filtering sets all logits below the k-th highest value to $-\infty$:

# Abstract top-k filtering (PyTorch-style sketch)
import torch
threshold = torch.topk(logits, k).values[..., -1, None]  # k-th highest logit
logits = logits.masked_fill(logits < threshold, float("-inf"))
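
As a dependency-free illustration of the same idea, the following sketch applies top-k filtering to a plain Python list. The function name `top_k_filter` and the toy logits are assumptions for this example; it also assumes k is at least 1.

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (assumes k >= 1)."""
    if k >= len(logits):
        return list(logits)  # nothing to filter
    # With zero-based indexing, the k-th highest value sits at position k - 1.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

filtered = top_k_filter([1.0, 3.0, 0.5, 2.0], k=2)
# filtered == [-inf, 3.0, -inf, 2.0]
```

Because the comparison is against a threshold value, logits tied with the k-th highest all survive, so slightly more than k tokens can remain when ties occur.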

Nucleus filtering finds the smallest set of tokens with cumulative probability >= p:

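# Abstract nucleus filtering (PyTorch-style sketch)
import torch
sorted_probs, sorted_idx = torch.sort(torch.softmax(logits, dim=-1), descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
# Drop a token only if the mass before it already reaches top_p,
# so the token that crosses the threshold stays in the nucleus
sorted_mask = (cumulative - sorted_probs) >= top_p
mask = sorted_mask.scatter(-1, sorted_idx, sorted_mask)  # restore original order
logits = logits.masked_fill(mask, float("-inf"))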
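
A worked example in plain Python makes the boundary behaviour concrete, and also shows the combined top-k + nucleus variant applied in the order described above (top-k first, then nucleus). The function names and toy probabilities are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (assumes k >= 1)."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def top_p_filter(logits, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=probs.__getitem__, reverse=True)
    filtered = [-math.inf] * len(logits)
    cumulative = 0.0
    for i in order:
        filtered[i] = logits[i]  # the token that crosses the threshold is kept
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return filtered

def top_k_top_p_filter(logits, k, top_p):
    """Top-k first, then nucleus filtering on the surviving tokens."""
    return top_p_filter(top_k_filter(logits, k), top_p)

def kept(filtered):
    """Indices of tokens that survived filtering."""
    return [i for i, x in enumerate(filtered) if x != -math.inf]

# Toy logits whose softmax is approximately [0.5, 0.3, 0.15, 0.05].
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]

nucleus_07 = kept(top_p_filter(logits, 0.7))   # cumulative 0.5, then 0.8
nucleus_04 = kept(top_p_filter(logits, 0.4))   # first token already reaches 0.4
combined = kept(top_k_top_p_filter(logits, k=3, top_p=0.8))
```

With `top_p=0.4` the single most probable token is kept even though it alone exceeds the threshold, which is why the mask must be computed from the mass accumulated before each token rather than including it.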

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_LogitsProcessor_Hierarchy
