Heuristic:Lucidrains X transformers Sampling Temperature Strategy
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Guidelines for selecting and tuning temperature and sampling strategies during autoregressive text generation in x-transformers.
Description
x-transformers provides multiple sampling strategies that can be combined: temperature scaling, top-k, top-p (nucleus), top-a, min-p, and contrastive decoding. Each has default values tuned for different generation behaviors. This heuristic captures the recommended defaults and their trade-offs.
Usage
Use this heuristic when configuring the `generate()` method on `AutoregressiveWrapper` or `NonAutoregressiveWrapper` and choosing between sampling strategies for different quality/diversity trade-offs.
The Insight (Rule of Thumb)
- Temperature:
  - Default: `1.0` (no scaling)
  - Lower (0.1-0.7): more deterministic, higher-quality output for factual tasks
  - Higher (1.0-1.5): more diverse, creative output
  - Minimum clamp: `1e-6` (autoregressive) or `1e-3` (non-autoregressive) to prevent division by zero
- Top-K: set `filter_thres` on `generate()`; keeps only the top `(1 - filter_thres)` fraction of the vocabulary, so higher values prune more aggressively
  - Non-autoregressive default: `filter_thres=0.7` (keep roughly the top 30%)
- Min-P (recommended for quality):
  - Default: `min_p=0.1` (keep tokens with probability >= 10% of the maximum)
  - Adapts dynamically to the confidence level of each prediction
- Top-A:
  - Defaults: `min_p_pow=2.0`, `min_p_ratio=0.02`
  - More aggressive pruning than min-p
- Contrastive Decoding:
  - Defaults: `alpha=0.1`, `beta=0.5`
  - Requires both an expert and an amateur model
- Non-Autoregressive Temperature Annealing:
  - Starts at `start_temperature=1.0` and decreases linearly with the number of remaining steps
  - Formula: `temperature = start_temperature * (steps_remaining / total_steps)`
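As a worked check on the top-k semantics above, here is a pure-Python sketch of the `filter_thres`-to-`k` mapping. The helper name `filter_thres_to_k` is hypothetical; the mapping `k = ceil((1 - filter_thres) * vocab_size)` reflects my reading of the library's top-k filter.

```python
import math

def filter_thres_to_k(filter_thres, vocab_size):
    # Hypothetical helper: higher filter_thres prunes more, since only
    # the top (1 - filter_thres) fraction of the vocabulary survives.
    return math.ceil((1 - filter_thres) * vocab_size)

# With a 50k-token vocabulary:
print(filter_thres_to_k(0.9, 50000))   # 5000 tokens kept (top 10%)
print(filter_thres_to_k(0.75, 50000))  # 12500 tokens kept (top 25%)
```

Under this mapping, the non-autoregressive default of `filter_thres=0.7` keeps roughly the top 30% of the vocabulary per step.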
Reasoning
The Gumbel sampling implementation adds Gumbel noise to the temperature-scaled logits before taking the argmax, which is equivalent to sampling from the softmax distribution (the Gumbel-max trick). Temperature controls the sharpness: as temperature approaches 0, sampling collapses to greedy argmax decoding; at high temperature it approaches uniform random selection.
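A minimal pure-Python sketch of the Gumbel-max trick (standalone, no torch; it mirrors the clamped-temperature logic quoted under Code Evidence, but is an illustration rather than the library's code):

```python
import math, random

def gumbel_noise():
    # Standard Gumbel sample: -log(-log(U)), U ~ Uniform(0, 1)
    u = random.random()
    return -math.log(-math.log(u + 1e-20) + 1e-20)

def gumbel_sample(logits, temperature=1.0, eps=1e-6):
    # Dividing by a clamped temperature, adding Gumbel noise, then taking
    # argmax is equivalent to sampling from softmax(logits / temperature).
    scaled = [l / max(temperature, eps) + gumbel_noise() for l in logits]
    return max(range(len(scaled)), key=scaled.__getitem__)

logits = [2.0, 0.5, -1.0]
# As temperature -> 0 the scaled logits dominate the noise,
# so sampling collapses to greedy argmax:
assert gumbel_sample(logits, temperature=1e-9) == 0
```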
Min-P sampling (arXiv 2407.01082) is particularly effective because its threshold adapts to prediction confidence: when the model is confident (high maximum probability), weaker alternatives are pruned aggressively; when it is uncertain, more options are retained.
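A numeric sketch of this adaptive behavior (pure Python, no torch; `min_p_survivors` is a hypothetical helper that counts how many tokens pass the `prob >= min_p * max_prob` threshold):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_survivors(logits, min_p=0.1):
    probs = softmax(logits)
    limit = min_p * max(probs)  # threshold scales with model confidence
    return sum(p >= limit for p in probs)

confident = [8.0, 1.0, 0.5, 0.2]   # one dominant token
uncertain = [1.0, 0.9, 0.8, 0.7]   # near-flat distribution

# Confident prediction: the high max probability sets a high bar,
# pruning the weak alternatives aggressively.
print(min_p_survivors(confident))  # 1
# Uncertain prediction: the bar is low, so every option survives.
print(min_p_survivors(uncertain))  # 4
```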
For non-autoregressive generation, temperature annealing ensures early steps explore broadly (high temperature) while later steps converge (low temperature), matching the progressive refinement nature of iterative unmasking.
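The annealing schedule can be tabulated directly from the formula above. This is a pure-Python sketch; the step-indexing convention (`steps_until_x0` counting down from `total_steps`) is my assumption, chosen to match the `steps_until_x0 / self.steps` expression quoted under Code Evidence.

```python
def anneal_temperature(step, total_steps, start_temperature=1.0, floor=1e-3):
    # steps_until_x0 counts down from total_steps to 1 (assumed convention)
    steps_until_x0 = total_steps - step
    temperature = start_temperature * (steps_until_x0 / total_steps)
    return max(temperature, floor)  # clamp mirrors max(temperature, 1e-3)

schedule = [anneal_temperature(s, total_steps=4) for s in range(4)]
print(schedule)  # [1.0, 0.75, 0.5, 0.25]
```

Early steps sample broadly at temperature 1.0; the final steps are nearly greedy, with the `1e-3` floor preventing division by zero.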
Code Evidence
Gumbel sampling with temperature floor from `autoregressive_wrapper.py:46-48`:
```python
def gumbel_sample(logits, temperature = 1., eps = 1e-6):
    noise = gumbel_noise(logits)
    return ((logits / max(temperature, eps)) + noise).argmax(dim = -1)
```
Min-P implementation from `autoregressive_wrapper.py:121-125`:
```python
def min_p(logits, min_p = 0.1):
    probs = logits.softmax(dim = -1)
    max_probs = probs.amax(dim = -1, keepdim = True)
    limit = min_p * max_probs
    return torch.where(probs < limit, float('-inf'), logits)
```
Non-autoregressive temperature annealing from `nonautoregressive_wrapper.py:240-245`:
```python
annealing_scale = steps_until_x0 / self.steps
temperature = start_temperature * annealing_scale
probs = (logits / max(temperature, 1e-3)).softmax(dim = -1)
sampled_ids = gumbel_sample(logits, temperature = max(temperature, 1e-3))
```
Contrastive decoding from `autoregressive_wrapper.py:138-152`:
```python
def contrastive_decode_fn(
    expert_logits,
    amateur_logits,
    alpha = 0.1,
    beta = 0.5
):
    cutoff = log(alpha) + expert_logits.amax(dim = -1, keepdim = True)
    diffs = (1 + beta) * expert_logits - beta * amateur_logits
    contrastive_decode_logits = diffs.masked_fill(expert_logits < cutoff, -float('inf'))
    return contrastive_decode_logits
```
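The contrastive cutoff can be traced by hand with a pure-Python sketch that mirrors the function above (no torch; an illustration, not the library's implementation):

```python
import math

def contrastive_decode(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    # Tokens whose expert logit falls more than -log(alpha) below the
    # expert's best logit are masked out entirely; survivors are boosted
    # where the expert outscores the amateur.
    cutoff = math.log(alpha) + max(expert_logits)
    return [
        (1 + beta) * e - beta * a if e >= cutoff else float('-inf')
        for e, a in zip(expert_logits, amateur_logits)
    ]

expert = [3.0, 2.0, -1.0]
amateur = [2.0, 2.0, 2.0]
# cutoff = log(0.1) + 3.0 ~= 0.697, so the third token is masked:
print(contrastive_decode(expert, amateur))  # [3.5, 2.0, -inf]
```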
Related Pages
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_BeliefStateWrapper
- Implementation:Lucidrains_X_transformers_XLAutoregressiveWrapper
- Principle:Lucidrains_X_transformers_Autoregressive_Text_Generation
- Principle:Lucidrains_X_transformers_Iterative_Masked_Generation
- Principle:Lucidrains_X_transformers_Belief_State_Training
- Principle:Lucidrains_X_transformers_Segment_Level_Recurrence