Heuristic:Lucidrains X transformers Sampling Temperature Strategy
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Guidelines for selecting and tuning temperature and sampling strategies during autoregressive text generation in x-transformers.
Description
x-transformers provides multiple sampling strategies that can be combined: temperature scaling, top-k, top-p (nucleus), top-a, min-p, and contrastive decoding. Each has default values tuned for different generation behaviors. This heuristic captures the recommended defaults and their trade-offs.
Usage
Use this heuristic when configuring the `generate()` method on `AutoregressiveWrapper` or `NonAutoregressiveWrapper` and choosing between sampling strategies for different quality/diversity trade-offs.
The Insight (Rule of Thumb)
- Temperature:
  - Default: `1.0` (no scaling)
  - Lower (0.1-0.7): more deterministic, higher-quality output for factual tasks
  - Higher (1.0-1.5): more diverse, creative output
  - Minimum clamp: `1e-6` (autoregressive) or `1e-3` (non-autoregressive) to prevent division by zero
- Top-K: set `filter_thres` on `generate()`; keeps only the top `(1 - filter_thres)` fraction of the vocabulary, so higher values prune more aggressively
  - Non-autoregressive default: `filter_thres=0.7` (keep roughly the top 30%)
- Min-P (recommended for quality):
  - Default: `min_p=0.1` (keep tokens with probability >= 10% of the maximum)
  - Adapts dynamically to the confidence level of each prediction
- Top-A:
  - Defaults: `min_p_pow=2.0`, `min_p_ratio=0.02`
  - More aggressive pruning than min-p
- Contrastive Decoding:
  - Defaults: `alpha=0.1`, `beta=0.5`
  - Requires both an expert and an amateur model
- Non-Autoregressive Temperature Annealing:
  - Starts at `start_temperature=1.0` and decreases linearly with the number of remaining steps
  - Formula: `temperature = start_temperature * (steps_remaining / total_steps)`
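As a worked check on the top-k semantics above, here is a pure-Python sketch of the `filter_thres`-to-`k` mapping. The helper name `filter_thres_to_k` is hypothetical; the mapping `k = ceil((1 - filter_thres) * vocab_size)` reflects my reading of the library's top-k filter.

```python
import math

def filter_thres_to_k(filter_thres, vocab_size):
    # Hypothetical helper: higher filter_thres prunes more, since only
    # the top (1 - filter_thres) fraction of the vocabulary survives.
    return math.ceil((1 - filter_thres) * vocab_size)

# With a 50k-token vocabulary:
print(filter_thres_to_k(0.9, 50000))   # 5000 tokens kept (top 10%)
print(filter_thres_to_k(0.75, 50000))  # 12500 tokens kept (top 25%)
```

Under this mapping, the non-autoregressive default of `filter_thres=0.7` keeps roughly the top 30% of the vocabulary per step.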
Reasoning
The Gumbel sampling implementation adds Gumbel noise to the temperature-scaled logits before taking the argmax, which is equivalent to sampling from the softmax distribution (the Gumbel-max trick). Temperature controls the sharpness: as temperature approaches 0, sampling collapses to greedy argmax decoding; at high temperature it approaches uniform random selection.
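A minimal pure-Python sketch of the Gumbel-max trick (standalone, no torch; it mirrors the clamped-temperature logic quoted under Code Evidence, but is an illustration rather than the library's code):

```python
import math, random

def gumbel_noise():
    # Standard Gumbel sample: -log(-log(U)), U ~ Uniform(0, 1)
    u = random.random()
    return -math.log(-math.log(u + 1e-20) + 1e-20)

def gumbel_sample(logits, temperature=1.0, eps=1e-6):
    # Dividing by a clamped temperature, adding Gumbel noise, then taking
    # argmax is equivalent to sampling from softmax(logits / temperature).
    scaled = [l / max(temperature, eps) + gumbel_noise() for l in logits]
    return max(range(len(scaled)), key=scaled.__getitem__)

logits = [2.0, 0.5, -1.0]
# As temperature -> 0 the scaled logits dominate the noise,
# so sampling collapses to greedy argmax:
assert gumbel_sample(logits, temperature=1e-9) == 0
```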
Min-P sampling (arXiv 2407.01082) is particularly effective because its threshold adapts to prediction confidence: when the model is confident (high maximum probability), weaker alternatives are pruned aggressively; when it is uncertain, more options are retained.
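A numeric sketch of this adaptive behavior (pure Python, no torch; `min_p_survivors` is a hypothetical helper that counts how many tokens pass the `prob >= min_p * max_prob` threshold):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_survivors(logits, min_p=0.1):
    probs = softmax(logits)
    limit = min_p * max(probs)  # threshold scales with model confidence
    return sum(p >= limit for p in probs)

confident = [8.0, 1.0, 0.5, 0.2]   # one dominant token
uncertain = [1.0, 0.9, 0.8, 0.7]   # near-flat distribution

# Confident prediction: the high max probability sets a high bar,
# pruning the weak alternatives aggressively.
print(min_p_survivors(confident))  # 1
# Uncertain prediction: the bar is low, so every option survives.
print(min_p_survivors(uncertain))  # 4
```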
For non-autoregressive generation, temperature annealing ensures early steps explore broadly (high temperature) while later steps converge (low temperature), matching the progressive refinement nature of iterative unmasking.
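The annealing schedule can be tabulated directly from the formula above. This is a pure-Python sketch; the step-indexing convention (`steps_until_x0` counting down from `total_steps`) is my assumption, chosen to match the `steps_until_x0 / self.steps` expression quoted under Code Evidence.

```python
def anneal_temperature(step, total_steps, start_temperature=1.0, floor=1e-3):
    # steps_until_x0 counts down from total_steps to 1 (assumed convention)
    steps_until_x0 = total_steps - step
    temperature = start_temperature * (steps_until_x0 / total_steps)
    return max(temperature, floor)  # clamp mirrors max(temperature, 1e-3)

schedule = [anneal_temperature(s, total_steps=4) for s in range(4)]
print(schedule)  # [1.0, 0.75, 0.5, 0.25]
```

Early steps sample broadly at temperature 1.0; the final steps are nearly greedy, with the `1e-3` floor preventing division by zero.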
Code Evidence
Gumbel sampling with temperature floor from `autoregressive_wrapper.py:46-48`:
```python
def gumbel_sample(logits, temperature = 1., eps = 1e-6):
    noise = gumbel_noise(logits)
    return ((logits / max(temperature, eps)) + noise).argmax(dim = -1)
```
Min-P implementation from `autoregressive_wrapper.py:121-125`:
```python
def min_p(logits, min_p = 0.1):
    probs = logits.softmax(dim = -1)
    max_probs = probs.amax(dim = -1, keepdim = True)
    limit = min_p * max_probs
    return torch.where(probs < limit, float('-inf'), logits)
```
Non-autoregressive temperature annealing from `nonautoregressive_wrapper.py:240-245`:
```python
annealing_scale = steps_until_x0 / self.steps
temperature = start_temperature * annealing_scale
probs = (logits / max(temperature, 1e-3)).softmax(dim = -1)
sampled_ids = gumbel_sample(logits, temperature = max(temperature, 1e-3))
```
Contrastive decoding from `autoregressive_wrapper.py:138-152`:
```python
def contrastive_decode_fn(
    expert_logits,
    amateur_logits,
    alpha = 0.1,
    beta = 0.5
):
    cutoff = log(alpha) + expert_logits.amax(dim = -1, keepdim = True)
    diffs = (1 + beta) * expert_logits - beta * amateur_logits
    contrastive_decode_logits = diffs.masked_fill(expert_logits < cutoff, -float('inf'))
    return contrastive_decode_logits
```
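The contrastive cutoff can be traced by hand with a pure-Python sketch that mirrors the function above (no torch; an illustration, not the library's implementation):

```python
import math

def contrastive_decode(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    # Tokens whose expert logit falls more than -log(alpha) below the
    # expert's best logit are masked out entirely; survivors are boosted
    # where the expert outscores the amateur.
    cutoff = math.log(alpha) + max(expert_logits)
    return [
        (1 + beta) * e - beta * a if e >= cutoff else float('-inf')
        for e, a in zip(expert_logits, amateur_logits)
    ]

expert = [3.0, 2.0, -1.0]
amateur = [2.0, 2.0, 2.0]
# cutoff = log(0.1) + 3.0 ~= 0.697, so the third token is masked:
print(contrastive_decode(expert, amateur))  # [3.5, 2.0, -inf]
```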
Related Pages
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_BeliefStateWrapper
- Implementation:Lucidrains_X_transformers_XLAutoregressiveWrapper
- Principle:Lucidrains_X_transformers_Autoregressive_Text_Generation
- Principle:Lucidrains_X_transformers_Iterative_Masked_Generation
- Principle:Lucidrains_X_transformers_Belief_State_Training
- Principle:Lucidrains_X_transformers_Segment_Level_Recurrence