Principle: Lucidrains x-transformers Autoregressive Text Generation
Metadata
| Field | Value |
|---|---|
| Papers | Attention Is All You Need, Contrastive Decoding, Truncation Sampling (min-p) |
| Repository | x-transformers |
| Domains | Deep_Learning, NLP, Inference |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Sequential token generation strategy that produces text by sampling one token at a time from a trained autoregressive language model, conditioned on all previously generated tokens.
Description
After training, autoregressive generation produces sequences token-by-token. At each step, the model outputs a probability distribution over the vocabulary conditioned on the prompt plus all previously generated tokens. A token is sampled from this distribution (with optional filtering and temperature scaling) and appended to the sequence. This process repeats until a stop condition is met (maximum length or EOS token).
Various sampling strategies control the trade-off between quality and diversity:
- Temperature scaling — adjusts the sharpness of the probability distribution
- Top-k filtering — retains only the k highest-probability tokens
- Top-p (nucleus) sampling — retains the smallest set of tokens whose cumulative probability meets a threshold
- Min-p sampling — retains tokens whose probability is at least a fraction of the most likely token
- Top-a sampling — retains tokens whose probability exceeds an adaptive threshold that scales with the square of the maximum token probability
- Contrastive decoding — uses an amateur model to subtract out undesirable patterns
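The filtering strategies above can be sketched in plain Python. These helper names and the list-based representation are illustrative; a real implementation would operate on tensors of logits.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by 1/T, then normalize; T < 1 sharpens, T > 1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, renormalize the rest to 1.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest prefix (by descending probability) whose mass >= p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    masked = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(masked)
    return [q / total for q in masked]

def min_p_filter(probs, min_p):
    # Keep tokens whose probability is at least min_p * max(probs).
    cutoff = min_p * max(probs)
    kept = [q if q >= cutoff else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept]
```

Note how top-p and min-p adapt to the shape of the distribution: when the model is confident, top-p keeps very few tokens, and min-p's cutoff rises with the peak probability.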
KV caching speeds up generation by reusing previous key-value computations, avoiding redundant processing of the entire context at every step.
Usage
Use after training to generate text from prompts. Choose the sampling strategy based on the requirements of the application:
- temperature=0 for greedy, deterministic output
- top-k for bounded randomness with a fixed vocabulary window
- top-p for dynamic vocabulary truncation that adapts to the model's confidence
- Contrastive decoding for higher-quality output by contrasting expert and amateur models
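The greedy-versus-sampled choice above can be captured in one small helper. The function name and structure are illustrative, not the x-transformers API; temperature=0 is treated as argmax by convention, since dividing logits by zero is undefined.

```python
import math
import random

def select_token(logits, temperature=0.0, rng=None):
    # temperature == 0 -> greedy argmax (deterministic output).
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from softmax(logits / T).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return (rng or random).choices(range(len(logits)), weights=probs)[0]
```

A top-k or top-p filter would be applied to the logits before this selection step.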
Theoretical Basis
Autoregressive generation:
x_t ~ P(x_t | x_1, x_2, ..., x_{t-1})
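The sampling loop implied by this formula can be sketched with a toy stand-in for the model. The LOGITS table below is a hypothetical bigram lookup, not a trained network; a real transformer would condition on the whole prefix, but the loop structure is the same.

```python
# Toy "model": fixed next-token logits per previous token, over a 4-token
# vocabulary {0: "a", 1: "b", 2: "c", 3: <eos>}. Purely illustrative.
LOGITS = {
    0: [0.1, 2.0, 0.5, 0.2],
    1: [0.3, 0.1, 2.5, 0.4],
    2: [0.2, 0.3, 0.1, 3.0],
    3: [0.0, 0.0, 0.0, 5.0],
}
EOS = 3

def generate(prompt, max_len):
    # x_t ~ P(x_t | x_1, ..., x_{t-1}); greedy selection for determinism
    # in this sketch -- swap in temperature/top-k/top-p sampling as desired.
    seq = list(prompt)
    while len(seq) < max_len:
        logits = LOGITS[seq[-1]]
        nxt = max(range(len(logits)), key=lambda i: logits[i])
        seq.append(nxt)
        if nxt == EOS:  # stop condition: EOS token (or max_len above)
            break
    return seq
```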
Temperature scaling:
P'(x) = softmax(logits / T)
where T is the temperature parameter. T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more random).
Top-k filtering: Keep only the k highest-probability tokens, redistribute probability mass among them.
Top-p (nucleus) sampling: Keep the smallest set of tokens whose cumulative probability is greater than or equal to p.
Contrastive decoding:
score = (1 + β) · logits_expert − β · logits_amateur
where candidate tokens are first filtered by an adaptive plausibility threshold α: only tokens whose probability under the expert is at least α times the expert's maximum token probability are considered.
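A sketch of this scoring rule, assuming the α filter is applied to the expert's softmax probabilities (the helper name and toy logits are illustrative):

```python
import math

def contrastive_scores(expert_logits, amateur_logits, beta=0.5, alpha=0.1):
    # score = (1 + beta) * logits_expert - beta * logits_amateur, restricted
    # to tokens the expert finds plausible: p_expert >= alpha * max(p_expert).
    m = max(expert_logits)
    exps = [math.exp(l - m) for l in expert_logits]
    total = sum(exps)
    p_expert = [e / total for e in exps]
    cutoff = alpha * max(p_expert)
    scores = []
    for pe, le, la in zip(p_expert, expert_logits, amateur_logits):
        if pe < cutoff:
            scores.append(float("-inf"))  # fails the plausibility filter
        else:
            scores.append((1 + beta) * le - beta * la)
    return scores
```

The filter matters: without it, the amateur term can promote tokens the expert considers implausible, since subtracting a large negative amateur logit inflates their score.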
KV caching: Store key and value tensors from previous decoding steps to avoid recomputation, reducing the cost of generation from O(n²) to O(n) per token.
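A toy single-head cache illustrating why the per-token cost drops to O(n). The class name and list-based storage are illustrative; a real cache holds stacked key/value tensors per layer and head.

```python
import math

class KVCache:
    # Stores past key/value vectors; each decode step appends one new
    # (k, v) pair instead of re-encoding the whole prefix.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        # Attend the new query over all cached keys: O(n) work per token,
        # versus O(n^2) if the full sequence were recomputed each step.
        dots = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
                for key in self.keys]
        m = max(dots)
        w = [math.exp(d - m) for d in dots]
        total = sum(w)
        w = [x / total for x in w]
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

Because only the newest token's query attends over the cache, the prompt is processed once and each subsequent step touches a single new position.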