Principle: Lucidrains x-transformers Autoregressive Text Generation
Metadata
| Field | Value |
|---|---|
| Papers | Attention Is All You Need, Contrastive Decoding, Truncation Sampling (min-p) |
| Repository | x-transformers |
| Domains | Deep_Learning, NLP, Inference |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Sequential token generation strategy that produces text by sampling one token at a time from a trained autoregressive language model, conditioned on all previously generated tokens.
Description
After training, autoregressive generation produces sequences token-by-token. At each step, the model outputs a probability distribution over the vocabulary conditioned on the prompt plus all previously generated tokens. A token is sampled from this distribution (with optional filtering and temperature scaling) and appended to the sequence. This process repeats until a stop condition is met (maximum length or EOS token).
Various sampling strategies control the trade-off between quality and diversity:
- Temperature scaling — adjusts the sharpness of the probability distribution
- Top-k filtering — retains only the k highest-probability tokens
- Top-p (nucleus) sampling — retains the smallest set of tokens whose cumulative probability meets a threshold
- Min-p sampling — retains tokens whose probability is at least a fraction of the most likely token
- Top-a sampling — retains tokens whose probability exceeds an adaptive threshold that scales with the square of the maximum token probability
- Contrastive decoding — uses an amateur model to subtract out undesirable patterns
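The filtering strategies above can be sketched in plain Python. These helper names and the list-based representation are illustrative; a real implementation would operate on tensors of logits.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by 1/T, then normalize; T < 1 sharpens, T > 1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, renormalize the rest to 1.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest prefix (by descending probability) whose mass >= p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    masked = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(masked)
    return [q / total for q in masked]

def min_p_filter(probs, min_p):
    # Keep tokens whose probability is at least min_p * max(probs).
    cutoff = min_p * max(probs)
    kept = [q if q >= cutoff else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept]
```

Note how top-p and min-p adapt to the shape of the distribution: when the model is confident, top-p keeps very few tokens, and min-p's cutoff rises with the peak probability.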
KV caching speeds up generation by reusing previous key-value computations, avoiding redundant processing of the entire context at every step.
Usage
Use after training to generate text from prompts. Choose the sampling strategy based on the requirements of the application:
- temperature=0 for greedy, deterministic output
- top-k for bounded randomness with a fixed vocabulary window
- top-p for dynamic vocabulary truncation that adapts to the model's confidence
- Contrastive decoding for higher-quality output by contrasting expert and amateur models
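The greedy-versus-sampled choice above can be captured in one small helper. The function name and structure are illustrative, not the x-transformers API; temperature=0 is treated as argmax by convention, since dividing logits by zero is undefined.

```python
import math
import random

def select_token(logits, temperature=0.0, rng=None):
    # temperature == 0 -> greedy argmax (deterministic output).
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from softmax(logits / T).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return (rng or random).choices(range(len(logits)), weights=probs)[0]
```

A top-k or top-p filter would be applied to the logits before this selection step.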
Theoretical Basis
Autoregressive generation:
x_t ~ P(x_t | x_1, x_2, ..., x_{t-1})
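The sampling loop implied by this formula can be sketched with a toy stand-in for the model. The LOGITS table below is a hypothetical bigram lookup, not a trained network; a real transformer would condition on the whole prefix, but the loop structure is the same.

```python
# Toy "model": fixed next-token logits per previous token, over a 4-token
# vocabulary {0: "a", 1: "b", 2: "c", 3: <eos>}. Purely illustrative.
LOGITS = {
    0: [0.1, 2.0, 0.5, 0.2],
    1: [0.3, 0.1, 2.5, 0.4],
    2: [0.2, 0.3, 0.1, 3.0],
    3: [0.0, 0.0, 0.0, 5.0],
}
EOS = 3

def generate(prompt, max_len):
    # x_t ~ P(x_t | x_1, ..., x_{t-1}); greedy selection for determinism
    # in this sketch -- swap in temperature/top-k/top-p sampling as desired.
    seq = list(prompt)
    while len(seq) < max_len:
        logits = LOGITS[seq[-1]]
        nxt = max(range(len(logits)), key=lambda i: logits[i])
        seq.append(nxt)
        if nxt == EOS:  # stop condition: EOS token (or max_len above)
            break
    return seq
```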
Temperature scaling:
P'(x) = softmax(logits / T)
where T is the temperature parameter. T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more random).
Top-k filtering: Keep only the k highest-probability tokens, redistribute probability mass among them.
Top-p (nucleus) sampling: Keep the smallest set of tokens whose cumulative probability is greater than or equal to p.
Contrastive decoding:
score = (1 + β) · logits_expert − β · logits_amateur
where candidate tokens are first filtered by an adaptive plausibility threshold α: only tokens whose probability under the expert is at least α times the expert's maximum token probability are considered.
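A sketch of this scoring rule, assuming the α filter is applied to the expert's softmax probabilities (the helper name and toy logits are illustrative):

```python
import math

def contrastive_scores(expert_logits, amateur_logits, beta=0.5, alpha=0.1):
    # score = (1 + beta) * logits_expert - beta * logits_amateur, restricted
    # to tokens the expert finds plausible: p_expert >= alpha * max(p_expert).
    m = max(expert_logits)
    exps = [math.exp(l - m) for l in expert_logits]
    total = sum(exps)
    p_expert = [e / total for e in exps]
    cutoff = alpha * max(p_expert)
    scores = []
    for pe, le, la in zip(p_expert, expert_logits, amateur_logits):
        if pe < cutoff:
            scores.append(float("-inf"))  # fails the plausibility filter
        else:
            scores.append((1 + beta) * le - beta * la)
    return scores
```

The filter matters: without it, the amateur term can promote tokens the expert considers implausible, since subtracting a large negative amateur logit inflates their score.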
KV caching: Store key and value tensors from previous decoding steps to avoid recomputation, reducing the cost of generation from O(n²) to O(n) per token.
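A toy single-head cache illustrating why the per-token cost drops to O(n). The class name and list-based storage are illustrative; a real cache holds stacked key/value tensors per layer and head.

```python
import math

class KVCache:
    # Stores past key/value vectors; each decode step appends one new
    # (k, v) pair instead of re-encoding the whole prefix.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        # Attend the new query over all cached keys: O(n) work per token,
        # versus O(n^2) if the full sequence were recomputed each step.
        dots = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
                for key in self.keys]
        m = max(dots)
        w = [math.exp(d - m) for d in dots]
        total = sum(w)
        w = [x / total for x in w]
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

Because only the newest token's query attends over the cache, the prompt is processed once and each subsequent step touches a single new position.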