
Principle:Lucidrains X transformers Autoregressive Text Generation

From Leeroopedia


Metadata

Field Value
Papers Attention Is All You Need, Contrastive Decoding, Truncation Sampling (min-p)
Repository x-transformers
Domains Deep_Learning, NLP, Inference
Last Updated 2026-02-08 18:00 GMT

Overview

Sequential token generation strategy that produces text by sampling one token at a time from a trained autoregressive language model, conditioned on all previously generated tokens.

Description

After training, autoregressive generation produces sequences token-by-token. At each step, the model outputs a probability distribution over the vocabulary conditioned on the prompt plus all previously generated tokens. A token is sampled from this distribution (with optional filtering and temperature scaling) and appended to the sequence. This process repeats until a stop condition is met (maximum length or EOS token).
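The loop described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's implementation: the `model` interface (a callable returning per-position logits of shape `(batch, seq_len, vocab)`) and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, max_new_tokens, eos_token=None, temperature=1.0):
    # prompt: (batch, prompt_len) tensor of token ids
    seq = prompt
    for _ in range(max_new_tokens):
        logits = model(seq)[:, -1]                # next-token distribution
        probs = F.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample one token per sequence
        seq = torch.cat((seq, next_token), dim=-1)
        if eos_token is not None and (next_token == eos_token).all():
            break                                 # stop condition: EOS emitted
    return seq
```

In practice the context is also truncated to the model's maximum sequence length, and a filtering strategy is applied to `logits` before the softmax.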

Various sampling strategies control the trade-off between quality and diversity:

  • Temperature scaling — adjusts the sharpness of the probability distribution
  • Top-k filtering — retains only the k highest-probability tokens
  • Top-p (nucleus) sampling — retains the smallest set of tokens whose cumulative probability meets a threshold
  • Min-p sampling — retains tokens whose probability is at least a fraction of the most likely token
  • Top-a sampling — adaptive threshold filtering
  • Contrastive decoding — uses an amateur model to subtract out undesirable patterns

KV caching speeds up generation by reusing previous key-value computations, avoiding redundant processing of the entire context at every step.

Usage

Use after training to generate text from prompts. Choose the sampling strategy based on the requirements of the application:

  • temperature → 0 (argmax) for greedy, deterministic output
  • top-k for bounded randomness with a fixed vocabulary window
  • top-p for dynamic vocabulary truncation that adapts to the model's confidence
  • Contrastive decoding for higher-quality output by contrasting expert and amateur models

Theoretical Basis

Autoregressive generation:

x_t ~ P(x_t | x_1, x_2, ..., x_{t-1})

Temperature scaling:

P'(x) = softmax(logits / T)

where T is the temperature parameter. T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more random).
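The effect is easy to verify numerically; a small check (logit values are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])

sharp = F.softmax(logits / 0.5, dim=-1)   # T < 1: mass concentrates on the top token
base  = F.softmax(logits / 1.0, dim=-1)   # T = 1: the model's raw distribution
flat  = F.softmax(logits / 2.0, dim=-1)   # T > 1: mass spreads toward uniform
```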

Top-k filtering: Keep only the k highest-probability tokens, redistribute probability mass among them.

Top-p (nucleus) sampling: Keep the smallest set of tokens whose cumulative probability is greater than or equal to p.
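The top-k, top-p, and min-p rules above can be sketched as logit filters: discarded tokens are set to −inf so they receive zero probability after the softmax. Function names are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits, k):
    # keep the k highest logits; everything else becomes -inf
    kth = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth, float('-inf'))

def top_p_filter(logits, p):
    # nucleus: keep the smallest prefix of sorted tokens whose cumulative
    # probability reaches p
    sorted_logits, idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cum_before = torch.cumsum(probs, dim=-1) - probs   # mass strictly before each token
    sorted_logits = sorted_logits.masked_fill(cum_before >= p, float('-inf'))
    return torch.full_like(logits, float('-inf')).scatter(-1, idx, sorted_logits)

def min_p_filter(logits, min_p):
    # keep tokens whose probability is at least min_p times the top token's
    probs = F.softmax(logits, dim=-1)
    limit = min_p * probs.amax(dim=-1, keepdim=True)
    return logits.masked_fill(probs < limit, float('-inf'))
```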

Contrastive decoding:

score = (1 + β) · logits_expert − β · logits_amateur

with candidates filtered by an adaptive plausibility threshold α, so that only tokens to which the expert assigns at least α times its top-token probability are considered.
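A hedged sketch of this scoring rule (the function name is illustrative; β and α follow the formula above):

```python
import torch
import torch.nn.functional as F

def contrastive_decode(logits_expert, logits_amateur, beta=0.5, alpha=0.1):
    # adaptive plausibility: keep only tokens whose expert probability is at
    # least alpha times the expert's most likely token
    probs_expert = F.softmax(logits_expert, dim=-1)
    plausible = probs_expert >= alpha * probs_expert.amax(dim=-1, keepdim=True)
    # contrast the two models: boost the expert, subtract the amateur
    score = (1 + beta) * logits_expert - beta * logits_amateur
    return score.masked_fill(~plausible, float('-inf'))
```

Sampling (or argmax) then proceeds over the returned scores as usual.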

KV caching: Store key and value tensors from previous decoding steps to avoid recomputation, reducing the cost of generation from O(n²) to O(n) per token.
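The caching idea reduces to concatenating the newest key/value projections onto stored tensors instead of recomputing them; a single-head sketch with an assumed helper name:

```python
import torch

def attend_with_cache(q, new_k, new_v, cache):
    # q, new_k, new_v: (batch, 1, dim) -- projections for the newest token only.
    # cache: (K, V) tensors accumulated over all previous steps, or None.
    if cache is None:
        k, v = new_k, new_v
    else:
        k = torch.cat((cache[0], new_k), dim=1)
        v = torch.cat((cache[1], new_v), dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v, (k, v)   # output for the new position + updated cache
```

Each step attends the single new query against the full cached history, which is the O(n)-per-token cost stated above.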

Related Pages

Implemented By

Uses Heuristic
