Principle:Romsto Speculative Decoding Autoregressive Generation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language_Models, Inference |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
The standard sequential text generation method where each token is produced by conditioning on all previously generated tokens, serving as the baseline against which speculative methods are compared.
Description
Autoregressive Generation is the canonical method for producing text from decoder-only transformer language models. At each step, the model takes the full sequence of tokens generated so far, computes a probability distribution over the vocabulary for the next position, samples a token from that distribution (using a chosen sampling strategy), and appends it to the sequence. This process repeats until an end-of-sequence token is produced or a maximum length is reached.
While simple and correct, autoregressive generation is inherently sequential: each token depends on all the tokens before it, so the tokens of a single sequence cannot be generated in parallel. For large models, each decoding forward pass is typically memory-bandwidth-bound: the time goes into reading the model weights from memory rather than into arithmetic, so the GPU's computational capacity is underutilized. This is the fundamental bottleneck that speculative decoding and NASD aim to address.
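A rough way to see the bandwidth bound: at batch size 1, every decoding step has to read each model weight from GPU memory at least once, so the step time cannot drop below the weight size divided by the memory bandwidth. The sketch below computes that ceiling on tokens per second. The function name and the numbers (7 billion parameters, 2 bytes per weight, 1 TB/s) are illustrative assumptions, not measurements from this repository, and the estimate ignores KV-cache and activation traffic.

```python
def max_decode_tokens_per_second(num_params: float,
                                 bytes_per_param: float,
                                 bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-size-1 decoding speed when weight reads dominate."""
    weight_bytes = num_params * bytes_per_param
    # One full pass over the weights is needed per generated token.
    return bandwidth_bytes_per_s / weight_bytes


# Assumed, illustrative hardware and model figures.
bound = max_decode_tokens_per_second(7e9, 2, 1e12)
print(f"at most ~{bound:.0f} tokens/s")  # prints: at most ~71 tokens/s
```

Because the bound depends only on bytes moved, verifying several drafted tokens in a single forward pass costs about the same memory traffic as generating one, which is the slack that speculative methods exploit.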
In this repository, autoregressive generation serves as the baseline for comparing throughput against speculative decoding and NASD in the interactive CLI.
Usage
Use this principle as the reference baseline for evaluating inference acceleration techniques. It is also the appropriate generation method when no drafter model or n-gram storage is available, or when the output must come directly from the target model with no approximation. The CLI comparison tool uses the autoregressive method to measure the throughput improvement achieved by speculative methods.
Theoretical Basis
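Given a prompt $x_1, \ldots, x_m$, autoregressive generation produces tokens sequentially:
$$x_t \sim P(x_t \mid x_1, \ldots, x_{t-1}; \theta), \qquad t = m+1, m+2, \ldots$$
Where $\theta$ are the model parameters and $P$ is the output distribution after the chosen sampling strategy (greedy, nucleus, etc.) is applied.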
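To make "the output distribution after the chosen sampling strategy is applied" concrete, here is a minimal sketch of two common strategies acting on a probability vector: greedy decoding and nucleus (top-p) filtering. All function names are hypothetical and chosen for illustration; they are not taken from this repository.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy decoding: put all probability mass on the most likely token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


def nucleus(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def sample(probs, rng=random):
    """Draw one token id from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([2.0, 1.0, 0.1, -1.0])
print(greedy(probs))              # all mass on token 0
print(nucleus(probs, top_p=0.9))  # least likely token is zeroed out
```

Each strategy maps the raw model distribution to a modified one; the next token is then drawn from the modified distribution with `sample`.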
Pseudo-code:
# Abstract autoregressive generation
for position in range(prompt_len, max_length):
    logits = model(tokens[:position])[-1]  # last position logits
    probs = sampling_strategy(logits)
    next_token = sample(probs)
    tokens[position] = next_token
    if next_token == eos_token:
        break
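The loop above can be made runnable with a stand-in model. The sketch below is an assumption-laden toy, not the repository's implementation: `toy_model`, `VOCAB_SIZE`, and `EOS_TOKEN` are hypothetical, and the "model" is a fixed rule that favors the token following the last one.

```python
import math
import random

VOCAB_SIZE = 5
EOS_TOKEN = 0  # hypothetical end-of-sequence id for this toy vocabulary


def toy_model(tokens):
    """Stand-in for a language model: next-token logits given the prefix.

    Toy rule: strongly favor (last_token + 1) % VOCAB_SIZE.
    """
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_length, greedy=True, seed=0):
    """Generate one token per step until EOS or max_length total tokens."""
    rng = random.Random(seed)
    tokens = list(prompt)
    while len(tokens) < max_length:
        probs = softmax(toy_model(tokens))  # one forward pass per new token
        if greedy:
            next_token = max(range(VOCAB_SIZE), key=probs.__getitem__)
        else:
            next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return tokens


print(generate([1, 2], max_length=10))  # prints [1, 2, 3, 4, 0]
```

Note that the number of model calls equals the number of generated tokens; this one-call-per-token pattern is what speculative methods try to reduce.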
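Computational cost: Each generated token requires one full forward pass through the model. For a model with hidden size d_model and L layers, the weight-matrix work with a key-value cache is O(L * d_model^2) per token, plus an attention term of O(L * t * d_model) at context length t. When the attention term is small, total generation cost is O(n * L * d_model^2) for n tokens. Without a cache, as in the pseudo-code above, each step re-runs the model on the whole prefix, so the per-step cost also grows with the prefix length.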
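As a hedged illustration of these scaling terms, the sketch below counts operations up to constant factors, with and without a key-value cache. The function `decode_cost` and its cost model are assumptions made for this example: it counts d_model^2 work per processed position for the weight matrices and context * d_model work per processed position for attention, and it ignores the one-time cost of processing the prompt when a cache is used.

```python
def decode_cost(n_new, n_layers, d_model, prompt_len=0, kv_cache=True):
    """Rough operation count (up to constant factors) for generating n_new tokens."""
    total = 0
    for step in range(n_new):
        context = prompt_len + step + 1  # positions visible at this step
        # With a cache only the newest position is processed; otherwise all of them.
        processed = 1 if kv_cache else context
        weights = processed * d_model ** 2
        attention = processed * context * d_model
        total += n_layers * (weights + attention)
    return total


cached = decode_cost(256, n_layers=32, d_model=4096)
uncached = decode_cost(256, n_layers=32, d_model=4096, kv_cache=False)
print(f"uncached / cached cost ratio: {uncached / cached:.1f}")
```

Either way the number of sequential forward passes stays at n, so caching reduces the work per step but not the sequential dependency.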