Principle:Romsto Speculative Decoding Autoregressive Generation

Knowledge Sources	Attention Is All You Need; Language Models are Unsupervised Multitask Learners
Domains	NLP, Language_Models, Inference
Last Updated	2026-02-14 04:30 GMT

Overview

Autoregressive generation is the standard sequential method for text generation: each token is produced by conditioning on all previously generated tokens. It serves as the baseline against which speculative methods are compared.

Description

Autoregressive Generation is the canonical method for producing text from decoder-only transformer language models. At each step, the model takes the full sequence of tokens generated so far, computes a probability distribution over the vocabulary for the next position, samples a token from that distribution (using a chosen sampling strategy), and appends it to the sequence. This process repeats until an end-of-sequence token is produced or a maximum length is reached.

While simple and correct, autoregressive generation is inherently sequential: each token depends on all of the tokens before it, so the tokens of a single sequence cannot be generated in parallel. For large models at small batch sizes, each forward pass is typically memory-bandwidth-bound: most of the time goes to reading the weights from memory, and the GPU's computational capacity is underutilized. This is the fundamental bottleneck that speculative decoding and NASD aim to address.
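The memory-bandwidth argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that every weight is read from GPU memory once per generated token (batch size 1) and ignores activations and caches. The parameter count, precision, and bandwidth are hypothetical figures chosen for illustration, not measurements from this repository.

```python
def min_seconds_per_token(n_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if all weights are streamed once per token."""
    return (n_params * bytes_per_param) / bandwidth_bytes_per_s


# Hypothetical setup: 7e9 parameters stored in 16 bits, 1e12 bytes/s of memory bandwidth.
latency = min_seconds_per_token(n_params=7e9, bytes_per_param=2, bandwidth_bytes_per_s=1e12)
max_tokens_per_second = 1.0 / latency

print(f"{latency * 1e3:.1f} ms per token, at most {max_tokens_per_second:.0f} tokens/s")
```

Under these assumptions the bound depends only on model size and bandwidth, not on arithmetic throughput, which is why verifying several tokens in one forward pass can pay off.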

In this repository, autoregressive generation serves as the baseline for comparing throughput against speculative decoding and NASD in the interactive CLI.

Usage

Use this principle as the reference baseline for evaluating inference acceleration techniques. It is also the appropriate generation method when no drafter model or n-gram storage is available, or when absolute correctness without any approximation is required. The autoregressive method is used in the CLI comparison tool to measure the throughput improvement achieved by speculative methods.
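A baseline comparison of this kind can be sketched as a small timing helper. This is a minimal illustration, not the repository's CLI code: `measure_throughput` and `dummy_generate` are hypothetical names, and the helper assumes the generation function returns the prompt followed by the newly generated tokens.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(generate_fn: Callable[[Sequence[int]], List[int]],
                       prompt: Sequence[int]) -> float:
    """Return newly generated tokens per second for one call to generate_fn."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    new_tokens = len(output) - len(prompt)
    return new_tokens / elapsed if elapsed > 0 else float("inf")


def dummy_generate(prompt: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for a real generation function."""
    time.sleep(0.01)  # pretend the model takes some time
    return list(prompt) + [0] * 8


baseline_tps = measure_throughput(dummy_generate, [1, 2, 3])
print(f"baseline: {baseline_tps:.1f} tokens/s")
```

The speedup of an accelerated method is then its tokens per second divided by `baseline_tps`, measured on the same prompt and hardware.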

Theoretical Basis

Given a prompt $x_{1}, \dots, x_{T}$, autoregressive generation produces tokens sequentially for $t = T, T + 1, \dots$:

$x_{t + 1} \sim P(x \mid x_{1}, \dots, x_{t}; \theta)$

where $\theta$ denotes the model parameters and $P$ is the output distribution after the chosen sampling strategy (greedy, nucleus, etc.) is applied.
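To illustrate what "after the chosen sampling strategy is applied" means, here is a minimal pure-Python sketch of two strategies that turn raw logits into the distribution a token is sampled from. The function names and the example logits are assumptions made for illustration; they are not taken from the repository.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Put all probability mass on the arg-max token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]


def nucleus(logits, top_p=0.9):
    """Keep the smallest set of most probable tokens whose mass reaches top_p, then renormalize."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def sample(probs, rng=random):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical 4-token vocabulary
print(greedy(example_logits))
print(nucleus(example_logits, top_p=0.9))
print(sample(nucleus(example_logits, top_p=0.9)))
```

Greedy decoding is the special case in which the distribution collapses onto a single token, so sampling from it is deterministic.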

Pseudo-code:

# Abstract autoregressive generation
for position in range(prompt_len, max_length):
    logits = model(tokens[:position])[-1]  # last position logits
    probs = sampling_strategy(logits)
    next_token = sample(probs)
    tokens[position] = next_token
    if next_token == eos_token:
        break
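The pseudo-code above can be turned into a small runnable sketch. Everything model-related here is a stand-in: `toy_model`, `VOCAB_SIZE`, and `EOS_TOKEN` are hypothetical and only mimic the interface of a language model (full prefix in, next-token logits out), so the loop can be executed without any dependencies.

```python
import math
import random

VOCAB_SIZE = 5
EOS_TOKEN = 0  # hypothetical end-of-sequence id


def toy_model(tokens):
    """Hypothetical stand-in for a language model: next-token logits given the full prefix."""
    target = (tokens[-1] + 1) % VOCAB_SIZE  # favor "last token + 1"
    return [4.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


def generate(prompt, max_length, greedy=True, seed=0):
    """Append one token per step until EOS is produced or max_length is reached."""
    rng = random.Random(seed)
    tokens = list(prompt)
    while len(tokens) < max_length:
        logits = toy_model(tokens)  # one forward pass per generated token
        if greedy:
            next_token = max(range(VOCAB_SIZE), key=lambda i: logits[i])
        else:
            m = max(logits)
            weights = [math.exp(l - m) for l in logits]  # unnormalized softmax
            next_token = rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]
        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return tokens


print(generate([1], max_length=10))  # -> [1, 2, 3, 4, 0]
```

Each iteration needs the token appended by the previous one, which is the sequential dependency that prevents the tokens of one sequence from being produced in parallel.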

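Computational cost: Each token requires one full forward pass through the model. For a model with hidden size d_model and L layers, the weight matrix multiplications cost O(L * d_model^2) per token, provided the keys and values of earlier positions are cached, so generating n tokens costs O(n * L * d_model^2). Attention over a context of t tokens adds O(L * t * d_model) per token, which becomes significant for long sequences. Without caching, as in the pseudo-code above, which re-runs the model on the whole prefix at every step, the cost of each step also grows with the prefix length.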

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_Autoregressive_Generate
