Principle:Romsto Speculative Decoding Encoder Decoder Autoregressive Generation

Knowledge Sources	Attention Is All You Need; Exploring the Limits of Transfer Learning with T5
Domains	NLP, Language_Models, Inference, Encoder_Decoder
Last Updated	2026-02-14 05:00 GMT

Overview

The standard sequential text generation method for encoder-decoder transformer models, in which the encoder processes the input once and the decoder generates tokens autoregressively. It serves as the baseline against which encoder-decoder speculative methods are compared.

Description

Encoder-Decoder Autoregressive Generation adapts the canonical autoregressive generation paradigm to seq2seq transformer architectures (e.g., T5, BART, mBART). Unlike decoder-only autoregressive generation, where the prompt and output share the same sequence space, encoder-decoder models separate input processing (encoder) from output generation (decoder). The encoder processes the full input sequence once to produce a set of hidden representations. The decoder then generates tokens one at a time, attending both to its own previously generated tokens (via self-attention) and to the encoder representations (via cross-attention).

The decoder sequence is initialized with a special decoder_start_token_id from the model configuration. At each step, the model receives the fixed encoder input and the growing decoder prefix, computes a probability distribution over the vocabulary at the last decoder position, and samples from it using the chosen sampling strategy.
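The "sampling strategy" step can be illustrated with a minimal sketch. It assumes temperature scaling followed by top-k filtering, which is only one possible choice; the source does not say which strategies the repository uses, and the helper names `softmax`, `top_k_filter`, and `sample_next_token` are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities (temperature is an assumed knob)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize the rest to zero."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample_next_token(last_position_logits, k=2, temperature=1.0, rng=random):
    """Sample one token id from the logits at the last decoder position."""
    probs = top_k_filter(softmax(last_position_logits, temperature), k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing a baseline against another decoding method.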

In this repository, encoder-decoder autoregressive generation serves as the baseline for evaluating the encoder-decoder speculative decoding variant. It is the direct counterpart of the decoder-only autoregressive baseline used for standard speculative decoding comparisons.

Usage

Use this principle when generating output from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) without speculative acceleration. It is the appropriate method when no drafter encoder-decoder model is available, or when establishing baseline throughput for encoder-decoder speculative decoding benchmarks.
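A baseline throughput measurement might look like the sketch below. It is not the repository's benchmark code: `measure_throughput` and `dummy_generate` are hypothetical names, and the sketch assumes the generation function returns the decoder token ids including the start token.

```python
import time

def measure_throughput(generate_fn, inputs, repeats=3):
    """Return generated tokens per second over `repeats` passes through `inputs`."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for input_ids in inputs:
            output_ids = generate_fn(input_ids)
            total_tokens += len(output_ids) - 1  # exclude the decoder start token
    elapsed = time.perf_counter() - start
    return total_tokens / max(elapsed, 1e-12)  # guard against a zero interval

def dummy_generate(input_ids):
    # Hypothetical stand-in for a real model: start token, two tokens, EOS.
    return [0, 5, 6, 1]

tokens_per_second = measure_throughput(dummy_generate, [[1, 2], [3]], repeats=2)
```

Measuring the speculative variant with the same function and inputs gives a like-for-like comparison against this baseline.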

Theoretical Basis

Given encoder input $x_{1}, \dots, x_{S}$, the encoder produces hidden states $H = \mathrm{Encoder}(x_{1}, \dots, x_{S})$. The decoder generates tokens sequentially:

$y_{t+1} \sim P(y \mid y_{0}, y_{1}, \dots, y_{t}, H; \theta)$

where $y_{0}$ is the decoder start token, $\theta$ are the model parameters, and $P$ is the output distribution after the sampling strategy has been applied.
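
Applying the chain rule to this sampling step gives the probability of a complete output sequence. Here $T$, the number of generated tokens, is a symbol introduced only for this illustration:

$P(y_{1}, \dots, y_{T} \mid x_{1}, \dots, x_{S}; \theta) = \prod_{t=0}^{T-1} P(y_{t+1} \mid y_{0}, y_{1}, \dots, y_{t}, H; \theta)$

Each factor requires one decoder step, while $H$ is computed once and shared by all $T$ factors.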

Pseudo-code:

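# Abstract encoder-decoder autoregressive generation
encoder_hidden = encoder(input_ids)        # single encoder pass, reused at every step
decoder_ids = [decoder_start_token_id]     # decoder prefix starts with y_0

for position in range(1, max_length):      # at most max_length tokens, counting y_0
    logits = decoder(decoder_ids, encoder_hidden)[-1]  # last position
    probs = sampling_strategy(logits)      # turn logits into a sampling distribution
    next_token = sample(probs)
    decoder_ids.append(next_token)
    if next_token == eos_token:            # stop once end-of-sequence is generated
        break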
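
The loop above can be made runnable with toy stand-ins. In this sketch `toy_encoder` and `toy_decoder` are hypothetical placeholders with arbitrary arithmetic in place of a real model, the token ids are assumed values, and greedy selection replaces sampling so the run is deterministic.

```python
VOCAB_SIZE = 8
DECODER_START_TOKEN = 0   # assumed id for the decoder start token
EOS_TOKEN = 1             # assumed id for end-of-sequence
calls = {"encoder": 0, "decoder": 0}  # counts forward passes

def toy_encoder(input_ids):
    """Stand-in encoder: one float 'hidden state' per input token."""
    calls["encoder"] += 1
    return [tok / VOCAB_SIZE for tok in input_ids]

def toy_decoder(decoder_ids, encoder_hidden):
    """Stand-in decoder: one logit vector per decoder position."""
    calls["decoder"] += 1
    context = sum(encoder_hidden)  # scalar summary standing in for cross-attention
    all_logits = []
    for pos, tok in enumerate(decoder_ids):
        logits = [((tok + v + pos) % VOCAB_SIZE) * 0.5 + context
                  for v in range(VOCAB_SIZE)]
        logits[EOS_TOKEN] += 0.3 * pos  # raise the EOS logit as the output grows
        all_logits.append(logits)
    return all_logits

def generate(input_ids, max_length=16):
    encoder_hidden = toy_encoder(input_ids)  # encoder runs exactly once
    decoder_ids = [DECODER_START_TOKEN]
    for _ in range(1, max_length):
        logits = toy_decoder(decoder_ids, encoder_hidden)[-1]  # last position
        next_token = max(range(VOCAB_SIZE), key=lambda v: logits[v])  # greedy pick
        decoder_ids.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return decoder_ids

output_ids = generate([3, 5, 2], max_length=10)
```

After the call, `calls["encoder"]` is 1 and `calls["decoder"]` equals the number of generated tokens, which matches the one-encoder-pass, many-decoder-steps structure of the pseudo-code.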

Key difference from decoder-only: The encoder forward pass happens once and its representations are reused at every decoder step via cross-attention. This means the encoder cost is amortized over all generated tokens, and the per-token cost is dominated by the decoder forward pass, including its self-attention and cross-attention computations.
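
The amortization can be illustrated with simple arithmetic. The sketch below assumes a constant cost per decoder step, which is a simplification because real decoder cost grows with the prefix length; `cost_per_token` and its cost units are hypothetical.

```python
def cost_per_token(encoder_cost, decoder_step_cost, num_tokens):
    """Average cost per generated token when the encoder runs only once.

    Assumes every decoder step costs the same (a deliberate simplification).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return encoder_cost / num_tokens + decoder_step_cost

short_output = cost_per_token(10.0, 1.0, num_tokens=1)    # encoder cost dominates
long_output = cost_per_token(10.0, 1.0, num_tokens=100)   # decoder cost dominates
```

With these illustrative numbers the average drops from 11 cost units per token for a single-token output to about 1.1 for a 100-token output, so longer outputs leave the decoder as the main expense.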

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_Autoregressive_Generate_Encoder_Decoder
