
Principle:Romsto Speculative Decoding Encoder Decoder Autoregressive Generation

From Leeroopedia
Revision as of 17:25, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Romsto_Speculative_Decoding_Encoder_Decoder_Autoregressive_Generation.md)
Knowledge Sources
Domains NLP, Language_Models, Inference, Encoder_Decoder
Last Updated 2026-02-14 05:00 GMT

Overview

Encoder-decoder autoregressive generation is the standard sequential text generation method for encoder-decoder transformer models: the encoder processes the input once, and the decoder then generates tokens autoregressively. It serves as the baseline against which encoder-decoder speculative methods are compared.

Description

Encoder-Decoder Autoregressive Generation adapts the canonical autoregressive generation paradigm to seq2seq transformer architectures (e.g., T5, BART, mBART). Unlike decoder-only autoregressive generation, where the prompt and output share the same sequence space, encoder-decoder models separate input processing (encoder) from output generation (decoder). The encoder processes the full input sequence once to produce a set of hidden representations. The decoder then generates tokens one at a time, attending both to its own previously generated tokens (via self-attention) and to the encoder representations (via cross-attention).
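The cross-attention step described above can be illustrated with a minimal sketch. It assumes a single attention head and identity projections (real models apply learned query, key, and value matrices), and the names `cross_attention`, `encoder_states`, and `decoder_states` are hypothetical, chosen only for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Single-head dot-product cross-attention with identity projections.

    Queries come from the decoder; keys and values come from the encoder.
    Learned projection matrices are omitted (an assumption made for brevity).
    """
    dim = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        # Score this decoder position against every encoder position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in encoder_states]
        weights = softmax(scores)
        # Weighted average of the encoder states, which double as values here.
        outputs.append([sum(w * value[d] for w, value in zip(weights, encoder_states))
                        for d in range(dim)])
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # S = 3 input positions
decoder_states = [[0.5, 0.5], [2.0, 0.0]]              # 2 decoder positions
context = cross_attention(decoder_states, encoder_states)
```

Each decoder position receives a context vector that is a convex combination of the encoder states, which is why the encoder output can be computed once and reused at every step.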

The decoder sequence is initialized with a special decoder_start_token_id from the model configuration. At each step, the model receives the fixed encoder input and the growing decoder prefix, computes a probability distribution over the vocabulary at the last decoder position, and samples from it using the chosen sampling strategy.
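As an illustration of the last step, the sketch below turns a logit vector into a probability distribution and draws one token from it. Temperature scaling and top-k filtering are assumed here as example sampling strategies; the page does not say which strategies the repository actually uses, and the function names are hypothetical.

```python
import math
import random

def sampling_strategy(logits, temperature=1.0, top_k=None):
    """Turn raw logits into a probability distribution.

    Temperature scaling followed by optional top-k filtering; both are
    illustrative choices, not necessarily what the repository implements.
    """
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the top_k largest logits and mask out the rest
        # (ties at the cutoff value are all kept).
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng=random):
    """Draw one token id from the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]          # logits at the last decoder position
probs = sampling_strategy(logits, temperature=0.7, top_k=2)
next_token = sample(probs, random.Random(0))
```

With `top_k=2`, only the two highest-scoring tokens keep non-zero probability, so `next_token` is always 0 or 1 for these logits.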

In this repository, encoder-decoder autoregressive generation serves as the baseline for evaluating the encoder-decoder speculative decoding variant. It is the direct counterpart of the decoder-only autoregressive baseline used for standard speculative decoding comparisons.

Usage

Use this principle when generating output from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) without speculative acceleration. It is the appropriate method when no drafter encoder-decoder model is available, or when establishing baseline throughput for encoder-decoder speculative decoding benchmarks.

Theoretical Basis

Given encoder input $x_1, \dots, x_S$, the encoder produces hidden states $H = \mathrm{Encoder}(x_1, \dots, x_S)$. The decoder generates tokens sequentially:

$$y_{t+1} \sim P(y \mid y_0, y_1, \dots, y_t, H; \theta)$$

Where $y_0$ is the decoder start token, $\theta$ are the model parameters, and $P$ is the output distribution after sampling strategy application.
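Applying this rule at every step implies, by the chain rule of probability, that the probability of a complete sampled output factorizes over decoder positions. The output length $T$ is a symbol introduced here for illustration; the other symbols are those defined above.

```latex
P(y_1, \dots, y_T \mid x_1, \dots, x_S; \theta)
  = \prod_{t=0}^{T-1} P(y_{t+1} \mid y_0, y_1, \dots, y_t, H; \theta),
\qquad H = \mathrm{Encoder}(x_1, \dots, x_S)
```

Every factor conditions on the same $H$, which is why a single encoder pass suffices for the whole sequence.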

Pseudo-code:

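# Abstract encoder-decoder autoregressive generation
encoder_hidden = encoder(input_ids)        # encoder runs once per input
decoder_ids = [decoder_start_token_id]     # decoder prefix starts with y_0

for _ in range(1, max_length):             # at most max_length decoder tokens, start token included
    logits = decoder(decoder_ids, encoder_hidden)[-1]  # logits at the last decoder position
    probs = sampling_strategy(logits)      # apply the chosen sampling strategy
    next_token = sample(probs)
    decoder_ids.append(next_token)
    if next_token == eos_token_id:         # stop once end-of-sequence is generated
        break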
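To make the loop concrete, here is a runnable toy version. It is a sketch under strong assumptions: `toy_encoder` and `toy_decoder` are hypothetical stand-ins with arbitrary arithmetic rules rather than real transformer components, and greedy selection is used as the sampling strategy so the result is deterministic.

```python
import random

VOCAB_SIZE = 6
EOS_TOKEN_ID = 0
DECODER_START_TOKEN_ID = 1

def toy_encoder(input_ids):
    """Hypothetical encoder: summarizes the input as a single integer."""
    return sum(input_ids) % VOCAB_SIZE

def toy_decoder(decoder_ids, encoder_hidden):
    """Hypothetical decoder: returns one logit vector per decoder position."""
    all_logits = []
    for position, token in enumerate(decoder_ids):
        target = (token + encoder_hidden + position) % VOCAB_SIZE
        all_logits.append([5.0 if v == target else 0.0 for v in range(VOCAB_SIZE)])
    return all_logits

def greedy(logits):
    """Sampling strategy that puts all probability on the argmax."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return [1.0 if v == best else 0.0 for v in range(len(logits))]

def generate(input_ids, max_length=10, strategy=greedy, rng=random):
    encoder_hidden = toy_encoder(input_ids)      # encoder runs once
    decoder_ids = [DECODER_START_TOKEN_ID]
    for _ in range(1, max_length):
        logits = toy_decoder(decoder_ids, encoder_hidden)[-1]  # last position
        probs = strategy(logits)
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        decoder_ids.append(next_token)
        if next_token == EOS_TOKEN_ID:
            break
    return decoder_ids

output = generate([3, 5])
print(output)  # [1, 3, 0]: start token, one generated token, then EOS
```

Passing a different `strategy` (any function mapping logits to probabilities) changes how tokens are chosen without touching the loop itself.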

Key difference from decoder-only: The encoder forward pass happens once, and its representations are reused at every decoder step via cross-attention. The encoder cost is therefore amortized over all generated tokens, and the per-token cost is dominated by the decoder's self-attention and cross-attention computations.
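The amortization can be checked by counting forward passes. The sketch below wraps hypothetical stand-in components in a counter; the `CallCounter` class and the arithmetic inside the lambdas are assumptions for illustration only and do not model real compute cost.

```python
class CallCounter:
    """Wraps a callable and counts how many times it is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)

# Hypothetical stand-ins for real model components.
encoder = CallCounter(lambda input_ids: [float(i) for i in input_ids])
decoder = CallCounter(lambda decoder_ids, hidden: (len(decoder_ids) + len(hidden)) % 7)

def generate(input_ids, max_length, start_id=1, eos_id=0):
    hidden = encoder(input_ids)                    # one encoder pass per request
    decoder_ids = [start_id]
    for _ in range(1, max_length):
        next_token = decoder(decoder_ids, hidden)  # one decoder pass per new token
        decoder_ids.append(next_token)
        if next_token == eos_id:
            break
    return decoder_ids

tokens = generate([11, 12, 13], max_length=8)
print(encoder.calls, decoder.calls, len(tokens) - 1)
```

The encoder count stays at one regardless of output length, while the decoder count equals the number of generated tokens.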

Related Pages

Implemented By
