Principle:Romsto Speculative Decoding Encoder Decoder Autoregressive Generation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language_Models, Inference, Encoder_Decoder |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Encoder-decoder autoregressive generation is the standard sequential text generation method for encoder-decoder transformer models: the encoder processes the input once, and the decoder then generates tokens autoregressively. It serves as the baseline against which encoder-decoder speculative methods are compared.
Description
Encoder-Decoder Autoregressive Generation adapts the canonical autoregressive generation paradigm to seq2seq transformer architectures (e.g., T5, BART, mBART). Unlike decoder-only autoregressive generation, where the prompt and output share the same sequence space, encoder-decoder models separate input processing (encoder) from output generation (decoder). The encoder processes the full input sequence once to produce a set of hidden representations. The decoder then generates tokens one at a time, attending both to its own previously generated tokens (via self-attention) and to the encoder representations (via cross-attention).
The decoder sequence is initialized with a special decoder_start_token_id from the model configuration. At each step, the model receives the fixed encoder input and the growing decoder prefix, computes a probability distribution over the vocabulary at the last decoder position, and samples from it using the chosen sampling strategy.
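A single decoding step as described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function names, the temperature-only sampling strategy, the 4-token vocabulary, and the start token id of 0 are all assumptions made for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_next_token(last_position_logits, temperature=1.0, rng=None):
    """Sample one token id from the distribution at the last decoder position."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(last_position_logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical values: a 4-token vocabulary and a start token id of 0.
decoder_start_token_id = 0
decoder_ids = [decoder_start_token_id]
last_position_logits = [0.1, 2.0, -1.0, 0.5]
decoder_ids.append(sample_next_token(last_position_logits, rng=random.Random(0)))
```

A lower temperature concentrates probability on the highest logit, so the step approaches greedy selection; other strategies such as top-k or nucleus sampling would replace `softmax_with_temperature` in the same position.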
In this repository, encoder-decoder autoregressive generation serves as the baseline for evaluating the encoder-decoder speculative decoding variant. It is the direct counterpart of the decoder-only autoregressive baseline used for standard speculative decoding comparisons.
Usage
Use this principle when generating output from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) without speculative acceleration. It is the appropriate method when no drafter encoder-decoder model is available, or when establishing baseline throughput for encoder-decoder speculative decoding benchmarks.
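For the benchmarking use case, baseline throughput can be measured with a small timing helper. The sketch below is an assumption about how such a measurement might look, not the repository's benchmark code: `measure_throughput` and `dummy_generate` are hypothetical names, and `dummy_generate` merely stands in for a real encoder-decoder generation call.

```python
import time


def measure_throughput(generate_fn, num_runs=3):
    """Return average generated tokens per second over several runs.

    `generate_fn` is a hypothetical zero-argument callable that runs one
    full generation and returns the list of generated token ids.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        output_ids = generate_fn()
        total_seconds += time.perf_counter() - start
        total_tokens += len(output_ids)
    return total_tokens / total_seconds if total_seconds > 0 else float("inf")


def dummy_generate():
    # Stand-in for a real encoder-decoder generation call.
    time.sleep(0.001)
    return [0, 5, 7, 2]


baseline_tokens_per_second = measure_throughput(dummy_generate)
```

Running the same helper on a speculative generation function with identical inputs would give a comparable tokens-per-second figure.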
Theoretical Basis
Given encoder input $x$, the encoder produces hidden states $h = \mathrm{Encoder}(x)$. The decoder generates tokens sequentially:
$$y_t \sim P(y_t \mid y_0, y_1, \dots, y_{t-1}, h; \theta)$$
Where $y_0$ is the decoder start token, $\theta$ are the model parameters, and $P$ is the output distribution after sampling strategy application.
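Chaining the per-step distributions gives the probability of a whole output sequence. The notation is assumed here for illustration: $x$ is the encoder input, $h$ the encoder hidden states, $y_0$ the decoder start token, $T$ the number of generated tokens, and $\theta$ the model parameters.

```latex
P(y_1, \dots, y_T \mid x)
  = \prod_{t=1}^{T} P\bigl(y_t \mid y_0, y_1, \dots, y_{t-1},\, h;\, \theta\bigr),
\qquad h = \mathrm{Encoder}(x)
```

Every factor conditions on the same $h$, which is why the encoder needs to run only once. Generation ends when $y_T$ is the end-of-sequence token or the maximum length is reached.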
Pseudo-code:
# Abstract encoder-decoder autoregressive generation
encoder_hidden = encoder(input_ids)
decoder_ids = [decoder_start_token_id]
for position in range(1, max_length):
    logits = decoder(decoder_ids, encoder_hidden)[-1]  # last position
    probs = sampling_strategy(logits)
    next_token = sample(probs)
    decoder_ids.append(next_token)
    if next_token == eos_token:
        break
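The pseudo-code above can be made concrete with toy stand-ins for the model. In this sketch, `toy_encoder`, `toy_decoder`, `call_counts`, and all default values are hypothetical and exist only so the loop runs end to end. Greedy selection (argmax) is used in place of a stochastic sampling strategy to keep the output deterministic.

```python
# Toy stand-ins; a real model would replace both functions.
call_counts = {"encoder": 0, "decoder": 0}


def toy_encoder(input_ids):
    """Hypothetical encoder: one float 'hidden state' per input token."""
    call_counts["encoder"] += 1
    return [float(token) for token in input_ids]


def toy_decoder(decoder_ids, encoder_hidden, vocab_size):
    """Hypothetical decoder: logits for the last decoder position only."""
    call_counts["decoder"] += 1
    favoured = (int(sum(encoder_hidden)) + 3 * len(decoder_ids)) % vocab_size
    return [2.0 if token == favoured else 0.0 for token in range(vocab_size)]


def generate(input_ids, decoder_start_token_id=0, eos_token_id=1,
             vocab_size=8, max_length=16):
    encoder_hidden = toy_encoder(input_ids)  # single encoder pass
    decoder_ids = [decoder_start_token_id]
    for _ in range(1, max_length):
        logits = toy_decoder(decoder_ids, encoder_hidden, vocab_size)
        # Greedy selection (argmax) used here as the sampling strategy.
        next_token = max(range(vocab_size), key=lambda token: logits[token])
        decoder_ids.append(next_token)
        if next_token == eos_token_id:
            break
    return decoder_ids


output_ids = generate([3, 4])
```

After the call, `call_counts` records one encoder call and one decoder call per generated token, which mirrors the structure of the loop: the encoder output is computed before the loop and passed unchanged into every decoder step.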
Key difference from decoder-only: The encoder forward pass happens once and its representations are reused at every decoder step via cross-attention. This means the encoder cost is amortized, and the per-token cost is dominated by the decoder self-attention and cross-attention computations.
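The amortization argument can be written as a rough cost decomposition. The symbols below are introduced for illustration and do not come from the repository: $n$ is the input length, $T$ the number of generated tokens, $C_{\text{enc}}(n)$ the cost of the single encoder pass, and $C_{\text{dec}}(t, n)$ the cost of decoder step $t$, covering self-attention over the decoder prefix and cross-attention over the $n$ encoder states.

```latex
C_{\text{total}}(n, T) = C_{\text{enc}}(n) + \sum_{t=1}^{T} C_{\text{dec}}(t, n),
\qquad
\frac{C_{\text{total}}(n, T)}{T}
  = \frac{C_{\text{enc}}(n)}{T} + \frac{1}{T} \sum_{t=1}^{T} C_{\text{dec}}(t, n)
```

As $T$ grows, the encoder's share of the per-token cost shrinks toward zero while the decoder term remains.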