Principle:Romsto Speculative Decoding Encoder Decoder Speculative Decoding
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference_Optimization, Language_Models, Encoder_Decoder |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
An inference acceleration technique that adapts speculative decoding for encoder-decoder transformer models, using a smaller drafter encoder-decoder model to propose decoder tokens that are verified in parallel by the larger target model.
Description
Encoder-Decoder Speculative Decoding extends the speculative decoding paradigm from decoder-only to encoder-decoder (seq2seq) architectures such as T5, BART, and mBART. In this setting, both the drafter and the target model share the same encoder-decoder structure. The encoder processes the input once on each model, and all speculative drafting and verification happens on the decoder side.
The core algorithm remains the same as decoder-only speculative decoding: the drafter generates gamma candidate decoder tokens autoregressively, the target verifies them in a single forward pass, and rejection sampling based on the probability ratio p/q determines how many tokens to accept. When a rejection occurs, the replacement token is sampled from the normalized positive part of (p - q) via the max_fn operation, preserving the target model's exact output distribution.
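The accept/reject rule and the max_fn correction described above can be sketched for a single drafted token as follows. This is a minimal illustration in plain Python, not the implementation itself: `verify_token` is a hypothetical helper name, and the distributions `p` and `q` are toy values chosen for the example.

```python
import random

def max_fn(p, q):
    """Normalized positive part of (p - q): norm(max(0, p - q))."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p == q everywhere, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]

def verify_token(token, p, q, rng):
    """Hypothetical helper: return (accepted, token_to_emit) for one draft."""
    # Accept with probability min(1, p(token) / q(token)).
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    # On rejection, resample from the residual distribution.
    residual = max_fn(p, q)
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, replacement

p = [0.5, 0.3, 0.2]  # target distribution (toy values)
q = [0.2, 0.3, 0.5]  # drafter distribution (toy values)
residual = max_fn(p, q)
```

Tokens that the target rates at least as likely as the drafter does (ratio of 1 or more) are always accepted; the residual only places mass on tokens where the target assigns more probability than the drafter.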
The key architectural difference is that both input_ids (encoder input) and decoder_input_ids (decoder prefix) must be passed to each model at every step. The encoder representations are fixed throughout generation, and the KV-cache (when enabled) covers both the encoder cross-attention and decoder self-attention layers.
This implementation also supports an optional first_target prefill step that runs the target model once before the speculative loop to generate an initial decoder token and warm the KV-cache.
Usage
Use this principle when accelerating inference from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) and a compatible smaller encoder-decoder model is available as a drafter. Both models must share the same vocabulary size, and in practice the same tokenizer, because the acceptance test compares p and q for the same token id. The drafter and target should be from the same model family (e.g., T5-small drafting for T5-large) for high acceptance rates. N-gram-based drafting is not supported in this variant.
Theoretical Basis
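Given encoder input $x$, both models compute encoder representations once: $h_q = \mathrm{Enc}_q(x)$ for the drafter and $h_p = \mathrm{Enc}_p(x)$ for the target.
The speculative loop operates on the decoder side, starting from the current decoder prefix $y$:
- The drafter decoder generates gamma ($\gamma$) tokens: for each position $k = 0, \dots, \gamma - 1$, sample $\tilde{y}_k \sim q_k(\cdot) = q(\cdot \mid y, \tilde{y}_{<k}, h_q)$
- The target decoder evaluates all gamma positions, plus the one after the last draft, in one forward pass: $p_k(\cdot) = p(\cdot \mid y, \tilde{y}_{<k}, h_p)$ for each $k = 0, \dots, \gamma$
- Rejection sampling proceeds identically to decoder-only: accept token $i$ if $r_i \le \min\left(1, p_i(\tilde{y}_i) / q_i(\tilde{y}_i)\right)$, with $r_i \sim U(0, 1)$
- On rejection at position $n$, sample from $\mathrm{norm}(\max(0, p_n - q_n))$
- On full acceptance, sample a bonus token from $p_\gamma$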
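A short check of why the accept-or-correct step leaves the target distribution unchanged may help here. The notation is an assumption made for this sketch: $p$ and $q$ are the target and drafter next-token distributions at a single position, and $\beta$ is the probability that a drafted token is accepted.

$$
\begin{aligned}
\beta &= \sum_{x'} q(x') \min\left(1, \frac{p(x')}{q(x')}\right) = \sum_{x'} \min\big(p(x'), q(x')\big), \\
1 - \beta &= \sum_{x'} \Big(p(x') - \min\big(p(x'), q(x')\big)\Big) = \sum_{x'} \max\big(0, p(x') - q(x')\big), \\
P(\text{emit } x) &= \min\big(p(x), q(x)\big) + (1 - \beta)\, \frac{\max\big(0, p(x) - q(x)\big)}{\sum_{x'} \max\big(0, p(x') - q(x')\big)} \\
&= \min\big(p(x), q(x)\big) + \max\big(0, p(x) - q(x)\big) = p(x).
\end{aligned}
$$

The first term covers the case where $x$ is drafted and accepted, the second the case where some draft is rejected and $x$ is drawn from the residual. Nothing in the argument depends on how $p$ and $q$ were conditioned, which is why the encoder-decoder setting can reuse the decoder-only rule unchanged.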
Pseudo-code:
# Abstract encoder-decoder speculative decoding (positions are 0-indexed)
h_drafter = drafter.encoder(input_ids)  # encoder runs once per model
h_target = target.encoder(input_ids)
decoder_ids = [decoder_start_token_id]
while not done:
    # Step 1: Draft gamma decoder tokens with the small model
    drafts, q = [], []
    for k in range(gamma):
        # Next-token distribution given the prefix plus the drafts so far
        q_k = drafter.decoder(decoder_ids + drafts, h_drafter)[-1]
        q.append(q_k)
        drafts.append(sample(q_k))
    # Step 2: Verify all drafts with the target decoder (single forward pass)
    # Keep the last gamma + 1 distributions: p[i] scores draft i, p[gamma] is the bonus position
    p = target.decoder(decoder_ids + drafts, h_target)[-(gamma + 1):]
    # Step 3: Rejection sampling (identical to decoder-only)
    n = gamma
    for i in range(gamma):
        if uniform(0, 1) > p[i][drafts[i]] / q[i][drafts[i]]:
            n = i
            break
    # Step 4: Accept the prefix, then sample a correction or a bonus token
    decoder_ids.extend(drafts[:n])
    if n < gamma:
        x = sample(norm(max(0, p[n] - q[n])))
    else:
        x = sample(p[gamma])
    decoder_ids.append(x)
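For readers who want something executable, the loop can be sketched with toy stand-ins for the two models. Everything here is an assumption made for illustration: `make_toy_model`, `speculative_generate`, the vocabulary size, and the start token are hypothetical, the "encoder" is a trivial function, and end-of-sequence handling and the KV-cache are omitted.

```python
import random

VOCAB = 5   # toy vocabulary size (assumption)
START = 0   # toy decoder_start_token_id (assumption)

def make_toy_model(seed):
    """Return (encoder, decoder) functions standing in for a seq2seq model."""
    def encoder(input_ids):
        return sum(input_ids) + seed  # stand-in for encoder representations
    def decoder(decoder_ids, h):
        # Deterministic next-token distribution for a given (h, prefix) pair.
        rng = random.Random(f"{h}:{decoder_ids}")
        w = [rng.random() + 0.1 for _ in range(VOCAB)]
        s = sum(w)
        return [x / s for x in w]
    return encoder, decoder

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def residual(p, q):
    """norm(max(0, p - q)), falling back to p if the difference is all zero."""
    diff = [max(0.0, a - b) for a, b in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_generate(input_ids, drafter, target, gamma=4, max_len=20, seed=0):
    rng = random.Random(seed)
    d_enc, d_dec = drafter
    t_enc, t_dec = target
    h_drafter = d_enc(input_ids)  # each encoder runs exactly once
    h_target = t_enc(input_ids)
    decoder_ids = [START]
    accepted_total = 0
    while len(decoder_ids) < max_len:
        # Step 1: draft gamma tokens autoregressively.
        drafts, q = [], []
        for _ in range(gamma):
            q_k = d_dec(decoder_ids + drafts, h_drafter)
            q.append(q_k)
            drafts.append(sample(q_k, rng))
        # Step 2: a real target returns these gamma + 1 distributions from one
        # forward pass; the toy decoder is called per prefix to emulate that.
        p = [t_dec(decoder_ids + drafts[:i], h_target) for i in range(gamma + 1)]
        # Step 3: rejection sampling.
        n = gamma
        for i in range(gamma):
            if rng.random() > p[i][drafts[i]] / q[i][drafts[i]]:
                n = i
                break
        # Step 4: accept the prefix, then add a correction or bonus token.
        decoder_ids.extend(drafts[:n])
        if n < gamma:
            x = sample(residual(p[n], q[n]), rng)
        else:
            x = sample(p[gamma], rng)
        decoder_ids.append(x)
        accepted_total += n
    return decoder_ids[:max_len], accepted_total

drafter = make_toy_model(seed=1)
target = make_toy_model(seed=2)
tokens, n_accepted = speculative_generate([3, 1, 4], drafter, target)
```

Passing the same toy model as both drafter and target makes every ratio equal to 1, so every draft is accepted and each round emits gamma + 1 tokens. That is a convenient sanity check for any implementation of the loop.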
The expected number of tokens per round follows the same formula as decoder-only: $\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}$, where $\alpha$ is the average acceptance rate.
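To get a feel for the numbers, the closed form from the standard speculative decoding analysis, (1 - alpha^(gamma + 1)) / (1 - alpha), can be evaluated directly. The function name below is hypothetical, and the formula assumes a constant, independent per-token acceptance rate alpha.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per speculative round.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum: all gamma drafts plus the bonus token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Compare a few draft lengths at a fixed acceptance rate.
for gamma in (2, 4, 8):
    print(gamma, round(expected_tokens_per_round(0.8, gamma), 4))
```

Because the gain saturates at 1 / (1 - alpha) as gamma grows, increasing gamma beyond a certain point mostly adds drafter cost without adding accepted tokens.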