Principle:Romsto Speculative Decoding Encoder Decoder Speculative Decoding

Knowledge Sources	Fast Inference from Transformers via Speculative Decoding; Accelerating Large Language Model Decoding with Speculative Sampling; Speculative Decoding
Domains	NLP, Inference_Optimization, Language_Models, Encoder_Decoder
Last Updated	2026-02-14 05:00 GMT

Overview

An inference acceleration technique that adapts speculative decoding for encoder-decoder transformer models, using a smaller drafter encoder-decoder model to propose decoder tokens that are verified in parallel by the larger target model.

Description

Encoder-Decoder Speculative Decoding extends the speculative decoding paradigm from decoder-only to encoder-decoder (seq2seq) architectures such as T5, BART, and mBART. In this setting, both the drafter and the target model share the same encoder-decoder structure. The encoder processes the input once on each model, and all speculative drafting and verification happens on the decoder side.

The core algorithm remains the same as decoder-only speculative decoding: the drafter generates gamma candidate decoder tokens autoregressively, the target verifies them in a single forward pass, and rejection sampling based on the probability ratio p/q determines how many tokens to accept. When a rejection occurs, the replacement token is sampled from the normalized positive part of (p - q) via the max_fn operation, preserving the target model's exact output distribution.
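
The accept-or-resample rule for a single drafted token can be sketched as follows. This is a minimal illustration under assumptions, not the repository's code: distributions are plain Python lists, and the helper name accept_or_resample and its rng parameter are hypothetical. Only max_fn is named in the text above, and its body here is an assumed reading of "normalized positive part of (p - q)".

```python
import random

def max_fn(p, q):
    """Normalized positive part of (p - q), i.e. norm(max(0, p - q))."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p == q everywhere, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]

def accept_or_resample(token, p, q, rng=random):
    """Decide the fate of one drafted token (hypothetical helper).

    p and q are the target and drafter distributions at this position;
    token was sampled from q, so q[token] > 0.
    Returns (accepted, emitted_token).
    """
    r = rng.random()  # r ~ U[0, 1)
    if r < min(1.0, p[token] / q[token]):
        return True, token
    # Rejected: draw the replacement from the residual distribution.
    residual = max_fn(p, q)
    return False, rng.choices(range(len(p)), weights=residual, k=1)[0]

p = [0.5, 0.3, 0.2]  # target distribution (toy values)
q = [0.2, 0.6, 0.2]  # drafter distribution (toy values)
print(max_fn(p, q))  # [1.0, 0.0, 0.0]
```

With these toy values, token 0 is always accepted because p/q exceeds 1, while token 1 is accepted about half the time and is otherwise replaced by token 0, the only token where the target assigns more mass than the drafter.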

The key architectural difference is that both input_ids (encoder input) and decoder_input_ids (decoder prefix) must be passed to each model at every step. The encoder representations are fixed throughout generation, and the KV-cache (when enabled) covers both the encoder cross-attention and decoder self-attention layers.

This implementation also supports an optional first_target prefill step that runs the target model once before the speculative loop to generate an initial decoder token and warm the KV-cache.

Usage

Use this principle when accelerating inference from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) and a compatible smaller encoder-decoder model is available as a drafter. Both models must share the same vocabulary size, and in practice the same tokenizer, because verification compares the target and drafter probabilities token by token. The drafter and target should come from the same model family (e.g., T5-small drafting for T5-large) to achieve high acceptance rates. N-gram-based drafting is not supported in this variant.

Theoretical Basis

Given encoder input $x_{1}, \dots, x_{S}$, both models compute encoder representations: $H_{drafter} = \mathrm{Encoder}_{drafter}(x)$ and $H_{target} = \mathrm{Encoder}_{target}(x)$.

The speculative loop operates on the decoder side:

The drafter decoder generates $\gamma$ tokens: for each position $k = 1, \dots, \gamma$, sample $y_{k} \sim q (\cdot | y_{< k}, H_{drafter})$
The target decoder evaluates all $\gamma$ positions in one forward pass: $p (\cdot | y_{< k}, H_{target})$ for each k
Rejection sampling proceeds identically to decoder-only: accept token i if $r < \frac{p (y_{i} | y_{< i}, H_{target})}{q (y_{i} | y_{< i}, H_{drafter})}$, where $r \sim U (0, 1)$ is drawn independently for each position
On rejection at position n, discard the drafts from position n onward and sample the replacement from $norm (\max (0, p_{n} - q_{n}))$, where $p_{n}$ and $q_{n}$ are the target and drafter distributions at position n
On full acceptance, sample a bonus token from $p (\cdot | y_{< \gamma + 1}, H_{target})$
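
Why the accept-or-resample rule preserves the target distribution can be checked for a single position with a short derivation. The symbol $\beta$ (the probability that a draft is accepted) is introduced here for convenience and does not appear in the text above; $p$ and $q$ abbreviate the target and drafter distributions at that position.

$$\beta = \sum_{x} q(x) \min\left(1, \frac{p(x)}{q(x)}\right) = \sum_{x} \min(p(x), q(x)), \qquad \sum_{x} \max(0, p(x) - q(x)) = \sum_{x} \left(p(x) - \min(p(x), q(x))\right) = 1 - \beta$$

$$P(\text{emit } x) = \underbrace{q(x) \min\left(1, \frac{p(x)}{q(x)}\right)}_{\text{drafted and accepted}} + \underbrace{(1 - \beta) \frac{\max(0, p(x) - q(x))}{1 - \beta}}_{\text{rejected, then resampled}} = \min(p(x), q(x)) + \max(0, p(x) - q(x)) = p(x)$$

The normalizer of the residual distribution equals the rejection probability $1 - \beta$, so the two cancel and every emitted token is distributed exactly as the target model would have sampled it.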

Pseudo-code:

# Abstract encoder-decoder speculative decoding (pseudo-code)
h_drafter = drafter.encoder(input_ids)   # each encoder runs once
h_target = target.encoder(input_ids)
decoder_ids = [decoder_start_token_id]

while not done:
    prefix_len = len(decoder_ids)

    # Step 1: Draft gamma decoder tokens with small model
    drafts, q = [], []
    for k in range(gamma):
        q_k = drafter.decoder(decoder_ids + drafts, h_drafter)
        q.append(q_k[-1])                # drafter distribution for draft k
        drafts.append(sample(q[k]))

    # Step 2: Verify all drafts with target decoder (single forward pass)
    p_all = target.decoder(decoder_ids + drafts, h_target)
    # Align with the drafts: p[i] scores drafts[i], p[gamma] is the bonus position
    p = p_all[prefix_len - 1:]

    # Step 3: Rejection sampling (identical to decoder-only)
    n = gamma
    for i in range(gamma):
        if uniform(0, 1) > p[i, drafts[i]] / q[i, drafts[i]]:
            n = i
            break

    # Step 4: Accept prefix, then sample a correction or bonus token
    if n < gamma:
        x = sample(norm(max(0, p[n] - q[n])))
    else:
        x = sample(p[gamma])
    decoder_ids = decoder_ids + drafts[:n] + [x]
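
For readers who want to run the loop end to end, the following is a self-contained toy sketch. Everything in it is an assumption made for illustration: ToySeq2Seq is a hypothetical stand-in whose "encoder" returns a single integer and whose "decoder" returns hand-built distributions, and VOCAB, START, sharpness, and speculative_generate are illustrative names rather than the repository's API. End-of-sequence handling and KV-caching are omitted.

```python
import random

VOCAB = 6   # toy vocabulary size (assumption)
START = 0   # toy decoder_start_token_id (assumption)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

class ToySeq2Seq:
    """Hypothetical stand-in for an encoder-decoder model (not a real API)."""

    def __init__(self, sharpness):
        self.sharpness = sharpness

    def encoder(self, input_ids):
        # Toy "encoder state": one integer summarizing the input.
        return sum(input_ids) % VOCAB

    def decoder(self, decoder_ids, h):
        # Returns one next-token distribution per decoder position.
        dists = []
        for prev in decoder_ids:
            weights = [1.0] * VOCAB
            weights[(prev + h + 1) % VOCAB] += self.sharpness
            dists.append(normalize(weights))
        return dists

def speculative_generate(input_ids, drafter, target, gamma=4, max_len=20, seed=0):
    rng = random.Random(seed)
    h_drafter = drafter.encoder(input_ids)  # encoders run once, outside the loop
    h_target = target.encoder(input_ids)
    decoder_ids = [START]
    drafted = accepted = 0
    while len(decoder_ids) < max_len:
        prefix_len = len(decoder_ids)
        # Step 1: draft gamma tokens, keeping the drafter distribution for each.
        drafts, q = [], []
        for _ in range(gamma):
            q_k = drafter.decoder(decoder_ids + drafts, h_drafter)[-1]
            q.append(q_k)
            drafts.append(sample(q_k, rng))
        # Step 2: one target call; p[i] scores drafts[i], p[gamma] is the bonus slot.
        p = target.decoder(decoder_ids + drafts, h_target)[prefix_len - 1:]
        # Step 3: accept drafts left to right until the first rejection.
        n = gamma
        for i, tok in enumerate(drafts):
            if rng.random() >= min(1.0, p[i][tok] / q[i][tok]):
                n = i
                break
        # Step 4: correction token on rejection, bonus token on full acceptance.
        if n < gamma:
            residual = normalize([max(0.0, a - b) for a, b in zip(p[n], q[n])])
            x = sample(residual, rng)
        else:
            x = sample(p[gamma], rng)
        decoder_ids = decoder_ids + drafts[:n] + [x]
        drafted += gamma
        accepted += n
    return decoder_ids[:max_len], accepted / max(drafted, 1)

ids, rate = speculative_generate([3, 1, 4], ToySeq2Seq(2.0), ToySeq2Seq(8.0))
print(len(ids), round(rate, 2))
```

Each round emits between 1 and gamma + 1 tokens, so the loop always makes progress. When the drafter and target are the same toy model, every ratio p/q is exactly 1 and all drafts are accepted.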

The expected tokens per round follow the same formula as decoder-only: $E [tokens] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}$ where $\alpha$ is the average acceptance rate.
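
The formula is the geometric sum $1 + \alpha + \dots + \alpha^{\gamma}$, which assumes each draft is accepted independently with the same probability $\alpha$. A small helper makes it easy to tabulate; the function name expected_tokens is hypothetical, and the $\alpha = 1$ case is handled separately because the closed form divides by zero there.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per speculative round.

    alpha: average acceptance rate in [0, 1]
    gamma: number of drafted tokens per round
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        # Limit of the closed form: all gamma drafts plus the bonus token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, [round(expected_tokens(alpha, g), 2) for g in (2, 4, 8)])
```

For example, with $\alpha = 0.8$ and $\gamma = 4$ the formula gives about 3.36 tokens per target forward pass, compared with 1 token per pass for plain autoregressive decoding.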

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_Speculative_Generate_Encoder_Decoder
