
Principle:Romsto Speculative Decoding Encoder Decoder Speculative Decoding

From Leeroopedia
Knowledge Sources
Domains NLP, Inference_Optimization, Language_Models, Encoder_Decoder
Last Updated 2026-02-14 05:00 GMT

Overview

An inference acceleration technique that adapts speculative decoding for encoder-decoder transformer models, using a smaller drafter encoder-decoder model to propose decoder tokens that are verified in parallel by the larger target model.

Description

Encoder-Decoder Speculative Decoding extends the speculative decoding paradigm from decoder-only to encoder-decoder (seq2seq) architectures such as T5, BART, and mBART. In this setting, both the drafter and the target model share the same encoder-decoder structure. Each model runs its own encoder once over the input, and all speculative drafting and verification happens on the decoder side.

The core algorithm remains the same as decoder-only speculative decoding: the drafter generates gamma candidate decoder tokens autoregressively, the target verifies them in a single forward pass, and rejection sampling determines how many tokens to accept. Each draft token is accepted with probability min(1, p/q), where p and q are the probabilities that the target and the drafter assign to that token. When a rejection occurs, the replacement token is sampled from the normalized positive part of (p - q) via the max_fn operation, which preserves the target model's exact output distribution.
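
A minimal sketch of the two operations named above, assuming each distribution is a plain Python list of probabilities over a shared vocabulary. The name `max_fn` follows the text; `accept_draft` and the fallback for an empty residual are illustrative assumptions, not the repository's API.

```python
import random

def max_fn(p, q):
    """Return norm(max(0, p - q)) for two distributions over the same vocabulary."""
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p equals q everywhere, so there is no residual mass.
        # Falling back to p is an assumption made for this sketch.
        return list(p)
    return [r / total for r in residual]

def accept_draft(p, q, token, rng=random):
    """Accept a token drafted from q with probability min(1, p[token] / q[token])."""
    return rng.random() < p[token] / q[token]

# Toy distributions over a three-token vocabulary (made-up numbers).
p = [0.5, 0.3, 0.2]   # target
q = [0.2, 0.6, 0.2]   # drafter
correction_dist = max_fn(p, q)
```

Here the target puts more mass than the drafter only on token 0, so the correction distribution concentrates entirely on that token, and a draft of token 0 is always accepted because its ratio p/q exceeds 1.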

The key architectural difference is that both input_ids (encoder input) and decoder_input_ids (decoder prefix) must be passed to each model at every step. The encoder representations are fixed throughout generation, and the KV-cache (when enabled) covers both the encoder cross-attention and decoder self-attention layers.

This implementation also supports an optional first_target prefill step that runs the target model once before the speculative loop to generate an initial decoder token and warm the KV-cache.

Usage

Use this principle when accelerating inference from encoder-decoder models (translation, summarization, question answering with seq2seq architectures) and a compatible smaller encoder-decoder model is available as a drafter. Both models must share the same vocabulary size. The drafter and target should be from the same model family (e.g., T5-small drafting for T5-large) for high acceptance rates. N-gram-based drafting is not supported in this variant.

Theoretical Basis

Given encoder input $x_1, \dots, x_S$, both models compute encoder representations: $H_{\text{drafter}} = \text{Encoder}_{\text{drafter}}(x)$ and $H_{\text{target}} = \text{Encoder}_{\text{target}}(x)$.

The speculative loop operates on the decoder side:

  1. The drafter decoder generates gamma tokens: for each position k, sample $y_k \sim q(\cdot \mid y_{<k}, H_{\text{drafter}})$
  2. The target decoder evaluates all gamma positions in one forward pass: $p(\cdot \mid y_{<k}, H_{\text{target}})$ for each k
  3. Rejection sampling proceeds identically to decoder-only: accept token i if $r < \frac{p(y_i \mid y_{<i}, H_{\text{target}})}{q(y_i \mid y_{<i}, H_{\text{drafter}})}$, where $r \sim U(0, 1)$
  4. On rejection at position n, sample from $\text{norm}(\max(0, p_n - q_n))$
  5. On full acceptance, sample a bonus token from $p(\cdot \mid y_{<\gamma+1}, H_{\text{target}})$
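
That steps 3 and 4 together reproduce the target distribution can be checked in a few lines. The derivation below is the standard argument for speculative sampling, written for a single position with the conditioning on the prefix and encoder states left implicit; $p$ and $q$ denote the target and drafter distributions at that position.

```latex
\begin{align*}
\Pr[\text{emit } x]
  &= \underbrace{q(x)\,\min\!\Big(1, \frac{p(x)}{q(x)}\Big)}_{\text{draft } x \text{ is accepted}}
   + \underbrace{\Big(1 - \sum_{y} \min\big(p(y), q(y)\big)\Big)}_{\text{the draft is rejected}}
     \cdot \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{y} \max\big(0,\, p(y) - q(y)\big)} \\
  &= \min\big(p(x), q(x)\big) + \max\big(0,\, p(x) - q(x)\big) \\
  &= p(x),
\end{align*}
% using the identity \sum_y \max(0, p(y) - q(y)) = 1 - \sum_y \min(p(y), q(y)).
```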

Pseudo-code:

# Abstract encoder-decoder speculative decoding (pseudo-code)
h_drafter = drafter.encoder(input_ids)   # each encoder runs once
h_target = target.encoder(input_ids)
decoder_ids = [decoder_start_token_id]

while not done:
    prefix_len = len(decoder_ids)
    q, drafts = [], []

    # Step 1: Draft gamma decoder tokens with small model
    for k in range(gamma):
        q_k = drafter.decoder(decoder_ids, h_drafter)[-1]  # next-token distribution
        q.append(q_k)
        drafts.append(sample(q_k))
        decoder_ids.append(drafts[k])

    # Step 2: Verify all drafts with target decoder (single forward pass)
    # Keep the gamma + 1 rows that predict drafts[0..gamma-1] and the bonus token
    p = target.decoder(decoder_ids, h_target)[prefix_len - 1:]

    # Step 3: Rejection sampling (identical to decoder-only)
    n = gamma
    for i in range(gamma):
        if uniform(0, 1) > p[i][drafts[i]] / q[i][drafts[i]]:
            n = i
            break

    # Step 4: Keep the accepted prefix, then append the correction or bonus token
    decoder_ids = decoder_ids[:prefix_len + n]
    if n < gamma:
        x = sample(norm(max(0, p[n] - q[n])))
    else:
        x = sample(p[gamma])
    decoder_ids.append(x)
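
The pseudo-code above can be exercised end to end with toy stand-ins for the two models. The sketch below is illustrative only: `toy_encoder`, `toy_decoder`, `speculative_round`, the vocabulary size, and the start token `0` are all hypothetical and do not correspond to any real model or library API. It shows the encoder running once per model while only the decoder prefix grows.

```python
import random

VOCAB = 4  # toy vocabulary size shared by both models (assumption)

def toy_encoder(input_ids, scale):
    # Stand-in for an encoder: a single integer summarizing the input.
    return scale * sum(input_ids)

def toy_decoder(decoder_ids, h):
    # Stand-in for a decoder: one next-token distribution per position of
    # decoder_ids, conditioned on the prefix up to that position and on h.
    dists = []
    for t in range(1, len(decoder_ids) + 1):
        seed = sum(decoder_ids[:t]) + h
        weights = [1.0 + ((seed + v) % VOCAB) for v in range(VOCAB)]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist)[0]

def speculative_round(decoder_ids, h_drafter, h_target, gamma, rng):
    prefix_len = len(decoder_ids)
    ids = list(decoder_ids)
    q, drafts = [], []
    # Step 1: draft gamma tokens autoregressively with the drafter.
    for _ in range(gamma):
        q_k = toy_decoder(ids, h_drafter)[-1]
        tok = sample(q_k, rng)
        q.append(q_k)
        drafts.append(tok)
        ids.append(tok)
    # Step 2: one target pass; keep the gamma + 1 rows that predict
    # drafts[0..gamma-1] and the bonus token.
    p = toy_decoder(ids, h_target)[prefix_len - 1:]
    # Step 3: accept drafts until the first rejection.
    n = gamma
    for i, tok in enumerate(drafts):
        if rng.random() >= p[i][tok] / q[i][tok]:
            n = i
            break
    # Step 4: correction token on rejection, bonus token on full acceptance.
    if n < gamma:
        residual = [max(0.0, a - b) for a, b in zip(p[n], q[n])]
        total = sum(residual)
        dist = [r / total for r in residual]
    else:
        dist = p[gamma]
    return list(decoder_ids) + drafts[:n] + [sample(dist, rng)]

rng = random.Random(0)
input_ids = [3, 1, 2]
h_drafter = toy_encoder(input_ids, scale=1)  # encoder output computed once
h_target = toy_encoder(input_ids, scale=2)   # and reused in every round
decoder_ids = [0]                            # toy decoder start token
while len(decoder_ids) < 12:
    decoder_ids = speculative_round(decoder_ids, h_drafter, h_target, gamma=3, rng=rng)
```

Each round appends between 1 and gamma + 1 tokens, so the loop can overshoot the length limit by up to gamma tokens; a real implementation would also stop at an end-of-sequence token.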

The expected tokens per round follow the same formula as decoder-only: $E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$, where $\alpha$ is the average acceptance rate.
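
As a quick way to read the formula, the helper below evaluates it for a given acceptance rate and draft length. It is a small sketch with a hypothetical function name, and it relies on the usual simplifying assumption that each draft is accepted independently with probability alpha.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per speculative round: (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Limit of the formula as alpha -> 1: every draft plus the bonus token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Example: an 80% acceptance rate with 4 drafts per round.
tokens_per_round = expected_tokens(0.8, 4)
```

With alpha = 0.8 and gamma = 4 the formula gives about 3.36 tokens per target forward pass, compared with exactly 1 for plain autoregressive decoding. Raising gamma has diminishing returns because the value is bounded above by 1 / (1 - alpha).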

Related Pages

Implemented By

Uses Heuristic
