Principle:Romsto Speculative Decoding Ngram Assisted Speculative Decoding

Knowledge Sources	Fast Inference from Transformers via Speculative Decoding Speculative Decoding
Domains	NLP, Inference_Optimization, Language_Models
Last Updated	2026-02-14 04:30 GMT

Overview

An inference acceleration technique that replaces the neural drafter model in speculative decoding with an adaptive n-gram storage, eliminating the need for a second model while using greedy matching instead of rejection sampling for draft verification.

Description

N-gram Assisted Speculative Decoding (NASD) is an adaptation of the standard speculative decoding algorithm that replaces the small neural drafter model with a simple n-gram lookup table. This has two major advantages: (1) no second model needs to be loaded into GPU memory, and (2) draft generation is essentially free (dictionary lookup vs. neural network forward pass).

The n-gram storage is built dynamically during generation. It is first initialized with all n-grams found in the input prompt, then updated after each verification round with the accepted tokens and optionally the top-k tokens from the target model's distribution. Over time, the storage learns patterns specific to the current generation context.

A key difference from standard speculative decoding is the verification mechanism. Instead of rejection sampling (which requires computing probability ratios), NASD uses greedy matching: it samples a token from the target model's distribution at each draft position and accepts the draft only if the sampled token exactly matches the draft. This is simpler but does not preserve the theoretical distribution-matching guarantee of standard speculative decoding.

Usage

Use this principle when you want to accelerate LLM inference but cannot afford the memory overhead of loading a second drafter model, or when a suitable small drafter model is not available. NASD is especially effective for prompts and tasks where the model produces repetitive or predictable patterns (e.g., structured output, code generation) since the n-gram storage will accumulate useful predictions. It works best with greedy or low-temperature sampling where the target model's outputs are more deterministic.

Theoretical Basis

The NASD algorithm proceeds in rounds:

Initialize the n-gram storage with all n-grams from the input prompt
Draft gamma tokens by looking up the most frequent next token for the current context in the n-gram storage
Verify by running the target model on all drafted positions in a single forward pass
Accept a prefix of drafts where the target model's sampled token matches the draft exactly
Update the n-gram storage with accepted tokens and optionally top-k alternatives from the target

Pseudo-code:

# Abstract NASD algorithm
ngram_storage.initialize(prompt_tokens)

while not done:
    # Step 1: Draft using n-gram lookup (near-zero cost)
    for k in range(gamma):
        draft[k], known = ngram_storage.next_token(prefix + draft[:k])
        if not known and stop_if_unknown:
            break

    # Step 2: Verify with target model (single forward pass)
    target_probs = target(prefix + draft[:gamma])

    # Step 3: Greedy matching (not rejection sampling)
    n = gamma
    for i in range(gamma):
        sampled = sample(target_probs[i])
        if draft[i] != sampled:
            n = i
            break

    # Step 4: Accept prefix, get bonus token
    accept(draft[:n])
    x = sample(target_probs[n])
    append(x)

    # Step 5: Update n-gram storage
    for each accepted token:
        ngram_storage.update(context, token)
        ngram_storage.update(context, top_k_from_target)  # enrichment

The filler_top_k parameter controls n-gram enrichment: when set to k > 1, the top-k tokens from the target model's distribution are also recorded in storage (not just the accepted token). This helps the storage learn faster but uses slightly more memory.

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_Ngram_Assisted_Speculative_Generate

Uses Heuristic

Heuristic:Romsto_Speculative_Decoding_Filler_Top_K_Tuning

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment