Principle:Romsto Speculative Decoding Ngram Assisted Speculative Decoding
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference_Optimization, Language_Models |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
An inference acceleration technique that replaces the neural drafter model in speculative decoding with an adaptive n-gram storage, eliminating the need for a second model while using greedy matching instead of rejection sampling for draft verification.
Description
N-gram Assisted Speculative Decoding (NASD) is an adaptation of the standard speculative decoding algorithm that replaces the small neural drafter model with a simple n-gram lookup table. This has two major advantages: (1) no second model needs to be loaded into GPU memory, and (2) draft generation is essentially free (dictionary lookup vs. neural network forward pass).
The n-gram storage is built dynamically during generation. It is first initialized with all n-grams found in the input prompt, then updated after each verification round with the accepted tokens and optionally the top-k tokens from the target model's distribution. Over time, the storage learns patterns specific to the current generation context.
A key difference from standard speculative decoding is the verification mechanism. Instead of rejection sampling (which requires computing probability ratios), NASD uses greedy matching: it samples a token from the target model's distribution at each draft position and accepts the draft only if the sampled token exactly matches the draft. This is simpler but does not preserve the theoretical distribution-matching guarantee of standard speculative decoding.
Usage
Use this principle when you want to accelerate LLM inference but cannot afford the memory overhead of loading a second drafter model, or when a suitable small drafter model is not available. NASD is especially effective for prompts and tasks where the model produces repetitive or predictable patterns (e.g., structured output, code generation) since the n-gram storage will accumulate useful predictions. It works best with greedy or low-temperature sampling where the target model's outputs are more deterministic.
Theoretical Basis
The NASD algorithm proceeds in rounds:
- Initialize the n-gram storage with all n-grams from the input prompt
- Draft gamma tokens by looking up the most frequent next token for the current context in the n-gram storage
- Verify by running the target model on all drafted positions in a single forward pass
- Accept a prefix of drafts where the target model's sampled token matches the draft exactly
- Update the n-gram storage with accepted tokens and optionally top-k alternatives from the target
Pseudo-code:
# Abstract NASD algorithm
ngram_storage.initialize(prompt_tokens)
while not done:
# Step 1: Draft using n-gram lookup (near-zero cost)
for k in range(gamma):
draft[k], known = ngram_storage.next_token(prefix + draft[:k])
if not known and stop_if_unknown:
break
# Step 2: Verify with target model (single forward pass)
target_probs = target(prefix + draft[:gamma])
# Step 3: Greedy matching (not rejection sampling)
n = gamma
for i in range(gamma):
sampled = sample(target_probs[i])
if draft[i] != sampled:
n = i
break
# Step 4: Accept prefix, get bonus token
accept(draft[:n])
x = sample(target_probs[n])
append(x)
# Step 5: Update n-gram storage
for each accepted token:
ngram_storage.update(context, token)
ngram_storage.update(context, top_k_from_target) # enrichment
The filler_top_k parameter controls n-gram enrichment: when set to k > 1, the top-k tokens from the target model's distribution are also recorded in storage (not just the accepted token). This helps the storage learn faster but uses slightly more memory.