
Workflow:Romsto Speculative Decoding Ngram Assisted Speculative Decoding

From Leeroopedia
Knowledge Sources
Domains LLMs, Inference_Optimization, Speculative_Decoding, Ngram_Methods
Last Updated 2026-02-14 04:30 GMT

Overview

End-to-end process for accelerating Large Language Model inference using n-gram assisted speculative decoding (NASD), which replaces the drafter neural network with a lightweight n-gram storage for draft generation.

Description

This workflow implements N-gram Assisted Speculative Decoding (NASD), a novel approach that eliminates the need for a second neural network drafter model. Instead, an n-gram storage data structure maintains counts of observed k-grams (for k in [2, N]) and predicts the next token by looking up the most frequent continuation of the current context. Before generation begins, the storage is seeded with all k-grams from the prompt. During generation, it is continuously updated with accepted tokens and top-k target model predictions, allowing it to adapt to the ongoing generation context.

Goal: Produce text output faster than standard autoregressive decoding without requiring a second neural network model.

Scope: Covers dependency installation, target model loading, n-gram storage initialization, input tokenization, sampling configuration, NASD generation with greedy matching verification, and output decoding.

Strategy: Replaces the neural drafter with a count-based n-gram lookup that is essentially free to query, making the drafting step negligible in cost. Verification uses greedy matching (direct comparison of sampled tokens) rather than rejection sampling, since the n-gram storage does not produce a proper probability distribution.

Usage

Execute this workflow when you want to accelerate inference of a single target language model (decoder-only) without loading a second model. This is especially useful when GPU memory is limited and cannot accommodate two models, or when no suitable drafter model exists for the target's model family. NASD is training-free, task-agnostic, and model-agnostic, making it applicable to any decoder-only transformer.

Execution Steps

Step 1: Install Dependencies

Set up the Python environment with the required libraries. The implementation depends on PyTorch for tensor operations and HuggingFace Transformers for model loading. Unlike standard speculative decoding, NASD does not require a second model, so memory requirements are lower.

Key considerations:

  • Requires Python 3.8 or later (PyTorch 2.3 no longer supports Python 3.7)
  • PyTorch version must be 2.3.0 or higher
  • No additional drafter model download is needed
  • All dependencies are listed in the repository's requirements.txt
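The setup can be sketched as follows. The exact package set is defined by the repository's requirements.txt; the two packages below are the minimal assumed core.

```shell
# Minimal environment setup (assumed package set; the repository's
# requirements.txt is authoritative).
python -m venv .venv && source .venv/bin/activate
pip install "torch>=2.3.0" transformers
```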

Step 2: Load Target Model

Load the target language model using HuggingFace's AutoModelForCausalLM. Only one model is needed since the n-gram storage replaces the drafter. Optionally apply int8 quantization to reduce GPU memory consumption. Also load the corresponding tokenizer.

Key considerations:

  • Only one neural network model is loaded, reducing memory requirements
  • The model must be a decoder-only transformer
  • The vocab_size from the model config is needed to initialize the n-gram storage
  • Set the model to evaluation mode after loading
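A minimal loading sketch, assuming HuggingFace Transformers. The helper name `load_target` is illustrative, not the repository's API, and the `load_in_8bit` flag is one common way to request int8 quantization in Transformers (it requires the bitsandbytes package).

```python
def load_target(model_name: str, int8: bool = False):
    """Hypothetical helper: load the target model and its tokenizer.

    int8=True assumes bitsandbytes-backed 8-bit loading, one common way
    to apply int8 quantization with Transformers.
    """
    # Imports are deferred so the sketch reads without the libraries installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=int8,   # optional int8 quantization
        device_map="auto",
    )
    model.eval()             # evaluation mode, as this step requires
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    vocab_size = model.config.vocab_size  # needed to size the n-gram storage
    return model, tokenizer, vocab_size
```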

Step 3: Initialize N-gram Storage

Create an n-gram storage instance that will serve as the drafter. Two implementations are available: a multi-level NGramStorage that maintains k-grams for all k in [2, N] with fallback from longer to shorter contexts, and a simpler OneLevelNGramStorage that only handles n-grams of a single fixed length.

What happens:

  • The storage is constructed with parameter N (the maximum n-gram order) and the model's vocabulary size
  • Multi-level storage (NGramStorage) tries the longest context first and falls back to shorter ones
  • Single-level storage (OneLevelNGramStorage) only looks at exactly (N-1) tokens of context
  • The storage will be seeded with prompt n-grams when generation begins

Key considerations:

  • Higher N captures longer patterns but may have sparser coverage
  • The multi-level variant with fallback is generally more robust
  • Setting filler_top_k=1 reproduces the NAPD approach from Ou et al. (2024)
  • The storage can be optionally reset between generations
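The multi-level variant can be sketched in pure Python as below. Class and method names are illustrative, not the repository's actual interface; the point is the count table plus longest-context-first fallback.

```python
from collections import defaultdict

class NGramStorage:
    """Illustrative multi-level n-gram store: counts k-grams for k in [2, N]
    and predicts the most frequent continuation, longest context first."""

    def __init__(self, n: int, vocab_size: int):
        self.n = n
        self.vocab_size = vocab_size
        # counts[context_tuple][next_token] -> observed frequency
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Add every k-gram (k in [2, N]) of `tokens` to the store."""
        for k in range(2, self.n + 1):
            for i in range(len(tokens) - k + 1):
                ctx = tuple(tokens[i:i + k - 1])
                self.counts[ctx][tokens[i + k - 1]] += 1

    def predict(self, context):
        """Most frequent continuation, falling back from the longest
        available context to shorter ones; None if the context is unseen."""
        for k in range(self.n - 1, 0, -1):   # context lengths N-1 .. 1
            ctx = tuple(context[-k:])
            if len(ctx) == k and ctx in self.counts:
                follow = self.counts[ctx]
                return max(follow, key=follow.get)
        return None
```

A single-level storage would keep only the `k = N` loop in `update` and skip the fallback in `predict`.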

Step 4: Prepare Input

Tokenize the user's text prompt into a list of token IDs. If using a chat-instructed model, apply the appropriate chat template before tokenization. The prompt tokens will seed the n-gram storage with initial patterns.

Key considerations:

  • The generation function expects a flat Python list of integer token IDs
  • Longer prompts provide more initial n-gram patterns for the storage
  • Chat-templated inputs require the correct template for the model family
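A hedged sketch of the input preparation, as a hypothetical helper. The chat branch assumes a Transformers tokenizer with a chat template configured (`apply_chat_template`); the key point is that the result is a flat Python list of ints.

```python
def prepare_input(tokenizer, prompt: str, chat: bool = False):
    """Hypothetical helper: tokenize `prompt` into a flat list of token ids."""
    if chat:
        # Assumes a Transformers tokenizer with a chat template configured.
        ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
        )
    else:
        ids = tokenizer(prompt).input_ids
    return list(ids)  # the generation function expects a flat Python list
```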

Step 5: Configure Sampling Strategy

Select and instantiate a logits processor for token sampling from the target model's probability distribution. The same sampling strategies are available as for standard speculative decoding: greedy, multinomial, top-k, nucleus, or combined top-k+nucleus.

Key considerations:

  • Greedy decoding provides deterministic output
  • The logits processor is only applied to the target model's output (the n-gram storage uses count-based lookup)
  • Temperature and top-p/top-k settings affect both verification and the final token selection
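Two of the listed strategies can be sketched in pure Python over a raw logits list; in the actual implementation these would be logits-processor objects, and the function names here are illustrative.

```python
import math
import random

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    """Deterministic: pick the argmax token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Multinomial sampling restricted to the k highest logits."""
    scaled = [x / temperature for x in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    probs = _softmax([scaled[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]
```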

Step 6: Run NASD Generation

Execute the n-gram assisted speculative decoding loop. The process begins by seeding the n-gram storage with all k-grams from the prompt. In each iteration, the storage generates gamma draft tokens by looking up the most likely continuation for the current context. The target model verifies all drafts in a single forward pass using greedy matching: each draft is compared against what the target model would have sampled, and the first mismatch ends acceptance. The storage is then updated with accepted tokens and optionally with the target model's top-k predictions at each verified position.

What happens:

  • The n-gram storage is initialized (seeded) with all k-grams from the prompt
  • In each iteration, the storage generates gamma drafts via context lookup (essentially free)
  • The target model processes all drafts in one forward pass
  • Greedy matching compares each draft token against the target's sampled token
  • The first mismatch terminates acceptance; the target's token replaces the rejected draft
  • The storage is updated with accepted tokens and optionally top-k target predictions
  • If stop_if_unknown is set, drafting halts early when the storage has no prediction for a context

Key considerations:

  • The gamma hyperparameter controls draft length, similar to standard speculative decoding
  • The filler_top_k parameter controls how aggressively the storage is updated with target model predictions
  • Setting filler_top_k=1 limits updates to only accepted tokens (NAPD-equivalent)
  • Higher filler_top_k enriches the storage faster but may introduce noise
  • The acceptance rate depends on how predictable the generation is from prompt patterns
  • KV-cache is optional and can speed up the target model's forward passes
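The loop above can be sketched end to end with a deterministic stand-in for the target model. This is a toy illustration, not the repository's implementation: `target_next` replaces the batched forward pass with one greedy step per token, and the storage object is assumed to expose `update`/`predict` as in the Step 3 description.

```python
def nasd_generate(target_next, storage, prompt_ids, max_new, gamma):
    """Toy NASD loop. `target_next(seq) -> token` stands in for one greedy
    target-model step; a real implementation verifies all gamma drafts in a
    single batched forward pass."""
    storage.update(prompt_ids)                 # seed with prompt k-grams
    seq = list(prompt_ids)
    accepted = proposed = 0
    while len(seq) - len(prompt_ids) < max_new:
        # Draft gamma tokens from the n-gram storage (essentially free).
        drafts = []
        for _ in range(gamma):
            tok = storage.predict(seq + drafts)
            if tok is None:                    # stop_if_unknown behaviour
                break
            drafts.append(tok)
        # Verify by greedy matching: accept until the first mismatch.
        for d in drafts:
            t = target_next(seq)
            proposed += 1
            seq.append(t)                      # target's token always advances
            if t == d:
                accepted += 1
            else:
                break                          # first mismatch ends acceptance
        else:
            seq.append(target_next(seq))       # bonus token when all accepted
        storage.update(seq[-(gamma + 1):])     # refresh with the new tokens
    rate = accepted / proposed if proposed else 0.0
    return seq[len(prompt_ids):], rate
```

When the storage has no prediction (empty draft list), the `for`/`else` branch degrades gracefully to one plain autoregressive step per iteration.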

Step 7: Decode Output

Convert the generated list of token IDs back into human-readable text using the tokenizer's decode method. The function returns both the token IDs and the acceptance rate.

Key considerations:

  • Use skip_special_tokens=True for clean text output
  • The acceptance rate indicates how often the n-gram storage correctly predicted the target model's output
  • A low acceptance rate suggests the generation content diverges significantly from prompt patterns
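A minimal decoding sketch, assuming a Transformers-style tokenizer; the wrapper name is hypothetical.

```python
def decode_output(tokenizer, token_ids, acceptance_rate):
    """Hypothetical wrapper: turn generated ids into clean text and report
    the acceptance rate returned alongside them."""
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    # A low rate means the n-gram drafter rarely matched the target model,
    # so the run was close to plain autoregressive decoding.
    print(f"acceptance rate: {acceptance_rate:.2%}")
    return text
```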

Execution Diagram

GitHub URL

Workflow Repository