Implementation:Romsto Speculative Decoding Ngram Assisted Speculative Generate
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference_Optimization |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Concrete tool for accelerating LLM inference that uses n-gram storage as a model-free, near-zero-cost drafter and verifies drafts by greedy matching.
Description
The ngram_assisted_speculative_generate function implements the N-gram Assisted Speculative Decoding (NASD) algorithm. It differs from standard speculative decoding in three ways: it replaces the neural drafter model with an INgramStorage instance for draft generation, it verifies drafts by greedy matching (direct token comparison) instead of rejection sampling, and it updates the n-gram storage dynamically during generation.
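Greedy matching can be illustrated with a few lines of plain Python. This is a minimal sketch, not the repository's code: `count_accepted` and the token values are hypothetical, and it assumes the target's choice is already available for each drafted position plus one extra position.

```python
from typing import List


def count_accepted(draft_tokens: List[int], target_tokens: List[int]) -> int:
    """Length of the longest prefix on which the drafts equal the target's picks."""
    n_accepted = 0
    for drafted, chosen in zip(draft_tokens, target_tokens):
        if drafted != chosen:
            break
        n_accepted += 1
    return n_accepted


# gamma = 5 tokens proposed by the n-gram storage (toy values)
drafts = [11, 42, 7, 99, 3]
# Tokens the target selects at the same positions, plus one extra position
target_picks = [11, 42, 8, 99, 3, 5]

n = count_accepted(drafts, target_picks)
accepted = drafts[:n]          # drafts kept as-is
correction = target_picks[n]   # target token that replaces the first mismatch
```

Because the comparison is a plain equality test, every draft after the first mismatch is discarded, even if it happens to agree with the target at its own position.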
The function initializes the n-gram storage from the prompt, then enters a speculative loop: draft gamma tokens via n-gram lookup, verify them with the target model in one forward pass, accept the longest matching prefix, sample a bonus token (if every draft matched) or a correction token (at the first mismatch), and update the storage. The filler_top_k parameter controls how aggressively the storage is enriched with alternative tokens from the target distribution.
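The loop described above can be sketched end to end with a toy setup. Everything here is an illustrative assumption rather than the repository's implementation: `toy_nasd` is a hypothetical name, a bigram dict stands in for INgramStorage, `target_next` is a plain function standing in for the target model's greedy pick, and filler_top_k enrichment is omitted. Drafting stops at the first lookup miss, similar in spirit to stop_if_unknown=True.

```python
from typing import Callable, Dict, List, Tuple


def toy_nasd(
    prompt: List[int],
    target_next: Callable[[List[int]], int],
    gamma: int = 3,
    max_gen_len: int = 8,
) -> Tuple[List[int], float]:
    # Hypothetical bigram table standing in for INgramStorage: last token -> next token.
    table: Dict[int, int] = {}
    for a, b in zip(prompt, prompt[1:]):
        table[a] = b

    seq = list(prompt)
    drafted = accepted = 0
    while len(seq) - len(prompt) < max_gen_len:
        # 1. Draft up to gamma tokens by chained table lookups.
        drafts: List[int] = []
        last = seq[-1]
        while len(drafts) < gamma and last in table:
            last = table[last]
            drafts.append(last)
        # 2. Verify: the target's greedy pick at each drafted position, plus one more.
        #    (A real model would produce all of these in a single forward pass.)
        picks = [target_next(seq + drafts[:i]) for i in range(len(drafts) + 1)]
        # 3. Accept the longest matching prefix.
        n = 0
        while n < len(drafts) and drafts[n] == picks[n]:
            n += 1
        drafted += len(drafts)
        accepted += n
        # 4. Keep the accepted drafts plus the target's bonus/correction token.
        new_tokens = drafts[:n] + [picks[n]]
        # 5. Update the table with the newly confirmed transitions.
        for tok in new_tokens:
            table[seq[-1]] = tok
            seq.append(tok)
    generated = seq[len(prompt):][:max_gen_len]
    return generated, (accepted / drafted if drafted else 0.0)


# Toy target that cycles 1 -> 2 -> 3 -> 4 -> 1; the prompt already shows the cycle.
generated, rate = toy_nasd([1, 2, 3, 4, 1], lambda ctx: ctx[-1] % 4 + 1)
print(generated, rate)
```

In this toy run the prompt already contains every transition the target will make, so every draft is accepted. With a prompt that does not predict the target's behaviour, the early rounds accept nothing and the table only becomes useful once the correction tokens have been stored.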
Usage
Import this function when you want speculative-style inference acceleration without loading a second neural model. It requires an INgramStorage instance (either NGramStorage or OneLevelNGramStorage) and a single target model. It is best suited to greedy or low-temperature generation, where draft acceptance rates are higher.
Code Reference
Source Location
- Repository: Speculative-Decoding
- File: ngram_assisted/ngram_assisted.py
- Lines: L10-164
Signature
@torch.no_grad()
def ngram_assisted_speculative_generate(
    inputs: List[int],
    ngramstorage: INgramStorage,
    target: Module,
    tokenizer = None,
    gamma: int = 5,
    filler_top_k: int = 3,
    logits_processor: LogitsProcessor = GreedyProcessor(),
    max_gen_len: int = 40,
    eos_tokens_id: int | List[int] = 1,
    pad_token_id: int = 0,
    use_cache: bool = False,
    first_target: bool = True,
    stop_if_unknown: bool = False,
    debug: bool = False,
) -> Tuple[List[int], float]:
    """
    Generate text using ngram assisted speculative decoding.

    Args:
        inputs (List[int]): input sequence of batch size 1.
        ngramstorage (INgramStorage): NGramStorage as a drafter.
        target (Module): target model.
        tokenizer: tokenizer (for debugging).
        gamma (int): number of draft tokens per step (default 5).
        filler_top_k (int): top-k tokens for n-gram update enrichment (default 3).
        logits_processor (LogitsProcessor): sampling strategy.
        max_gen_len (int): max new tokens to generate (default 40).
        eos_tokens_id: end token ID(s) (default 1).
        pad_token_id (int): pad token ID (default 0).
        use_cache (bool): enable KV-cache (default False).
        first_target (bool): run target prefill first (default True).
        stop_if_unknown (bool): stop drafting if n-gram has no prediction (default False).
        debug (bool): enable debug output (default False).

    Returns:
        List[int]: generated token sequence.
        float: acceptance rate.
    """
Import
from ngram_assisted import ngram_assisted_speculative_generate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | List[int] | Yes | Tokenized input prompt (batch size 1) |
| ngramstorage | INgramStorage | Yes | N-gram storage instance (NGramStorage or OneLevelNGramStorage) |
| target | torch.nn.Module | Yes | Target language model (decoder-only) |
| tokenizer | PreTrainedTokenizer | No | For debug printing only |
| gamma | int | No | Draft tokens per speculative round (default: 5) |
| filler_top_k | int | No | Top-k tokens to add to n-gram storage on each update (default: 3). Set to 1 for minimal updates. |
| logits_processor | LogitsProcessor | No | Sampling strategy (default: GreedyProcessor()) |
| max_gen_len | int | No | Maximum new tokens (default: 40) |
| eos_tokens_id | int or List[int] | No | End-of-sequence token ID(s) (default: 1) |
| pad_token_id | int | No | Padding token ID (default: 0) |
| use_cache | bool | No | Enable KV-cache (default: False) |
| first_target | bool | No | Prefill target model before speculative loop (default: True) |
| stop_if_unknown | bool | No | Stop drafting when n-gram has no prediction (default: False) |
| debug | bool | No | Enable debug visualization (default: False) |
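The effect of filler_top_k can be pictured with a small stand-alone sketch. The storage layout, `top_k_ids`, and `enrich` are hypothetical and only assume that enrichment means recording the k highest-scoring target tokens as candidate continuations of a context; the repository's actual update rule may differ.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def top_k_ids(scores: Sequence[float], k: int) -> List[int]:
    """Token IDs of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Hypothetical storage: context tuple -> candidate next tokens, best first.
storage: Dict[Tuple[int, ...], List[int]] = defaultdict(list)


def enrich(context: Tuple[int, ...], scores: Sequence[float], filler_top_k: int) -> None:
    """Record the top-k target tokens as continuations of `context`."""
    for tok in top_k_ids(scores, filler_top_k):
        if tok not in storage[context]:
            storage[context].append(tok)


scores = [0.05, 0.60, 0.10, 0.25]        # toy target distribution over 4 tokens
enrich((7, 3), scores, filler_top_k=3)   # stores the three most likely tokens
enrich((7, 4), scores, filler_top_k=1)   # minimal update: only the top token
```

A larger k fills the storage faster at the cost of storing continuations the target did not actually choose; k = 1 keeps only the target's own pick.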
Outputs
| Name | Type | Description |
|---|---|---|
| generated_ids | List[int] | Generated token IDs (excludes the prompt) |
| acceptance_rate | float | Ratio of accepted draft tokens to total speculated tokens (0.0 if no drafts) |
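The acceptance rate in the table above reduces to a guarded division. The helper below is a hypothetical illustration of that definition, not a function exported by the repository.

```python
def acceptance_rate(accepted_drafts: int, total_drafts: int) -> float:
    """Accepted draft tokens divided by all drafted tokens; 0.0 when nothing was drafted."""
    if total_drafts == 0:
        return 0.0
    return accepted_drafts / total_drafts


# Three rounds with gamma = 5 in which 5, 2, and 0 drafts were accepted.
rate = acceptance_rate(5 + 2 + 0, 3 * 5)
print(f"{rate:.3f}")
```

Bonus and correction tokens come from the target, not the drafter, so they do not count toward either term.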
Usage Examples
Basic NASD Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
from ngram_assisted import NGramStorage, ngram_assisted_speculative_generate
from utils.logits_processor import GreedyProcessor
# 1. Load target model only (no drafter needed)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", device_map="cuda"
)
target.eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
# 2. Create n-gram storage
ngram_storage = NGramStorage(n=3, vocab_size=target.config.vocab_size)
# 3. Prepare input
prompt = "Explain the Fibonacci sequence."
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").input_ids[0].tolist()
# 4. Generate
output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    tokenizer=tokenizer,
    gamma=5,
    filler_top_k=3,
    logits_processor=GreedyProcessor(),
    max_gen_len=100,
    eos_tokens_id=[tokenizer.eos_token_id],
    stop_if_unknown=True,
)
# 5. Decode
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Output: {output_text}")
print(f"Acceptance rate: {accept_rate:.3f}")
With Single-Level Storage
from ngram_assisted import OneLevelNGramStorage, ngram_assisted_speculative_generate
# Reuses target, tokenizer, and inputs from the Basic NASD Generation example.
# OneLevelNGramStorage uses exact (n-1)-token context only
ngram_storage = OneLevelNGramStorage(n=4, vocab_size=target.config.vocab_size)
output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    gamma=5,
    max_gen_len=50,
    eos_tokens_id=[tokenizer.eos_token_id],
)
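The practical difference between exact-context lookup and a multi-level lookup can be shown with a toy table. This sketch rests on an assumption: that the multi-level NGramStorage also indexes contexts shorter than n-1 tokens and falls back to them on a miss. `build_table` and `lookup` are hypothetical helpers, not part of the repository.

```python
from typing import Dict, List, Optional, Tuple

Table = Dict[Tuple[int, ...], int]


def build_table(tokens: List[int], n: int, all_levels: bool) -> Table:
    """Map each context to the token that most recently followed it."""
    table: Table = {}
    lengths = range(1, n) if all_levels else [n - 1]
    for size in lengths:
        for i in range(len(tokens) - size):
            table[tuple(tokens[i:i + size])] = tokens[i + size]
    return table


def lookup(table: Table, history: List[int], n: int, backoff: bool) -> Optional[int]:
    """Try the (n-1)-token context first; optionally fall back to shorter ones."""
    sizes = range(n - 1, 0, -1) if backoff else [n - 1]
    for size in sizes:
        if size > len(history):
            continue
        hit = table.get(tuple(history[-size:]))
        if hit is not None:
            return hit
    return None


tokens = [5, 6, 7, 8]
one_level = build_table(tokens, n=4, all_levels=False)
multi_level = build_table(tokens, n=4, all_levels=True)

history = [1, 2, 7]  # the full 3-token context (1, 2, 7) was never observed
exact_only = lookup(one_level, history, n=4, backoff=False)      # no prediction
with_backoff = lookup(multi_level, history, n=4, backoff=True)   # found via (7,)
```

Exact-only lookup misses more often, which makes stop_if_unknown more likely to cut a draft short, but the predictions it does return come from a longer and therefore more specific context.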