Implementation:Romsto Speculative Decoding Ngram Assisted Speculative Generate

Knowledge Sources	Speculative Decoding; Fast Inference from Transformers via Speculative Decoding
Domains	NLP, Inference_Optimization
Last Updated	2026-02-14 04:30 GMT

Overview

A concrete tool for accelerating LLM inference that uses an n-gram storage as a near-zero-cost drafter and verifies drafts by greedy matching.

Description

The ngram_assisted_speculative_generate function implements the N-gram Assisted Speculative Decoding (NASD) algorithm. It replaces the neural drafter model with an INgramStorage instance for draft generation, uses greedy matching (direct token comparison) instead of rejection sampling for verification, and dynamically updates the n-gram storage during generation.
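As a minimal sketch of what greedy matching means, the snippet below compares a draft against the tokens the target would have chosen at the same positions and counts the leading run that agrees. The helper name count_accepted_prefix and the token values are hypothetical and are not taken from the repository.

```python
from typing import List


def count_accepted_prefix(draft: List[int], target_choice: List[int]) -> int:
    """Return how many leading draft tokens equal the target's own choices."""
    accepted = 0
    for drafted, chosen in zip(draft, target_choice):
        if drafted != chosen:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted


# Illustrative token ids only: the third draft token disagrees with the target.
draft = [11, 42, 7, 99, 3]
target_choice = [11, 42, 8, 99, 3]
print(count_accepted_prefix(draft, target_choice))  # 2
```

Unlike rejection sampling, this check involves no probabilities or random draws: a draft token is kept only if it is exactly the token the target selected.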

The function initializes the n-gram storage from the prompt, then enters a speculative loop: draft up to gamma tokens via n-gram lookup, verify them with the target model in a single forward pass, accept the matching prefix, sample a bonus token (if every draft matched) or a correction token (at the first mismatch), and update the storage. The filler_top_k parameter controls how aggressively the storage is enriched with alternative tokens from the target distribution: larger values store more of the target's top-ranked candidates on each update.
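The loop described above can be sketched in pure Python under strong simplifying assumptions: a bigram dictionary stands in for INgramStorage, a deterministic stub function stands in for the target model, and only the single chosen token is stored on each update (roughly the filler_top_k=1 case). The names toy_nasd and cyclic_target are hypothetical, and this is not the repository's implementation.

```python
from typing import Callable, Dict, List, Tuple


def toy_nasd(
    prompt: List[int],
    target_next: Callable[[List[int]], int],
    gamma: int = 3,
    max_gen_len: int = 8,
) -> Tuple[List[int], float]:
    """Toy speculative loop with a bigram table as drafter (illustrative only)."""
    seq = list(prompt)  # prompt is assumed to be non-empty
    # Initialise the storage from the prompt: previous token -> next token.
    storage: Dict[int, int] = {prev: nxt for prev, nxt in zip(seq, seq[1:])}
    drafted = accepted = 0

    while len(seq) - len(prompt) < max_gen_len:
        # 1. Draft up to gamma tokens by chained table lookups.
        draft: List[int] = []
        context = seq[-1]
        while len(draft) < gamma and context in storage:
            context = storage[context]
            draft.append(context)

        # 2. Verify by greedy matching. A real model scores every draft
        #    position in one forward pass; the stub is called per position.
        n_ok = 0
        while n_ok < len(draft) and target_next(seq + draft[:n_ok]) == draft[n_ok]:
            n_ok += 1
        drafted += len(draft)
        accepted += n_ok

        # 3. Keep the matching prefix, then add the target's own next token
        #    (a correction after a mismatch, a bonus after full acceptance).
        new_tokens = draft[:n_ok]
        new_tokens.append(target_next(seq + new_tokens))

        # 4. Append the tokens and record each transition in the storage.
        for token in new_tokens:
            storage[seq[-1]] = token
            seq.append(token)

    generated = seq[len(prompt):len(prompt) + max_gen_len]
    rate = accepted / drafted if drafted else 0.0
    return generated, rate


def cyclic_target(tokens: List[int]) -> int:
    """Stub target model: always continues the cycle 0, 1, 2, 3, 0, ..."""
    return (tokens[-1] + 1) % 4


generated, rate = toy_nasd([0, 1, 2, 3, 0], cyclic_target)
print(generated, rate)  # [1, 2, 3, 0, 1, 2, 3, 0] 1.0
```

Because every accepted token equals the target's own choice, the output matches what plain greedy decoding with the stub would produce; the drafter only changes how many target calls are needed, not the result.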

Usage

Import this function when you want speculative-style inference acceleration without loading a second neural model. It requires an INgramStorage instance (either NGramStorage or OneLevelNGramStorage) and a single target model. It is best suited to greedy or low-temperature generation, where the target's output is more predictable and draft acceptance rates are therefore higher.

Code Reference

Source Location

Repository: Speculative-Decoding
File: ngram_assisted/ngram_assisted.py
Lines: L10-164

Signature

@torch.no_grad()
def ngram_assisted_speculative_generate(
    inputs: List[int],
    ngramstorage: INgramStorage,
    target: Module,
    tokenizer = None,
    gamma: int = 5,
    filler_top_k: int = 3,
    logits_processor: LogitsProcessor = GreedyProcessor(),
    max_gen_len: int = 40,
    eos_tokens_id: int | List[int] = 1,
    pad_token_id: int = 0,
    use_cache: bool = False,
    first_target: bool = True,
    stop_if_unknown: bool = False,
    debug: bool = False,
) -> Tuple[List[int], float]:
    """
    Generate text using ngram assisted speculative decoding.

    Args:
        inputs (List[int]): input sequence of batch size 1.
        ngramstorage (INgramStorage): NGramStorage as a drafter.
        target (Module): target model.
        tokenizer: tokenizer (for debugging).
        gamma (int): number of draft tokens per step (default 5).
        filler_top_k (int): top-k tokens for n-gram update enrichment (default 3).
        logits_processor (LogitsProcessor): sampling strategy.
        max_gen_len (int): max new tokens to generate (default 40).
        eos_tokens_id: end token ID(s) (default 1).
        pad_token_id (int): pad token ID (default 0).
        use_cache (bool): enable KV-cache (default False).
        first_target (bool): run target prefill first (default True).
        stop_if_unknown (bool): stop drafting if n-gram has no prediction (default False).
        debug (bool): enable debug output (default False).

    Returns:
        List[int]: generated token sequence.
        float: acceptance rate.
    """

Import

from ngram_assisted import ngram_assisted_speculative_generate

I/O Contract

Inputs

Name	Type	Required	Description
inputs	List[int]	Yes	Tokenized input prompt (batch size 1)
ngramstorage	INgramStorage	Yes	N-gram storage instance (NGramStorage or OneLevelNGramStorage)
target	torch.nn.Module	Yes	Target language model (decoder-only)
tokenizer	PreTrainedTokenizer	No	For debug printing only
gamma	int	No	Draft tokens per speculative round (default: 5)
filler_top_k	int	No	Top-k tokens to add to n-gram storage on each update (default: 3). Set to 1 for minimal updates.
logits_processor	LogitsProcessor	No	Sampling strategy (default: GreedyProcessor())
max_gen_len	int	No	Maximum new tokens (default: 40)
eos_tokens_id	int or List[int]	No	End-of-sequence token ID(s) (default: 1)
pad_token_id	int	No	Padding token ID (default: 0)
use_cache	bool	No	Enable KV-cache (default: False)
first_target	bool	No	Prefill target model before speculative loop (default: True)
stop_if_unknown	bool	No	Stop drafting when n-gram has no prediction (default: False)
debug	bool	No	Enable debug visualization (default: False)
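
To make the filler_top_k entry more concrete, here is a small sketch of selecting the k highest-scoring token IDs from one position's scores, which is the kind of candidate list that could be written into the storage on an update. The helper top_k_token_ids and the score values are hypothetical; the repository operates on model logits tensors rather than Python lists.

```python
from typing import List


def top_k_token_ids(logits: List[float], k: int) -> List[int]:
    """Return the ids of the k highest-scoring tokens, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


# Illustrative scores for a 5-token vocabulary.
logits = [0.1, 2.5, -1.0, 1.7, 0.3]
print(top_k_token_ids(logits, 3))  # [1, 3, 4]
print(top_k_token_ids(logits, 1))  # [1]
```

With k=1 only the target's top choice is recorded, which corresponds to the minimal-update setting mentioned in the table.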

Outputs

Name	Type	Description
generated_ids	List[int]	Generated token IDs (excludes the prompt)
acceptance_rate	float	Ratio of accepted draft tokens to total speculated tokens (0.0 if no drafts)

Usage Examples

Basic NASD Generation

from transformers import AutoModelForCausalLM, AutoTokenizer
from ngram_assisted import NGramStorage, ngram_assisted_speculative_generate
from utils.logits_processor import GreedyProcessor

# 1. Load target model only (no drafter needed)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", device_map="cuda"
)
target.eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# 2. Create n-gram storage
ngram_storage = NGramStorage(n=3, vocab_size=target.config.vocab_size)

# 3. Prepare input
prompt = "Explain the Fibonacci sequence."
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").input_ids[0].tolist()

# 4. Generate
output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    tokenizer=tokenizer,
    gamma=5,
    filler_top_k=3,
    logits_processor=GreedyProcessor(),
    max_gen_len=100,
    eos_tokens_id=[tokenizer.eos_token_id],
    stop_if_unknown=True,
)

# 5. Decode
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Output: {output_text}")
print(f"Acceptance rate: {accept_rate:.3f}")

With Single-Level Storage

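# Reuses target, tokenizer, and inputs from the basic example above
from ngram_assisted import OneLevelNGramStorage, ngram_assisted_speculative_generate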

# OneLevelNGramStorage uses exact (n-1)-token context only
ngram_storage = OneLevelNGramStorage(n=4, vocab_size=target.config.vocab_size)

output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    gamma=5,
    max_gen_len=50,
    eos_tokens_id=[tokenizer.eos_token_id],
)
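
The difference between looking up only the exact (n-1)-token context and also falling back to shorter contexts can be illustrated with a toy table. This is a sketch under an assumption: that the multi-level storage backs off to shorter contexts when the longest one is unseen, which is inferred from the comment above rather than confirmed from the repository. The helpers build_table and lookup are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Table = Dict[Tuple[int, ...], int]


def build_table(tokens: List[int], max_context: int, min_context: int) -> Table:
    """Record the next token seen after every context of the allowed lengths."""
    table: Table = {}
    for size in range(min_context, max_context + 1):
        for i in range(len(tokens) - size):
            table[tuple(tokens[i:i + size])] = tokens[i + size]
    return table


def lookup(
    table: Table, history: List[int], max_context: int, min_context: int
) -> Optional[int]:
    """Try the longest context first, then shorter ones down to min_context."""
    for size in range(max_context, min_context - 1, -1):
        if len(history) >= size:
            hit = table.get(tuple(history[-size:]))
            if hit is not None:
                return hit
    return None


tokens = [5, 6, 7, 8, 6, 9]
n = 4
one_level = build_table(tokens, max_context=n - 1, min_context=n - 1)
multi_level = build_table(tokens, max_context=n - 1, min_context=1)

history = [1, 2, 6]  # the 3-token context (1, 2, 6) was never seen
print(lookup(one_level, history, n - 1, n - 1))  # None
print(lookup(multi_level, history, n - 1, 1))    # 9
```

In this toy, the exact-context table drafts nothing for an unseen context, while the backoff table still proposes a token from the 1-token context; the trade-off is more drafts versus less specific ones.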

Related Pages

Implements Principle

Principle:Romsto_Speculative_Decoding_Ngram_Assisted_Speculative_Decoding

Requires Environment

Environment:Romsto_Speculative_Decoding_CUDA_PyTorch

Uses Heuristic
