Implementation:Romsto Speculative Decoding Ngram Assisted Speculative Generate

From Leeroopedia
Knowledge Sources
Domains: NLP, Inference_Optimization
Last Updated: 2026-02-14 04:30 GMT

Overview

Concrete tool for accelerating LLM inference by using an n-gram storage as a near-zero-cost drafter (table lookups instead of a neural forward pass), with greedy-matching verification.

Description

The ngram_assisted_speculative_generate function implements the N-gram Assisted Speculative Decoding (NASD) algorithm. Compared with standard speculative decoding, it differs in three ways: it replaces the neural drafter model with an INgramStorage instance for draft generation, it verifies drafts by greedy matching (direct token comparison) instead of rejection sampling, and it dynamically updates the n-gram storage during generation.
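
As an illustration of greedy matching, here is a minimal sketch in plain Python. The helper name greedy_match and its argument layout are assumptions made for this example and are not taken from the repository; it presumes the target's chosen token is already known for each of the len(draft) + 1 positions scored in the verification pass.

```python
from typing import List, Tuple


def greedy_match(draft: List[int], target_choice: List[int]) -> Tuple[int, int]:
    """Hypothetical helper: compare draft tokens with the target's choices.

    `target_choice[i]` is the token the target selects at draft position i;
    the final entry is the token following a fully accepted draft.
    Returns (number of accepted draft tokens, next token to append).
    """
    if len(target_choice) != len(draft) + 1:
        raise ValueError("target_choice must have len(draft) + 1 entries")

    n_accepted = 0
    for drafted_tok, target_tok in zip(draft, target_choice):
        if drafted_tok != target_tok:
            break  # first mismatch ends the accepted prefix
        n_accepted += 1

    # After a mismatch this is the correction token;
    # after full acceptance it is the bonus token.
    return n_accepted, target_choice[n_accepted]


print(greedy_match([5, 6, 7], [5, 6, 9, 4]))  # (2, 9)
```

Because acceptance is a plain equality check, no draft probabilities are needed, which is what lets a lookup table stand in for a drafter model.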

The function initializes the n-gram storage from the prompt, then enters a speculative loop: draft gamma tokens via n-gram lookup, verify them with the target model in one forward pass, accept the matching prefix, sample a bonus or correction token, and update the storage. The filler_top_k parameter controls how aggressively the storage is enriched with alternative tokens from the target distribution.
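
The loop described above can be sketched with a toy, dependency-free model. Everything here is a simplified assumption for illustration: ToyBigramStorage is a hypothetical stand-in (not the INgramStorage API), target_next is a deterministic function playing the role of a greedy target model, verification calls it once per position rather than in a single batched forward pass, and filler_top_k enrichment is not modelled.

```python
from typing import Callable, Dict, List, Tuple


class ToyBigramStorage:
    """Hypothetical drafter: remembers the latest continuation of each token."""

    def __init__(self) -> None:
        self.table: Dict[int, int] = {}

    def update(self, tokens: List[int]) -> None:
        for prev, nxt in zip(tokens, tokens[1:]):
            self.table[prev] = nxt  # newest observation wins

    def draft(self, context: List[int], gamma: int) -> List[int]:
        out: List[int] = []
        last = context[-1]
        # Stop early when the table has no prediction for the last token.
        while len(out) < gamma and last in self.table:
            last = self.table[last]
            out.append(last)
        return out


def toy_generate(
    prompt: List[int],
    target_next: Callable[[List[int]], int],
    gamma: int = 3,
    max_gen_len: int = 8,
) -> Tuple[List[int], float]:
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    storage = ToyBigramStorage()
    storage.update(prompt)  # initialise the storage from the prompt
    seq = list(prompt)
    drafted = accepted = 0

    while len(seq) - len(prompt) < max_gen_len:
        start = len(seq)
        draft = storage.draft(seq, gamma)
        # Greedy matching: keep draft tokens while they equal the target's choice.
        n_ok = 0
        for tok in draft:
            if target_next(seq + draft[:n_ok]) != tok:
                break
            n_ok += 1
        drafted += len(draft)
        accepted += n_ok
        seq += draft[:n_ok]
        seq.append(target_next(seq))  # bonus or correction token
        storage.update(seq[start - 1:])  # learn from the newly added tokens

    generated = seq[len(prompt):][:max_gen_len]
    rate = accepted / drafted if drafted else 0.0
    return generated, rate


# Toy target: a counter modulo 4, so the prompt already contains the pattern.
print(toy_generate([0, 1, 2, 3, 0], lambda s: (s[-1] + 1) % 4))
# ([1, 2, 3, 0, 1, 2, 3, 0], 1.0)
```

In this toy run every draft is accepted because the prompt already contains the repeating pattern; a prompt that seeds misleading n-grams lowers the rate until the storage is corrected by the update step.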

Usage

Import this function when you want speculative-style inference acceleration without loading a second neural model. It requires an INgramStorage instance (either NGramStorage or OneLevelNGramStorage) and a single target model. It is best suited to greedy or low-temperature generation, where draft acceptance rates are higher.

Code Reference

Source Location

Signature

@torch.no_grad()
def ngram_assisted_speculative_generate(
    inputs: List[int],
    ngramstorage: INgramStorage,
    target: Module,
    tokenizer = None,
    gamma: int = 5,
    filler_top_k: int = 3,
    logits_processor: LogitsProcessor = GreedyProcessor(),
    max_gen_len: int = 40,
    eos_tokens_id: int | List[int] = 1,
    pad_token_id: int = 0,
    use_cache: bool = False,
    first_target: bool = True,
    stop_if_unknown: bool = False,
    debug: bool = False,
) -> Tuple[List[int], float]:
    """
    Generate text using ngram assisted speculative decoding.

    Args:
        inputs (List[int]): input sequence of batch size 1.
        ngramstorage (INgramStorage): NGramStorage as a drafter.
        target (Module): target model.
        tokenizer: tokenizer (for debugging).
        gamma (int): number of draft tokens per step (default 5).
        filler_top_k (int): top-k tokens for n-gram update enrichment (default 3).
        logits_processor (LogitsProcessor): sampling strategy.
        max_gen_len (int): max new tokens to generate (default 40).
        eos_tokens_id: end token ID(s) (default 1).
        pad_token_id (int): pad token ID (default 0).
        use_cache (bool): enable KV-cache (default False).
        first_target (bool): run target prefill first (default True).
        stop_if_unknown (bool): stop drafting if n-gram has no prediction (default False).
        debug (bool): enable debug output (default False).

    Returns:
        List[int]: generated token sequence.
        float: acceptance rate.
    """

Import

from ngram_assisted import ngram_assisted_speculative_generate

I/O Contract

Inputs

Name | Type | Required | Description
inputs | List[int] | Yes | Tokenized input prompt (batch size 1)
ngramstorage | INgramStorage | Yes | N-gram storage instance (NGramStorage or OneLevelNGramStorage)
target | torch.nn.Module | Yes | Target language model (decoder-only)
tokenizer | PreTrainedTokenizer | No | For debug printing only
gamma | int | No | Draft tokens per speculative round (default: 5)
filler_top_k | int | No | Top-k tokens to add to n-gram storage on each update (default: 3). Set to 1 for minimal updates.
logits_processor | LogitsProcessor | No | Sampling strategy (default: GreedyProcessor())
max_gen_len | int | No | Maximum new tokens (default: 40)
eos_tokens_id | int or List[int] | No | End-of-sequence token ID(s) (default: 1)
pad_token_id | int | No | Padding token ID (default: 0)
use_cache | bool | No | Enable KV-cache (default: False)
first_target | bool | No | Prefill target model before speculative loop (default: True)
stop_if_unknown | bool | No | Stop drafting when n-gram has no prediction (default: False)
debug | bool | No | Enable debug visualization (default: False)

Outputs

Name | Type | Description
generated_ids | List[int] | Generated token IDs (excludes the prompt)
acceptance_rate | float | Ratio of accepted draft tokens to total speculated tokens (0.0 if no drafts)
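
To make the acceptance_rate definition concrete, the following sketch computes it from per-round counts. The function name and the idea of tracking counts per round are assumptions for this example; the source only states the ratio and the 0.0 fallback.

```python
from typing import Sequence


def acceptance_rate(drafted: Sequence[int], accepted: Sequence[int]) -> float:
    """Accepted draft tokens divided by all speculated tokens.

    `drafted[i]` and `accepted[i]` are the counts for speculative round i.
    Returns 0.0 when nothing was drafted, matching the documented fallback.
    """
    total_drafted = sum(drafted)
    if total_drafted == 0:
        return 0.0
    return sum(accepted) / total_drafted


# Three rounds: 5 of 5, 2 of 5, and 0 of 3 draft tokens accepted.
print(acceptance_rate([5, 5, 3], [5, 2, 0]))  # 7 / 13
```

Bonus and correction tokens come from the target model rather than the drafter, so under this definition they count toward neither the numerator nor the denominator.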

Usage Examples

Basic NASD Generation

from transformers import AutoModelForCausalLM, AutoTokenizer
from ngram_assisted import NGramStorage, ngram_assisted_speculative_generate
from utils.logits_processor import GreedyProcessor

# 1. Load target model only (no drafter needed)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", device_map="cuda"
)
target.eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# 2. Create n-gram storage
ngram_storage = NGramStorage(n=3, vocab_size=target.config.vocab_size)

# 3. Prepare input
prompt = "Explain the Fibonacci sequence."
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").input_ids[0].tolist()

# 4. Generate
output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    tokenizer=tokenizer,
    gamma=5,
    filler_top_k=3,
    logits_processor=GreedyProcessor(),
    max_gen_len=100,
    eos_tokens_id=[tokenizer.eos_token_id],
    stop_if_unknown=True,
)

# 5. Decode
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Output: {output_text}")
print(f"Acceptance rate: {accept_rate:.3f}")

With Single-Level Storage

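from ngram_assisted import OneLevelNGramStorage, ngram_assisted_speculative_generate

# Reuses `target`, `tokenizer`, and `inputs` from the previous example.
# OneLevelNGramStorage uses exact (n-1)-token context only
ngram_storage = OneLevelNGramStorage(n=4, vocab_size=target.config.vocab_size)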

output_ids, accept_rate = ngram_assisted_speculative_generate(
    inputs,
    ngram_storage,
    target,
    gamma=5,
    max_gen_len=50,
    eos_tokens_id=[tokenizer.eos_token_id],
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
