Principle:Romsto Speculative Decoding Ngram Storage

Knowledge Sources	An Empirical Study of Smoothing Techniques for Language Modeling Speculative Decoding
Domains	NLP, Statistical_Language_Modeling, Data_Structures
Last Updated	2026-02-14 04:30 GMT

Overview

A dynamically-updated data structure that stores observed n-gram frequency counts and predicts the most likely next token given a context window, serving as a lightweight drafter for n-gram assisted speculative decoding.

Description

N-gram Storage implements the classical statistical language modeling concept of predicting the next token based on the preceding (n-1) tokens, adapted for use as a speculative decoding drafter. Unlike traditional n-gram language models that are pre-trained on large corpora, this storage is built dynamically during inference: it is seeded with n-grams from the input prompt and continuously updated with tokens accepted by the target model.

The storage maintains frequency counts for each observed (context, next_token) pair and tracks the most frequent next token for each context. When queried, it returns the most frequent continuation for the given context. Two variants are provided:

Single-level storage (OneLevelNGramStorage): Uses only exact (n-1)-length contexts. Fails to predict if the exact context has not been seen.
Multi-level storage (NGramStorage): Tries the longest available context first, then falls back to shorter contexts (k-grams for k from n-1 down to 2). This provides better coverage at the cost of less precise predictions.

Usage

Use this principle when implementing n-gram assisted speculative decoding (NASD) and you need a lightweight, memory-efficient drafter that requires no GPU compute. The multi-level variant (NGramStorage) is recommended for most use cases as it provides better coverage through context fallback. The single-level variant is useful when you want strict context matching without fallback.

Theoretical Basis

An n-gram model estimates the probability of a token given its preceding context:

$P (w_{t} | w_{t - n + 1}, \dots, w_{t - 1}) \approx \frac{C (w_{t - n + 1}, \dots, w_{t})}{C (w_{t - n + 1}, \dots, w_{t - 1})}$

Where $C (\cdot)$ is the count of the given sequence in the observed data.

In this implementation, rather than computing full probabilities, the storage simply returns the most frequent next token (the mode):

${\hat{w}}_{t} = \arg \max_{w} C (w_{t - n + 1}, \dots, w_{t - 1}, w)$

Multi-level fallback: If no context of length (n-1) has been seen, try (n-2), then (n-3), down to length 1:

# Abstract multi-level n-gram lookup
def next_token(context):
    for length in range(n-1, 1, -1):
        gram = context[-length:]
        if gram in storage[length]:
            return storage[length][gram].most_frequent, known=True
    return random_token, known=False

Dynamic update: After each verification round, new (context, token) observations are added:

# Abstract n-gram update
def update(context, tokens):
    for length in range(min(n-1, len(context)), 1, -1):
        gram = context[-length:]
        for token in tokens:
            counts[length][gram][token] += 1
            if counts[length][gram][token] > counts[length][gram][best[length][gram]]:
                best[length][gram] = token

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_NGramStorage

Uses Heuristic

Heuristic:Romsto_Speculative_Decoding_Ngram_Order_Selection

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment