Principle:Ggml org Llama cpp Ngram Speculative Drafting

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Speculative_Decoding
Last Updated	2026-02-15 00:00 GMT

Overview

N-gram speculative drafting is the principle of using n-gram statistics from previously generated text to propose future tokens, so that speculative decoding works without a separate draft model.

Description

This principle covers speculative decoding approaches that draft tokens from n-gram caches built from the model's own output history, instead of relying on a separate, smaller draft model. By tracking which token sequences have already appeared during generation, the system predicts likely continuations, and the target model then verifies those drafted tokens in parallel in one batched evaluation. This includes both lookup-based and lookahead-based decoding strategies.
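As a rough illustration of the n-gram cache described above, the following Python sketch records which token followed each n-token context and then drafts a continuation for the current suffix. It is a minimal sketch under stated assumptions, not the llama.cpp implementation: the names `build_ngram_cache` and `propose_draft`, the context length `n`, and the `max_draft` limit are all hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def build_ngram_cache(tokens, n=2):
    """Map each n-token context to a Counter of the tokens that followed it."""
    cache = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        cache[context][tokens[i + n]] += 1
    return cache


def propose_draft(cache, tokens, n=2, max_draft=4):
    """Extend the current suffix with the most frequently observed continuation."""
    draft = []
    context = list(tokens[-n:])
    while len(draft) < max_draft:
        followers = cache.get(tuple(context))
        if not followers:
            break  # cache miss: nothing to draft from this context
        next_token = followers.most_common(1)[0][0]
        draft.append(next_token)
        # Slide the context window forward over the drafted token.
        context = context[1:] + [next_token]
    return draft


# Hypothetical token IDs standing in for a repetitive generation history.
history = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
cache = build_ngram_cache(history, n=2)
draft = propose_draft(cache, history, n=2, max_draft=3)
print(draft)
```

The draft is only a guess. It costs a dictionary lookup per token rather than a model evaluation, which is why a wrong guess is cheap as long as the target model checks it afterwards.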

Usage

Apply this principle when you want to accelerate text generation through speculative decoding but do not have or do not want to use a separate draft model. It is most effective for repetitive or formulaic text where n-gram patterns recur frequently.

Theoretical Basis

N-gram speculative drafting maintains a map of the n-gram sequences observed during generation. To predict the next tokens, the system looks up the current context suffix in the n-gram cache and proposes the historically observed continuation as a draft. The target model then verifies the whole draft in a single forward pass, because the drafted positions can be evaluated together as one batch. Drafted tokens are accepted up to the first position where the target model disagrees, and the remainder of the draft is discarded.

Lookup decoding takes the simpler approach of directly matching context suffixes against the n-gram cache. Lookahead decoding extends the idea by maintaining a window of speculative tokens that are continuously refined. In both cases the speedup depends on how repetitive the generated text is and on the resulting n-gram cache hit rate.
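The draft-and-verify loop for lookup decoding can be sketched as follows. This is a simplified illustration under several assumptions: the target model is replaced by a toy deterministic function (`toy_target`), decoding is greedy, and the helper names `lookup_draft`, `verify`, and `generate` are hypothetical. A real system would score all drafted positions in one batched forward pass; the sketch calls the toy model once per position for clarity and counts each `verify` call as one pass.

```python
def lookup_draft(tokens, n=2, max_draft=4):
    """Find the most recent earlier match of the last n tokens; return what followed it."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []


def verify(target_next, tokens, draft):
    """Accept drafted tokens until the first disagreement, then add one target token."""
    accepted = []
    for proposed in draft:
        expected = target_next(tokens + accepted)
        if proposed != expected:
            accepted.append(expected)  # replace the rejected token, drop the rest
            return accepted
        accepted.append(proposed)
    # Every drafted token matched; the same pass also yields one extra token.
    accepted.append(target_next(tokens + accepted))
    return accepted


def generate(target_next, prompt, num_tokens, n=2, max_draft=4):
    """Generate num_tokens tokens and count the verification passes needed."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        draft = lookup_draft(tokens, n, max_draft)
        tokens += verify(target_next, tokens, draft)
        passes += 1
    return tokens[:len(prompt) + num_tokens], passes


def toy_target(seq):
    """Toy stand-in for the target model: cycles through token IDs 0..4."""
    return (seq[-1] + 1) % 5


tokens, passes = generate(toy_target, [0, 1, 2, 3, 4, 0, 1], num_tokens=10)
print(tokens[7:], passes)
```

With this perfectly repetitive toy sequence every draft is accepted, so the ten new tokens need only two verification passes. On text with few repeated n-grams the drafts would be empty or rejected early, and the loop would fall back to roughly one token per pass.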
