Principle:Ggml org Llama cpp Speculation Strategy Selection

Field	Value
Principle Name	Speculation Strategy Selection
Workflow	Speculative_Decoding
Step	1 of 5
Domain	Speculative Decoding Configuration
Scope	Choosing the appropriate speculation strategy for accelerated token generation

Overview

Description

Speculative decoding accelerates autoregressive language model inference by generating multiple candidate tokens cheaply and then verifying them in parallel with the target model. The choice of speculation strategy determines how candidate (draft) tokens are produced: via a smaller draft model, via n-gram pattern matching on previously generated tokens, via EAGLE3 draft models, or via other heuristic methods.
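The draft-then-verify step described above can be sketched as follows. This is a minimal illustration under greedy sampling, not the llama.cpp implementation: `verify_greedy` and `target_next` are hypothetical names, and the target model is queried one position at a time for clarity, whereas a real implementation scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one speculation step (greedy case).

    Drafted tokens are accepted while they equal the target model's own
    greedy choice. On the first mismatch the target's token is emitted
    instead and the rest of the draft is discarded.
    """
    emitted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)  # hypothetical: target's greedy next token
        if tok != expected:
            emitted.append(expected)  # correction token from the target
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    # Every drafted token was accepted: the target still yields one more token.
    emitted.append(target_next(ctx))
    return emitted


def toy_target(ctx: Sequence[int]) -> int:
    """Toy stand-in for a target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


result = verify_greedy([1, 2, 3], [4, 5, 9], toy_target)
```

In this toy run the first two drafted tokens are accepted and the third is replaced by the target's own choice, so a single verification step still emits three tokens.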

The strategy selection principle governs how a user or system chooses among the available speculation methods based on trade-offs between draft quality (acceptance rate), draft speed (tokens generated per second), memory overhead, and implementation complexity.
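The trade-off between acceptance rate and draft speed can be estimated with a simple model. The sketch below assumes that each drafted token is accepted independently with the same probability `alpha`, which is a simplification. The function names and the `cost_ratio` parameter (cost of drafting one token relative to one target forward pass) are illustrative and are not llama.cpp settings.

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability
    alpha; the result is the geometric sum 1 + alpha + ... + alpha**n_draft.
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, cost_ratio: float) -> float:
    """Rough speedup over plain autoregressive decoding.

    One step costs n_draft draft tokens (each cost_ratio of a target pass)
    plus one target pass; cost_ratio is near 0 for n-gram drafting.
    """
    step_cost = n_draft * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, n_draft) / step_cost


# A draft model with good acceptance but non-trivial cost, versus an
# n-gram drafter with a lower acceptance rate and negligible cost.
draft_model_speedup = estimated_speedup(alpha=0.8, n_draft=4, cost_ratio=0.1)
ngram_speedup = estimated_speedup(alpha=0.5, n_draft=4, cost_ratio=0.0)
```

Under this model a cheap drafter with a modest acceptance rate can compete with a more accurate but costlier draft model, which is why the choice depends on the workload.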

Usage

Strategy selection is the first decision point in setting up speculative decoding. The choice depends on:

Whether a compatible draft model is available
Memory budget (draft models require additional VRAM/RAM)
Expected acceptance rate for the target domain
Whether the generation task has repetitive patterns (favoring n-gram methods)

Theoretical Basis

Speculative decoding strategies fall into two categories, depending on whether they need an additional model:

1. Model-based strategies:

Draft Model (DRAFT): Uses a smaller, faster model that is compatible with the target, usually from the same architecture family. The draft model generates candidate tokens autoregressively, which are then verified in a single forward pass of the target model. This typically provides the highest-quality drafts but requires loading a separate model.
EAGLE3: Uses a dedicated EAGLE3 draft model, a specialized architecture designed for efficient multi-token prediction, in place of a general-purpose small model.

2. N-gram based strategies (self-speculative): These strategies require no additional model and instead predict future tokens based on patterns observed in the current generation context.

NGRAM_SIMPLE: Looks up n-gram matches in the generation history. When the last n tokens match a previously seen sequence, the tokens that followed that previous occurrence are proposed as drafts.
NGRAM_MAP_K: Uses n-gram key lookup with a more structured map for higher precision matching.
NGRAM_MAP_K4V: Extends NGRAM_MAP_K with 4 m-gram values per key for broader draft generation, achieving higher acceptance rates at more computational cost.
NGRAM_MOD: Uses a modular n-gram approach with configurable n-gram sizes and a fixed-size hash table.
NGRAM_CACHE: Uses a 3-level n-gram cache (static file, dynamic file, and in-memory) for offline-trained n-gram patterns.
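The lookup idea behind NGRAM_SIMPLE can be sketched in a few lines. This is an illustration of the description above rather than the llama.cpp code: the function name is hypothetical, and preferring the most recent earlier match is an assumption made for the example.

```python
from typing import List, Sequence


def ngram_simple_draft(history: Sequence[int], n: int, n_max: int) -> List[int]:
    """Propose draft tokens by matching the last n tokens against history.

    If the final n tokens also occur earlier in history, the tokens that
    followed that earlier occurrence (up to n_max of them) are returned.
    An empty list means no match was found.
    """
    if n <= 0 or len(history) <= n:
        return []
    key = list(history[-n:])
    # Scan backwards so the most recent earlier occurrence wins (assumption).
    for start in range(len(history) - n - 1, -1, -1):
        if list(history[start:start + n]) == key:
            follow = start + n
            return list(history[follow:follow + n_max])
    return []


# The suffix [1, 2] was seen at the start, followed by 3, 4, 1.
draft = ngram_simple_draft([1, 2, 3, 4, 1, 2], n=2, n_max=3)
```

A linear scan is enough to show the idea; the map-based variants listed above presumably exist to avoid rescanning the history on every step.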

Key parameters governing strategy behavior:

n_max: Maximum number of tokens to draft per speculation step (default: 16)
n_min: Minimum number of draft tokens required (default: 0)
p_split: Split probability for tree-based speculation (default: 0.1)
p_min: Minimum probability threshold for greedy draft acceptance (default: 0.75)
ngram_size_n: N-gram key size for lookup (default: 12)
ngram_size_m: M-gram value size for speculative tokens (default: 48)
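For reference, the parameters above can be grouped into a small configuration object. The sketch below is illustrative: the field names and defaults are taken from the list above, but the `SpecParams` class and both helper functions are hypothetical, and the way `n_min` and `p_min` are applied reflects one reading of their descriptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SpecParams:
    """Defaults copied from the parameter list; the class itself is hypothetical."""
    n_max: int = 16          # maximum draft tokens per speculation step
    n_min: int = 0           # minimum draft tokens required
    p_split: float = 0.1     # split probability for tree-based speculation
    p_min: float = 0.75      # minimum probability for greedy draft acceptance
    ngram_size_n: int = 12   # n-gram key size for lookup
    ngram_size_m: int = 48   # m-gram value size for speculative tokens


def clamp_draft(draft: Sequence[int], params: SpecParams) -> List[int]:
    """Truncate a draft to n_max tokens; drop it if shorter than n_min."""
    kept = list(draft[:params.n_max])
    return kept if len(kept) >= params.n_min else []


def truncate_by_confidence(
    tokens: Sequence[int], probs: Sequence[float], p_min: float
) -> List[int]:
    """Keep drafted tokens up to the first one whose probability is below p_min."""
    kept: List[int] = []
    for tok, p in zip(tokens, probs):
        if p < p_min:
            break
        kept.append(tok)
    return kept


params = SpecParams()
confident = truncate_by_confidence([7, 8, 9], [0.9, 0.8, 0.5], params.p_min)
```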

Strategy selection heuristics:

Use DRAFT when a small draft model is available and the VRAM budget allows it; this typically yields the highest acceptance rates.
Use NGRAM_SIMPLE or NGRAM_MOD for speculation without an additional model on tasks with repetitive patterns (code generation, structured output).
Use NGRAM_CACHE when offline n-gram statistics are available for the target domain.
Multiple strategies can be configured simultaneously; the system tries them in order of preference and uses the first one that produces a non-empty draft.
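The heuristics above, together with the first-non-empty-draft rule, can be sketched as a preference list followed by a fallback loop. All function names here are hypothetical, and the relative ordering of the strategies is an assumption for illustration; only the rule that the first non-empty draft is used comes from the text.

```python
from typing import Callable, List, Sequence, Tuple

Drafter = Callable[[Sequence[int]], List[int]]


def preferred_order(
    has_draft_model: bool,
    fits_memory: bool,
    has_offline_ngram_stats: bool,
    repetitive_output: bool,
) -> List[str]:
    """Turn the selection heuristics into an ordered list of strategy names."""
    order: List[str] = []
    if has_draft_model and fits_memory:
        order.append("DRAFT")
    if has_offline_ngram_stats:
        order.append("NGRAM_CACHE")
    if repetitive_output:
        order.append("NGRAM_SIMPLE")
    return order


def first_nonempty_draft(
    history: Sequence[int], drafters: Sequence[Tuple[str, Drafter]]
) -> Tuple[str, List[int]]:
    """Try each drafter in order and return the first non-empty draft."""
    for name, make_draft in drafters:
        draft = make_draft(history)
        if draft:
            return name, draft
    return "none", []


# Toy drafters: the first finds nothing, so the second one is used.
chain = [
    ("NGRAM_SIMPLE", lambda h: []),
    ("DRAFT", lambda h: [h[-1] + 1, h[-1] + 2]),
]
chosen = first_nonempty_draft([1, 2, 3], chain)
```

When no strategy produces a draft, the sketch returns an empty draft, in which case generation would simply continue with ordinary decoding for that step.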
