Principle:Ggml org Llama cpp Speculation Initialization

Field	Value
Principle Name	Speculation Initialization
Workflow	Speculative_Decoding
Step	4 of 5
Domain	Speculative Decoding System Setup
Scope	Setting up speculative decoding state: draft context, strategy selection, buffer allocation

Overview

Description

Speculation initialization is the step where the configured speculation strategy is materialized into a runtime execution state. This involves creating the draft inference context (for model-based strategies), instantiating the appropriate strategy-specific state objects (n-gram maps, caches, or draft contexts), and organizing them into a prioritized list of implementations to try during generation.

The initialization bridges the gap between the static configuration (common_params_speculative) and the runtime execution engine (common_speculative struct). It validates the configuration, creates backend resources, and prepares the system for the draft-then-verify generation loop.

Usage

Speculation initialization runs once, after the target model and the draft model (if one is used) have been loaded. The resulting common_speculative object is then used throughout the generation loop to draft and accept tokens.

Theoretical Basis

The initialization process implements a strategy chain pattern where multiple speculation implementations can be configured simultaneously. During generation, the system tries each implementation in order and uses the first one that produces a non-empty draft. This fallback mechanism allows combining strategies for robustness.
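The fallback behaviour described above can be sketched as a loop over an ordered list of implementations. This is a minimal illustration, not the llama.cpp code: spec_impl, draft_result, and draft_first_non_empty are hypothetical names, and each strategy is reduced to a function that maps the token history to a (possibly empty) draft.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using token        = std::int32_t;
using draft_tokens = std::vector<token>;

// Hypothetical stand-in for one speculation implementation: a name plus a
// function that proposes draft tokens from the tokens generated so far.
struct spec_impl {
    std::string name;
    std::function<draft_tokens(const std::vector<token> &)> propose;
};

// Outcome of one pass over the chain.
struct draft_result {
    std::string  source; // empty when no implementation produced a draft
    draft_tokens tokens;
};

// Try each implementation in chain order and return the first non-empty draft.
draft_result draft_first_non_empty(const std::vector<spec_impl> & chain,
                                   const std::vector<token> & history) {
    for (const auto & impl : chain) {
        draft_tokens d = impl.propose(history);
        if (!d.empty()) {
            return {impl.name, std::move(d)};
        }
    }
    return {"", {}}; // nothing to verify; the caller decodes normally
}
```

Returning the source name alongside the tokens is a choice made for this sketch; it makes it easy to see which entry in the chain answered.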

Strategy ordering and prioritization:

The implementations are added to the chain in a specific order of preference:

N-gram simple -- Can produce many tokens without any model, low cost
N-gram map (key only) -- Structured key lookup, moderate precision
N-gram map (key + 4 values) -- Higher acceptance rate, more expensive
N-gram mod -- Modular hash-based n-gram with configurable parameters
N-gram cache -- Static/dynamic cache-based lookup
Draft model -- Highest quality drafts, highest cost
EAGLE3 -- Advanced multi-token prediction

This ordering places the cheap n-gram strategies ahead of the model-based ones. Because the chain stops at the first implementation that returns a non-empty draft, a successful n-gram lookup avoids the expensive draft model forward pass entirely.
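One way to picture how the chain is assembled is a function that appends each enabled strategy in the fixed preference order. The sketch below is an assumption-laden simplification: spec_config and its boolean fields are hypothetical and do not reproduce the fields of common_params_speculative, and strategies are represented only by their names.

```cpp
#include <string>
#include <vector>

// Hypothetical configuration flags, one per strategy. The real
// common_params_speculative fields are not reproduced here.
struct spec_config {
    bool ngram_simple  = false;
    bool ngram_map_k   = false;
    bool ngram_map_k4v = false;
    bool ngram_mod     = false;
    bool ngram_cache   = false;
    bool draft_model   = false;
    bool eagle3        = false;
};

// Append the enabled strategies in the preference order listed above,
// regardless of the order in which the flags were set.
std::vector<std::string> build_chain(const spec_config & cfg) {
    std::vector<std::string> chain;
    if (cfg.ngram_simple)  chain.push_back("ngram_simple");
    if (cfg.ngram_map_k)   chain.push_back("ngram_map_k");
    if (cfg.ngram_map_k4v) chain.push_back("ngram_map_k4v");
    if (cfg.ngram_mod)     chain.push_back("ngram_mod");
    if (cfg.ngram_cache)   chain.push_back("ngram_cache");
    if (cfg.draft_model)   chain.push_back("draft_model");
    if (cfg.eagle3)        chain.push_back("eagle3");
    return chain;
}
```

With both ngram_simple and draft_model enabled, the resulting chain holds ngram_simple first, so the draft model is consulted only when the n-gram lookup comes back empty.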

Draft context creation:

For model-based strategies, the initialization creates a llama_context for the draft model using the context parameters stored during draft model loading. This draft context:

Manages its own KV cache for the draft model's attention computation
Operates independently of the target model's context
Has its own batch processing configuration
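The independence of the two contexts can be illustrated with a toy model. Everything in this sketch is a hypothetical stand-in: ctx_params, sketch_context, loaded_draft_model, and make_draft_context are not llama.cpp types or functions, and the "KV cache" is just a vector of token ids standing in for the real attention tensors.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for context parameters.
struct ctx_params {
    std::uint32_t n_ctx;   // maximum number of cached tokens
    std::uint32_t n_batch; // maximum tokens per decode call
};

// Hypothetical stand-in for an inference context that owns its own cache.
struct sketch_context {
    ctx_params                params;
    std::vector<std::int32_t> kv_cache; // placeholder for real KV tensors

    explicit sketch_context(const ctx_params & p) : params(p) {
        kv_cache.reserve(p.n_ctx);
    }

    // Record tokens in this context's cache only. Returns false if the batch
    // exceeds n_batch or the cache would exceed n_ctx.
    bool decode(const std::vector<std::int32_t> & tokens) {
        if (tokens.size() > params.n_batch) return false;
        if (kv_cache.size() + tokens.size() > params.n_ctx) return false;
        kv_cache.insert(kv_cache.end(), tokens.begin(), tokens.end());
        return true;
    }
};

// Hypothetical record of what was stored when the draft model was loaded.
struct loaded_draft_model {
    ctx_params cparams;
};

// Create the draft context from the parameters stored at load time.
std::unique_ptr<sketch_context> make_draft_context(const loaded_draft_model & m) {
    return std::make_unique<sketch_context>(m.cparams);
}
```

Decoding into the draft context in this model leaves the target context's cache untouched, and each context enforces its own batch limit, which mirrors the three properties listed above.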

State objects:

Each strategy type has its own state object that maintains the data structures needed for drafting:

common_speculative_state_draft: Holds references to target and draft contexts, and vocabulary mapping replacements
common_speculative_state_ngram_simple: Maintains n-gram and m-gram size configuration
common_speculative_state_ngram_map_k: Manages the n-gram map structure
common_speculative_state_ngram_mod: References a shared modular n-gram instance
common_speculative_state_ngram_cache: Manages static and dynamic n-gram caches
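To make the role of a state object concrete, the sketch below shows how the two sizes held by an n-gram simple state could drive drafting. It assumes that n is the length of the trailing n-gram to match and m is the maximum number of tokens to propose; ngram_simple_state and draft_ngram_simple are hypothetical names, and the matching rule (most recent earlier occurrence) is an assumption rather than a description of the llama.cpp implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using token = std::int32_t;

// Hypothetical state: only the two size parameters.
struct ngram_simple_state {
    std::size_t n; // length of the trailing n-gram to look up
    std::size_t m; // maximum number of tokens to draft
};

// Find the most recent earlier occurrence of the last n tokens in the history
// and propose up to m tokens that followed it. Returns an empty draft when
// there is no earlier occurrence.
std::vector<token> draft_ngram_simple(const ngram_simple_state & st,
                                      const std::vector<token> & history) {
    const std::size_t len = history.size();
    if (st.n == 0 || st.m == 0 || len <= st.n) {
        return {};
    }
    const std::size_t tail = len - st.n; // start index of the trailing n-gram
    // Scan candidate start positions from tail - 1 down to 0.
    for (std::size_t start = tail; start-- > 0; ) {
        if (std::equal(history.begin() + start,
                       history.begin() + start + st.n,
                       history.begin() + tail)) {
            const std::size_t first = start + st.n;
            const std::size_t last  = std::min(first + st.m, len);
            return std::vector<token>(history.begin() + first,
                                      history.begin() + last);
        }
    }
    return {};
}
```

For the history 1 2 3 4 1 2 with n = 2 and m = 3, the trailing pair 1 2 also occurs at the start of the history, so the sketch proposes the tokens that followed it there: 3 4 1.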
