Principle:Ggml org Llama cpp Speculation Initialization
| Field | Value |
|---|---|
| Principle Name | Speculation Initialization |
| Workflow | Speculative_Decoding |
| Step | 4 of 5 |
| Domain | Speculative Decoding System Setup |
| Scope | Setting up speculative decoding state: draft context, strategy selection, buffer allocation |
Overview
Description
Speculation initialization is the step where the configured speculation strategy is materialized into a runtime execution state. This involves creating the draft inference context (for model-based strategies), instantiating the appropriate strategy-specific state objects (n-gram maps, caches, or draft contexts), and organizing them into a prioritized list of implementations to try during generation.
The initialization bridges the gap between the static configuration (common_params_speculative) and the runtime execution engine (common_speculative struct). It validates the configuration, creates backend resources, and prepares the system for the draft-then-verify generation loop.
Usage
Speculation initialization runs once, after the target model and (if applicable) the draft model have been loaded. The resulting common_speculative object is then used throughout the generation loop for drafting and accepting tokens.
Theoretical Basis
The initialization process implements a strategy chain pattern where multiple speculation implementations can be configured simultaneously. During generation, the system tries each implementation in order and uses the first one that produces a non-empty draft. This fallback mechanism allows combining strategies for robustness.
Strategy ordering and prioritization:
The implementations are added to the chain in a specific order of preference:
1. N-gram simple -- Can produce many tokens without any model, low cost
2. N-gram map (key only) -- Structured key lookup, moderate precision
3. N-gram map (key + 4 values) -- Higher acceptance rate, more expensive
4. N-gram mod -- Modular hash-based n-gram with configurable parameters
5. N-gram cache -- Static/dynamic cache-based lookup
6. Draft model -- Highest quality drafts, highest cost
7. EAGLE3 -- Advanced multi-token prediction
This ordering ensures that cheaper strategies are attempted first. If a low-cost n-gram method can produce a draft, the expensive draft model forward pass is avoided.
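The fallback behaviour described above can be sketched in a few lines of C++. This is a minimal illustration, not the llama.cpp implementation: `draft_impl`, `draft_result`, `first_non_empty_draft`, and `make_example_chain` are hypothetical names, and the two toy strategies only stand in for a cheap n-gram lookup and a costlier draft model.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using token = std::int32_t;

// Hypothetical stand-in for one speculation implementation in the chain.
struct draft_impl {
    std::string name;
    // Returns a (possibly empty) list of proposed tokens for the given history.
    std::function<std::vector<token>(const std::vector<token> &)> propose;
};

struct draft_result {
    std::string        source; // name of the implementation that produced the draft
    std::vector<token> tokens; // empty if no implementation produced anything
};

// Try each implementation in priority order; the first non-empty draft wins.
draft_result first_non_empty_draft(const std::vector<draft_impl> & chain,
                                   const std::vector<token> & history) {
    for (const auto & impl : chain) {
        std::vector<token> tokens = impl.propose(history);
        if (!tokens.empty()) {
            return { impl.name, std::move(tokens) };
        }
    }
    return { "", {} };
}

// Toy chain: a cheap rule that only fires when the last token repeats,
// followed by a fallback that always proposes exactly one token.
std::vector<draft_impl> make_example_chain() {
    return {
        { "ngram_simple", [](const std::vector<token> & h) {
              std::vector<token> out;
              if (h.size() >= 2 && h[h.size() - 1] == h[h.size() - 2]) {
                  out.push_back(h.back());
              }
              return out;
          } },
        { "draft_model", [](const std::vector<token> & h) {
              const token next = h.empty() ? 0 : h.back() + 1;
              return std::vector<token>{ next };
          } },
    };
}
```

With this chain, a history ending in a repeated token is served by the cheap entry, and any other history falls through to the second entry, so the costly path runs only when the cheap one returns nothing.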
Draft context creation:
For model-based strategies, initialization creates a llama_context for the draft model from the context parameters that were stored when the draft model was loaded. This draft context:
- Manages its own KV cache for the draft model's attention computation
- Operates independently of the target model's context
- Has its own batch processing configuration
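The independence of the two contexts can be illustrated with a toy model. The sketch below does not use the llama.cpp API: `toy_context` and `toy_context_params` are hypothetical stand-ins for llama_context and its parameters, and a plain token vector stands in for the KV cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using token = std::int32_t;

// Hypothetical context parameters, recorded when a model is loaded.
struct toy_context_params {
    std::size_t n_ctx;   // maximum number of cached positions
    std::size_t n_batch; // maximum number of tokens per decode call
};

// Hypothetical stand-in for llama_context: each instance owns its own cache
// and enforces its own batch limit.
class toy_context {
public:
    explicit toy_context(toy_context_params params) : params_(params) {}

    void decode(const std::vector<token> & batch) {
        if (batch.size() > params_.n_batch) {
            throw std::invalid_argument("batch exceeds n_batch");
        }
        if (cache_.size() + batch.size() > params_.n_ctx) {
            throw std::length_error("context is full");
        }
        cache_.insert(cache_.end(), batch.begin(), batch.end());
    }

    std::size_t n_cached() const { return cache_.size(); }

private:
    toy_context_params params_;
    std::vector<token> cache_; // stands in for the per-context KV cache
};
```

Creating one `toy_context` for the target and another for the draft, then decoding only on the draft, leaves the target's cache untouched, which is the property the list above describes.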
State objects:
Each strategy type has its own state object that maintains the data structures needed for drafting:
- common_speculative_state_draft: Holds references to target and draft contexts, and vocabulary mapping replacements
- common_speculative_state_ngram_simple: Maintains n-gram and m-gram size configuration
- common_speculative_state_ngram_map_k: Manages the n-gram map structure
- common_speculative_state_ngram_mod: References a shared modular n-gram instance
- common_speculative_state_ngram_cache: Manages static and dynamic n-gram caches
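Putting the pieces together, initialization can be pictured as a function that validates the configuration and then builds one state object per enabled strategy, in priority order. The following is a simplified sketch under stated assumptions: every `toy_` name and every field is hypothetical, only three of the strategies are modelled, and the validation rule (a draft strategy requires a loaded draft model) is an assumption rather than a documented check.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified configuration; the real common_params_speculative
// has different fields.
struct toy_spec_params {
    bool use_ngram_simple   = false;
    bool use_ngram_cache    = false;
    bool use_draft          = false;
    bool draft_model_loaded = false;
    int  ngram_size_n       = 3;
    int  ngram_size_m       = 4;
};

// Common base so that different state types can live in one ordered list.
struct toy_state {
    virtual ~toy_state() = default;
    virtual std::string name() const = 0;
};

struct toy_state_ngram_simple : toy_state {
    int size_n;
    int size_m;
    toy_state_ngram_simple(int n, int m) : size_n(n), size_m(m) {}
    std::string name() const override { return "ngram_simple"; }
};

struct toy_state_ngram_cache : toy_state {
    std::string name() const override { return "ngram_cache"; }
};

struct toy_state_draft : toy_state {
    std::string name() const override { return "draft"; }
};

// Hypothetical runtime object: implementations in the order they are tried.
struct toy_speculative {
    std::vector<std::unique_ptr<toy_state>> impls;
};

toy_speculative toy_speculative_init(const toy_spec_params & p) {
    // Assumed validation rule: a draft strategy needs a loaded draft model.
    if (p.use_draft && !p.draft_model_loaded) {
        throw std::invalid_argument("draft strategy requested but no draft model is loaded");
    }
    toy_speculative spec;
    // Cheaper strategies are appended first so that they are tried first.
    if (p.use_ngram_simple) {
        spec.impls.push_back(std::make_unique<toy_state_ngram_simple>(p.ngram_size_n, p.ngram_size_m));
    }
    if (p.use_ngram_cache) {
        spec.impls.push_back(std::make_unique<toy_state_ngram_cache>());
    }
    if (p.use_draft) {
        spec.impls.push_back(std::make_unique<toy_state_draft>());
    }
    return spec;
}

// Helper for inspecting the resulting order.
std::vector<std::string> impl_names(const toy_speculative & spec) {
    std::vector<std::string> names;
    for (const auto & impl : spec.impls) {
        names.push_back(impl->name());
    }
    return names;
}
```

The order of the list depends only on the order of the checks in `toy_speculative_init`, not on the order in which options were set, which keeps the priority fixed regardless of how the configuration was assembled.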