Principle:Ggml org Llama cpp Speculation Initialization

From Leeroopedia
Principle Name: Speculation Initialization
Workflow: Speculative_Decoding
Step: 4 of 5
Domain: Speculative Decoding System Setup
Scope: Setting up speculative decoding state: draft context, strategy selection, buffer allocation

Overview

Description

Speculation initialization is the step where the configured speculation strategy is materialized into a runtime execution state. This involves creating the draft inference context (for model-based strategies), instantiating the appropriate strategy-specific state objects (n-gram maps, caches, or draft contexts), and organizing them into a prioritized list of implementations to try during generation.

The initialization bridges the gap between the static configuration (common_params_speculative) and the runtime execution engine (common_speculative struct). It validates the configuration, creates backend resources, and prepares the system for the draft-then-verify generation loop.

Usage

Speculation initialization runs once, after the target model and (if applicable) the draft model have been loaded. The resulting common_speculative object is then used throughout the generation loop for drafting and accepting tokens.

Theoretical Basis

The initialization process implements a strategy chain pattern where multiple speculation implementations can be configured simultaneously. During generation, the system tries each implementation in order and uses the first one that produces a non-empty draft. This fallback mechanism allows combining strategies for robustness.
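The fallback behavior described above can be sketched in a few lines of C++. This is a minimal illustration, not the llama.cpp code: spec_impl, draft_result, and draft_with_fallback are hypothetical names, and token is a plain int standing in for the library's token type.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using token        = int;                 // stand-in for the library's token type
using draft_tokens = std::vector<token>;

// Hypothetical chain entry: a label plus a function that proposes draft tokens.
struct spec_impl {
    std::string name;
    std::function<draft_tokens(const std::vector<token> & history)> draft;
};

// Outcome of one drafting attempt; impl_index stays -1 if nothing drafted.
struct draft_result {
    int          impl_index = -1;
    draft_tokens tokens;
};

// Try each implementation in chain order and return the first non-empty draft.
draft_result draft_with_fallback(const std::vector<spec_impl> & chain,
                                 const std::vector<token> & history) {
    for (std::size_t i = 0; i < chain.size(); ++i) {
        assert(static_cast<bool>(chain[i].draft)); // every entry needs a draft function
        draft_tokens out = chain[i].draft(history);
        if (!out.empty()) {
            draft_result res;
            res.impl_index = static_cast<int>(i);
            res.tokens     = std::move(out);
            return res;
        }
    }
    return draft_result(); // no implementation produced a draft
}
```

With a chain whose first entry returns an empty draft and whose second returns one token, the call reports index 1 and that token; later entries are never invoked once an earlier one succeeds.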

Strategy ordering and prioritization:

The implementations are added to the chain in a specific order of preference:

  1. N-gram simple -- Can produce many tokens without any model, low cost
  2. N-gram map (key only) -- Structured key lookup, moderate precision
  3. N-gram map (key + 4 values) -- Higher acceptance rate, more expensive
  4. N-gram mod -- Modular hash-based n-gram with configurable parameters
  5. N-gram cache -- Static/dynamic cache-based lookup
  6. Draft model -- Highest quality drafts, highest cost
  7. EAGLE3 -- Advanced multi-token prediction

This ordering ensures that cheaper strategies are attempted first. If a low-cost n-gram method can produce a draft, the expensive draft model forward pass is avoided.
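One way to picture the ordering rule is an enum whose numeric values follow the preference list above, so that sorting the enabled strategies yields the chain order. The sketch below assumes exactly that; spec_type and build_chain_order are hypothetical names chosen for illustration and are not taken from llama.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical enum: enumerator order mirrors the preference list above.
enum class spec_type {
    ngram_simple = 0,
    ngram_map_k,
    ngram_map_k4v,
    ngram_mod,
    ngram_cache,
    draft_model,
    eagle3,
};

// Return the enabled strategies in chain order, with duplicates removed.
std::vector<spec_type> build_chain_order(std::vector<spec_type> enabled) {
    std::sort(enabled.begin(), enabled.end());
    enabled.erase(std::unique(enabled.begin(), enabled.end()), enabled.end());
    // Sanity check: the result follows the preference order.
    assert(std::is_sorted(enabled.begin(), enabled.end()));
    return enabled;
}
```

If a configuration enables the draft model, the n-gram cache, and n-gram simple in any order, the resulting chain is n-gram simple, then n-gram cache, then the draft model, so the model forward pass is only reached when both lookups come back empty.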

Draft context creation:

For model-based strategies, the initialization creates a llama_context for the draft model using the context parameters stored during draft model loading. This draft context:

  • Manages its own KV cache for the draft model's attention computation
  • Operates independently of the target model's context
  • Has its own batch processing configuration

State objects:

Each strategy type has its own state object that maintains the data structures needed for drafting:

  • common_speculative_state_draft: Holds references to target and draft contexts, and vocabulary mapping replacements
  • common_speculative_state_ngram_simple: Maintains n-gram and m-gram size configuration
  • common_speculative_state_ngram_map_k: Manages the n-gram map structure
  • common_speculative_state_ngram_mod: References a shared modular n-gram instance
  • common_speculative_state_ngram_cache: Manages static and dynamic n-gram caches
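To make the idea of a per-strategy state object concrete, here is a small sketch of what an n-gram simple state and its drafting step might look like. The text above only says that this state keeps the n-gram and m-gram sizes; the lookup rule used here (match the trailing n tokens against earlier history, then copy up to m following tokens) is an assumption for illustration, and ngram_simple_state and ngram_simple_draft are hypothetical names rather than the llama.cpp definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using token = int; // stand-in for the library's token type

// Hypothetical state: only the two size parameters are kept.
struct ngram_simple_state {
    std::size_t size_n; // length of the lookup key (trailing n tokens)
    std::size_t size_m; // maximum number of tokens copied into the draft
};

// Assumed rule: find the most recent earlier occurrence of the trailing
// n-gram and propose the (up to) m tokens that followed it.
std::vector<token> ngram_simple_draft(const ngram_simple_state & st,
                                      const std::vector<token> & history) {
    assert(st.size_n > 0 && st.size_m > 0);
    const std::size_t len = history.size();
    if (len <= st.size_n) {
        return {}; // not enough history to hold a key plus an earlier match
    }
    const std::size_t key_pos = len - st.size_n; // start of the trailing n-gram
    // Scan backwards so the most recent earlier match wins.
    for (std::size_t pos = key_pos; pos-- > 0; ) {
        if (std::equal(history.begin() + key_pos, history.end(),
                       history.begin() + pos)) {
            const std::size_t first = pos + st.size_n; // token right after the match
            const std::size_t last  = std::min(first + st.size_m, len);
            return std::vector<token>(history.begin() + first,
                                      history.begin() + last);
        }
    }
    return {}; // no earlier occurrence: empty draft, so the chain falls through
}
```

Returning an empty vector when nothing matches is what lets a chain of implementations move on to the next strategy.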
