
Principle:Sail sg LongSpec GLIDE Model Initialization

From Leeroopedia
Knowledge Sources
Domains Speculative_Decoding, Model_Architecture, LLM_Inference
Last Updated 2026-02-14 05:00 GMT

Overview

Architectural principle for constructing a lightweight draft model by attaching a single cross-attention decoder layer to a frozen target Large Language Model.

Description

GLIDE (Global-Local Informed Draft Engine) Model Initialization defines how to construct a speculative decoding draft model that reuses the target LLM's representations. Rather than training a separate, smaller model from scratch, a single decoder layer is attached to the frozen target LLM. This layer combines:

  • Cross-attention to access the target LLM's key-value cache (global context)
  • Sliding-window self-attention for local context modeling
  • Feed-forward network for non-linear transformation

The target LLM (e.g., Qwen2, Llama) is loaded in full precision and frozen; only the draft layer's parameters are trainable. This dramatically reduces the number of trainable parameters (from billions to millions) while leveraging the target LLM's learned representations.
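
As a minimal sketch (assuming PyTorch-style modules; `freeze_target` is an illustrative name, not the project's actual API), freezing the target while leaving only the draft layer trainable looks like:

```python
import torch
from torch import nn

def freeze_target(target_llm: nn.Module, draft_layer: nn.Module) -> None:
    """Freeze every target parameter; only the draft layer stays trainable."""
    for p in target_llm.parameters():
        p.requires_grad = False
    for p in draft_layer.parameters():
        p.requires_grad = True
```

The optimizer is then constructed over `draft_layer.parameters()` only, so gradient memory scales with millions rather than billions of parameters.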

The initialization also optionally loads pre-trained draft layer weights from a previous training stage, enabling multi-stage progressive training.

Usage

Use this principle when constructing speculative decoding systems that need a lightweight draft model tightly coupled to the target LLM. This is preferable to independent draft models because:

  • The draft model directly attends to the target's KV cache, avoiding redundant computation
  • Only a single decoder layer needs training, requiring minimal compute and data
  • The architecture inherently maintains output distribution alignment with the target

Theoretical Basis

The GLIDE architecture exploits the observation that a pre-trained LLM's hidden states already contain rich token-level predictions. A single cross-attention layer can learn to extract next-token predictions from these hidden states efficiently.

Architecture:

# Abstract GLIDE architecture (not actual implementation)
class GLIDEDraftModel:
    target_llm: FrozenLLM           # Frozen, provides KV cache
    cross_attention: CrossAttnLayer  # Attends to target's KV cache
    self_attention: SelfAttnLayer    # Sliding-window local context
    ffn: FeedForwardNetwork          # Non-linear transformation
    lm_head: SharedLMHead            # Shared with target (frozen)
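
The sliding-window self-attention above restricts each draft position to a fixed window of recent tokens. A minimal boolean-mask sketch (illustrative, not the actual implementation):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to j iff i - window < j <= i
    (causal attention limited to the last `window` tokens)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)
```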

The draft model produces predictions by:

  1. Running the target LLM's prefill to populate the KV cache
  2. Using cross-attention to query the target's hidden states for next-token information
  3. Combining with self-attention over the draft model's own recent predictions
  4. Projecting through the shared (frozen) language modeling head
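
The steps above can be sketched as a single PyTorch module. `GlideDraftLayer`, its sub-layer choices, and the omitted sliding-window mask and KV-cache plumbing are simplifications for illustration, not the actual implementation:

```python
import torch
from torch import nn

class GlideDraftLayer(nn.Module):
    """Sketch of the draft layer's forward pass (shapes only; mask omitted)."""

    def __init__(self, d_model: int, n_heads: int, vocab: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.lm_head = nn.Linear(d_model, vocab, bias=False)  # shared with target, frozen

    def forward(self, x: torch.Tensor, target_hidden: torch.Tensor) -> torch.Tensor:
        # local context over the draft's own recent tokens (window mask omitted)
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # global context: query the frozen target's hidden states / KV cache
        x = x + self.cross_attn(x, target_hidden, target_hidden, need_weights=False)[0]
        x = x + self.ffn(x)
        # project through the shared (frozen) language modeling head
        return self.lm_head(x)
```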

The loss function is standard cross-entropy, optionally fused via Liger kernel for efficiency:

$\mathcal{L} = -\sum_t \log P_{\mathrm{draft}}(x_{t+1} \mid x_{\le t}, \mathrm{KV}_{\mathrm{target}})$
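
A minimal sketch of this objective in plain PyTorch (without the optional Liger fused kernel; `draft_loss` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def draft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: average of -log P_draft(x_{t+1} | context)
    over all positions. logits: (batch, seq, vocab); targets: (batch, seq)."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```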

Related Pages

Implemented By
