Principle: sail-sg/LongSpec GLIDE Model Initialization
| Knowledge Sources | |
|---|---|
| Domains | Speculative_Decoding, Model_Architecture, LLM_Inference |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Architectural principle for constructing a lightweight draft model by attaching a single cross-attention decoder layer to a frozen target Large Language Model.
Description
GLIDE (Global-Local Informed Draft Engine) Model Initialization defines how to construct a speculative decoding draft model that reuses the target LLM's representations. The core idea is that instead of training a separate smaller model, a single decoder layer is attached to the frozen target LLM. This layer uses:
- Cross-attention to access the target LLM's key-value cache (global context)
- Sliding-window self-attention for local context modeling
- Feed-forward network for non-linear transformation
The target LLM (e.g., Qwen2, Llama) is loaded in full precision and frozen; only the draft layer's parameters are trainable. This dramatically reduces the number of trainable parameters (from billions to millions) while leveraging the target LLM's learned representations.
The initialization also optionally loads pre-trained draft layer weights from a previous training stage, enabling multi-stage progressive training.
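The freeze-and-attach initialization described above can be sketched in plain Python. This is an illustrative model of the logic only, not the actual LongSpec API: parameters are stand-in dict entries with a `trainable` flag, and the names `init_glide_model` and `draft_ckpt` are hypothetical.

```python
# Hypothetical sketch of GLIDE-style initialization (not the LongSpec API).
# Parameters are modeled as dict entries with a `trainable` flag.

def init_glide_model(target_params, draft_ckpt=None):
    """Freeze every target parameter; attach a trainable draft layer.

    draft_ckpt optionally supplies pre-trained draft-layer weights from an
    earlier training stage (multi-stage progressive training).
    """
    model = {}
    # 1. Target LLM: loaded in full precision, then frozen.
    for name, weight in target_params.items():
        model[f"target.{name}"] = {"weight": weight, "trainable": False}
    # 2. Draft layer: the only trainable parameters. A fresh (placeholder)
    #    init is used unless a previous-stage checkpoint is given.
    draft_init = draft_ckpt or {
        "cross_attention": 0.0,
        "self_attention": 0.0,
        "ffn": 0.0,
    }
    for name, weight in draft_init.items():
        model[f"draft.{name}"] = {"weight": weight, "trainable": True}
    return model

model = init_glide_model({"layer0.q_proj": 1.0, "lm_head": 2.0})
trainable = [n for n, p in model.items() if p["trainable"]]
```

Passing a stage-1 checkpoint as `draft_ckpt` reproduces the optional weight-loading path: the draft layer resumes from trained weights while the target stays frozen either way.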
Usage
Use this principle when constructing speculative decoding systems that need a lightweight draft model tightly coupled to the target LLM. This is preferable to independent draft models because:
- The draft model directly attends to the target's KV cache, avoiding redundant computation
- Only a single decoder layer needs training, requiring minimal compute and data
- The architecture inherently maintains output distribution alignment with the target
Theoretical Basis
The GLIDE architecture exploits the observation that a pre-trained LLM's hidden states already contain rich token-level predictions. A single cross-attention layer can learn to extract next-token predictions from these hidden states efficiently.
Architecture:
```python
# Abstract GLIDE architecture (not actual implementation)
class GLIDEDraftModel:
    target_llm: FrozenLLM            # Frozen, provides KV cache
    cross_attention: CrossAttnLayer  # Attends to target's KV cache
    self_attention: SelfAttnLayer    # Sliding-window local context
    ffn: FeedForwardNetwork          # Non-linear transformation
    lm_head: SharedLMHead            # Shared with target (frozen)
```
The draft model produces predictions by:
- Running the target LLM's prefill to populate the KV cache
- Using cross-attention to query the target's hidden states for next-token information
- Combining with self-attention over the draft model's own recent predictions
- Projecting through the shared (frozen) language modeling head
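The cross-attention step above can be sketched as generic scaled dot-product attention in which draft-side hidden states query the target's cached keys and values. This is a minimal single-head sketch with illustrative shapes, not LongSpec's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (max-subtraction before exponentiation)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(draft_hidden, target_kv, d_head):
    """Draft queries attend over the target LLM's cached keys/values."""
    q = draft_hidden                    # (t_draft, d) draft-side queries
    k, v = target_kv                    # (t_target, d) from target KV cache
    scores = q @ k.T / np.sqrt(d_head)  # (t_draft, t_target) similarities
    return softmax(scores) @ v          # (t_draft, d) global-context output

rng = np.random.default_rng(0)
d = 8
draft_hidden = rng.normal(size=(3, d))   # 3 draft positions
target_kv = (rng.normal(size=(10, d)),   # keys from 10 prefill positions
             rng.normal(size=(10, d)))   # values from the same cache
out = cross_attend(draft_hidden, target_kv, d_head=d)
```

Because `target_kv` is exactly the cache the target LLM already produced during prefill, this step reuses the target's computation rather than re-encoding the context.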
The loss function is standard cross-entropy, optionally fused via the Liger kernel for efficiency.
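As a sketch of that objective, here is plain token-level cross-entropy over draft logits in numpy; the Liger fused variant is an efficiency optimization with the same mathematical result and is not shown.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of target tokens under the draft logits."""
    # Stable log-softmax: subtract the row max before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.4]])   # (tokens, vocab) draft predictions
targets = np.array([0, 1])             # ground-truth next tokens
loss = cross_entropy(logits, targets)
```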