
Principle:Sail sg LongSpec GLIDE Model Initialization

From Leeroopedia
Knowledge Sources
Domains Speculative_Decoding, Model_Architecture, LLM_Inference
Last Updated 2026-02-14 05:00 GMT

Overview

Architectural principle for constructing a lightweight draft model by attaching a single cross-attention decoder layer to a frozen target Large Language Model.

Description

GLIDE (Global-Local Informed Draft Engine) Model Initialization defines how to construct a speculative decoding draft model that reuses the target LLM's representations. Rather than training a separate, smaller model from scratch, a single decoder layer is attached to the frozen target LLM. This layer combines:

  • Cross-attention to access the target LLM's key-value cache (global context)
  • Sliding-window self-attention for local context modeling
  • Feed-forward network for non-linear transformation

The target LLM (e.g., Qwen2, Llama) is loaded in full precision and frozen; only the draft layer's parameters are trainable. This dramatically reduces the number of trainable parameters (from billions to millions) while leveraging the target LLM's learned representations.
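
As a minimal sketch (assuming PyTorch-style modules; `freeze_target` is an illustrative name, not the project's actual API), freezing the target while leaving only the draft layer trainable looks like:

```python
import torch
from torch import nn

def freeze_target(target_llm: nn.Module, draft_layer: nn.Module) -> None:
    """Freeze every target parameter; only the draft layer stays trainable."""
    for p in target_llm.parameters():
        p.requires_grad = False
    for p in draft_layer.parameters():
        p.requires_grad = True
```

The optimizer is then constructed over `draft_layer.parameters()` only, so gradient memory scales with millions rather than billions of parameters.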

The initialization also optionally loads pre-trained draft layer weights from a previous training stage, enabling multi-stage progressive training.

Usage

Use this principle when constructing speculative decoding systems that need a lightweight draft model tightly coupled to the target LLM. This is preferable to independent draft models because:

  • The draft model directly attends to the target's KV cache, avoiding redundant computation
  • Only a single decoder layer needs training, requiring minimal compute and data
  • The architecture inherently maintains output distribution alignment with the target

Theoretical Basis

The GLIDE architecture exploits the observation that a pre-trained LLM's hidden states already contain rich token-level predictions. A single cross-attention layer can learn to extract next-token predictions from these hidden states efficiently.

Architecture:

# Abstract GLIDE architecture (not actual implementation)
class GLIDEDraftModel:
    target_llm: FrozenLLM           # Frozen, provides KV cache
    cross_attention: CrossAttnLayer  # Attends to target's KV cache
    self_attention: SelfAttnLayer    # Sliding-window local context
    ffn: FeedForwardNetwork          # Non-linear transformation
    lm_head: SharedLMHead            # Shared with target (frozen)
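
The sliding-window self-attention above restricts each draft position to a fixed window of recent tokens. A minimal boolean-mask sketch (illustrative, not the actual implementation):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to j iff i - window < j <= i
    (causal attention limited to the last `window` tokens)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)
```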

The draft model produces predictions by:

  1. Running the target LLM's prefill to populate the KV cache
  2. Using cross-attention to query the target's hidden states for next-token information
  3. Combining with self-attention over the draft model's own recent predictions
  4. Projecting through the shared (frozen) language modeling head
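
The steps above can be sketched as a single PyTorch module. `GlideDraftLayer`, its sub-layer choices, and the omitted sliding-window mask and KV-cache plumbing are simplifications for illustration, not the actual implementation:

```python
import torch
from torch import nn

class GlideDraftLayer(nn.Module):
    """Sketch of the draft layer's forward pass (shapes only; mask omitted)."""

    def __init__(self, d_model: int, n_heads: int, vocab: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.lm_head = nn.Linear(d_model, vocab, bias=False)  # shared with target, frozen

    def forward(self, x: torch.Tensor, target_hidden: torch.Tensor) -> torch.Tensor:
        # local context over the draft's own recent tokens (window mask omitted)
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # global context: query the frozen target's hidden states / KV cache
        x = x + self.cross_attn(x, target_hidden, target_hidden, need_weights=False)[0]
        x = x + self.ffn(x)
        # project through the shared (frozen) language modeling head
        return self.lm_head(x)
```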

The loss function is standard cross-entropy, optionally fused via Liger kernel for efficiency:

$\mathcal{L} = -\sum_t \log P_{\mathrm{draft}}(x_{t+1} \mid x_{\le t}, \mathrm{KV}_{\mathrm{target}})$
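
A minimal sketch of this objective in plain PyTorch (without the optional Liger fused kernel; `draft_loss` is an illustrative name):

```python
import torch
import torch.nn.functional as F

def draft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: average of -log P_draft(x_{t+1} | context)
    over all positions. logits: (batch, seq, vocab); targets: (batch, seq)."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```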

Related Pages

Implemented By
