Principle:Sail sg LongSpec Inference Model Loading

Knowledge Sources	LongSpec LongSpec
Domains	LLM_Inference, Model_Architecture, Speculative_Decoding
Last Updated	2026-02-14 05:00 GMT

Overview

Principle for loading a trained GLIDE draft model alongside its target LLM for speculative decoding inference, with pre-allocated KV caches and CUDA initialization.

Description

Inference Model Loading is the inference-time counterpart of GLIDE Model Initialization. While training initialization creates a fresh or partially-trained draft layer, inference loading assembles a fully-trained system ready for generation:

Load the target LLM (Llama or Qwen2) from HuggingFace in float16 with auto device mapping
Load the trained GLIDE draft layer from the published checkpoint (e.g., "sail/longspec-QwQ-32B-Preview")
Pre-allocate KV caches for both target and draft models based on expected sequence lengths
Configure model-specific tokens (pad_token_id, eos_token_id) for proper generation termination
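
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's loader: it uses the generic Hugging Face AutoModelForCausalLM as a stand-in for the inference-side model classes in longspec/test/, and the helper names (target_load_kwargs, resolve_generation_tokens, load_target) and the pad-to-EOS fallback are hypothetical choices made for this example. Draft-layer loading and KV cache allocation are left out because their interfaces are not described on this page.

```python
from typing import Any, Dict, Optional, Tuple


def target_load_kwargs() -> Dict[str, Any]:
    """Loading options for the target LLM: float16 weights, automatic device mapping."""
    return {"torch_dtype": "float16", "device_map": "auto"}


def resolve_generation_tokens(pad_token_id: Optional[int],
                              eos_token_id: Optional[int]) -> Dict[str, int]:
    """Make sure both termination-related token ids are set.

    Assumption: when the tokenizer defines no pad token, reuse the EOS id.
    """
    if eos_token_id is None:
        raise ValueError("eos_token_id is required so generation can terminate")
    if pad_token_id is None:
        pad_token_id = eos_token_id
    return {"pad_token_id": pad_token_id, "eos_token_id": eos_token_id}


def load_target(path: str) -> Tuple[Any, Any]:
    """Load a target LLM and its tokenizer (stand-in for the repository's classes)."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = target_load_kwargs()
    kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])  # torch.float16
    model = AutoModelForCausalLM.from_pretrained(path, **kwargs).eval()
    tokenizer = AutoTokenizer.from_pretrained(path)

    tokens = resolve_generation_tokens(tokenizer.pad_token_id, tokenizer.eos_token_id)
    model.generation_config.pad_token_id = tokens["pad_token_id"]
    model.generation_config.eos_token_id = tokens["eos_token_id"]
    return model, tokenizer
```

Calling load_target requires torch, transformers, and accelerate (for device_map="auto"); the two helper functions run with the standard library alone.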

The inference-side models (in longspec/test/) are distinct from training-side models: they include generation methods (tree_spec_generate, spec_generate, vanilla_generate) and KV cache management that the training models do not have.

Usage

Use this principle when setting up GLIDE inference for benchmarking or deployment. The model is loaded once and reused across multiple inference calls. The model is selected by short name through a name-to-path registry:

"vicuna7b": Llama-based 7B model
"llama8b": Llama 3.1 8B Instruct
"qwq": Qwen2-based QwQ-32B-Preview (for math reasoning)

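A name-to-path registry of this kind might look like the sketch below. Only the three registry keys and the QwQ draft checkpoint name ("sail/longspec-QwQ-32B-Preview") come from this page; the ModelEntry structure, the resolve_model helper, and every path written in angle brackets are hypothetical placeholders, not the repository's actual values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelEntry:
    family: str       # architecture family: "llama" or "qwen2"
    target_path: str  # where the target LLM is loaded from
    draft_path: str   # where the trained GLIDE draft layer is loaded from


# Angle-bracket strings are placeholders to be filled in by the user.
MODEL_REGISTRY: Dict[str, ModelEntry] = {
    "vicuna7b": ModelEntry("llama", "<vicuna-7b-target-path>", "<vicuna-7b-draft-path>"),
    "llama8b": ModelEntry("llama", "<llama-3.1-8b-target-path>", "<llama-3.1-8b-draft-path>"),
    "qwq": ModelEntry("qwen2", "<qwq-32b-preview-target-path>",
                      "sail/longspec-QwQ-32B-Preview"),
}


def resolve_model(name: str) -> ModelEntry:
    """Look up a registry entry, failing with a readable message for unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"unknown model name {name!r}; expected one of: {known}") from None


entry = resolve_model("qwq")
print(entry.family, entry.draft_path)
```

Keeping the architecture family next to the paths lets the caller pick the matching Llama or Qwen2 inference class from a single lookup.
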
Theoretical Basis

The inference model assembly follows the same architectural principle as training initialization (a frozen target LLM with an attached draft layer), but adds several inference-time concerns:

KV cache allocation: Pre-sized contiguous memory buffers for key-value states at each transformer layer
Attention mode dispatch: The model must support multiple execution modes (prefill, decoding, tree_decoding) via the exec_type parameter
Flash Attention integration: The inference path uses Flash Attention 2 with KV cache for efficient prefill and incremental decoding
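
The first two concerns can be illustrated without a GPU. The sketch below computes the size of a pre-allocated KV buffer and tracks how much of it is filled under each exec_type value. The class names, the (2, layers, batch, seq, kv_heads, head_dim) layout, the reset-on-prefill behavior, and the rollback helper are assumptions made for illustration, not the repository's cache implementation. The example dimensions (32 layers, 8 KV heads, head size 128) correspond to Llama 3.1 8B, and the 32768-token limit is an arbitrary example.

```python
from dataclasses import dataclass
from typing import Tuple

EXEC_TYPES = ("prefill", "decoding", "tree_decoding")


@dataclass(frozen=True)
class KVCacheSpec:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    max_seq_len: int
    batch_size: int = 1
    bytes_per_elem: int = 2  # float16

    def shape(self) -> Tuple[int, ...]:
        # Leading 2 = one buffer for keys, one for values (assumed layout).
        return (2, self.num_layers, self.batch_size,
                self.max_seq_len, self.num_kv_heads, self.head_dim)

    def num_bytes(self) -> int:
        total = self.bytes_per_elem
        for dim in self.shape():
            total *= dim
        return total


class CacheCursor:
    """Tracks the filled prefix of a pre-sized cache; no tensors are allocated here."""

    def __init__(self, spec: KVCacheSpec) -> None:
        self.spec = spec
        self.cur_len = 0

    def reserve(self, exec_type: str, num_tokens: int) -> Tuple[int, int]:
        """Return the [start, end) slots that the new tokens would occupy."""
        if exec_type not in EXEC_TYPES:
            raise ValueError(f"unknown exec_type: {exec_type!r}")
        if exec_type == "prefill":
            self.cur_len = 0  # assumption: prefill starts a fresh sequence
        start, end = self.cur_len, self.cur_len + num_tokens
        if end > self.spec.max_seq_len:
            raise ValueError("pre-allocated KV cache is too small for this request")
        self.cur_len = end
        return start, end

    def rollback(self, num_rejected: int) -> None:
        """Drop rejected draft tokens by moving the cursor back (simplified:
        assumes the accepted tokens already sit at the front of the reserved range)."""
        if not 0 <= num_rejected <= self.cur_len:
            raise ValueError("cannot roll back more tokens than are cached")
        self.cur_len -= num_rejected


spec = KVCacheSpec(num_layers=32, num_kv_heads=8, head_dim=128, max_seq_len=32768)
cursor = CacheCursor(spec)
print(spec.num_bytes() / 2**30)            # GiB needed for this cache
print(cursor.reserve("prefill", 1000))     # prompt fills slots [0, 1000)
print(cursor.reserve("tree_decoding", 8))  # candidate tree uses [1000, 1008)
cursor.rollback(5)                         # keep 3 accepted tokens
print(cursor.cur_len)
```

With these dimensions the float16 buffer comes to 4 GiB for a single 32768-token sequence, which is why sizing the cache from the expected sequence length matters before any generation call is made.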

Related Pages

Implemented By

Implementation:Sail_sg_LongSpec_Glide_Inference_Init
