Principle:Sail sg LongSpec Inference Model Loading
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Model_Architecture, Speculative_Decoding |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Principle for loading a trained GLIDE draft model alongside its target LLM for speculative decoding inference, with pre-allocated KV caches and CUDA initialization.
Description
Inference Model Loading is the inference-time counterpart of GLIDE Model Initialization. Training initialization creates a fresh or partially trained draft layer; inference loading assembles a fully trained system that is ready for generation:
- Load the target LLM (Llama or Qwen2) from Hugging Face in float16 with automatic device mapping
- Load the trained GLIDE draft layer from the published checkpoint (e.g., "sail/longspec-QwQ-32B-Preview")
- Pre-allocate KV caches for both target and draft models based on expected sequence lengths
- Configure model-specific tokens (pad_token_id, eos_token_id) for proper generation termination
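The target-model and token-configuration steps above can be sketched as follows. This is a minimal illustration, not the LongSpec loader itself: `load_target` and `resolve_generation_tokens` are hypothetical helper names, the fallback of `pad_token_id` to `eos_token_id` is an assumed policy for tokenizers that define no pad token, and `device_map="auto"` assumes the `accelerate` package is installed. Loading the GLIDE draft layer and allocating the KV caches go through the inference-side classes in longspec/test/ and are left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationTokens:
    pad_token_id: int
    eos_token_id: int


def resolve_generation_tokens(
    pad_token_id: Optional[int], eos_token_id: Optional[int]
) -> GenerationTokens:
    """Pick the token ids used for padding and for stopping generation.

    Assumed policy: reuse EOS for padding when no pad token is defined.
    """
    if eos_token_id is None:
        raise ValueError("eos_token_id is required so generation can terminate")
    if pad_token_id is None:
        pad_token_id = eos_token_id
    return GenerationTokens(pad_token_id, eos_token_id)


def load_target(model_path: str):
    """Load the target LLM in float16 with automatic device placement."""
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # half precision, as described above
        device_map="auto",          # let accelerate place the weights
    )
    model.eval()  # inference only; the target model is never trained here
    tokens = resolve_generation_tokens(
        tokenizer.pad_token_id, tokenizer.eos_token_id
    )
    return model, tokenizer, tokens
```

Because the model is loaded once and reused, a caller would typically invoke `load_target` a single time at startup and keep the returned objects for all later generation calls.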
The inference-side models (in longspec/test/) are distinct from the training-side models: they add generation methods (tree_spec_generate, spec_generate, vanilla_generate) and KV cache management, neither of which the training models have.
Usage
Use this principle when setting up GLIDE inference for benchmarking or deployment. The model is loaded once and reused across multiple inference calls. The model is selected by short name through a name-to-path registry:
- "vicuna7b": Llama-based 7B model
- "llama8b": Llama 3.1 8B Instruct
- "qwq": Qwen2-based QwQ-32B-Preview (for math reasoning)
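A name-to-path registry of this kind might look like the sketch below. `ModelEntry`, `MODEL_REGISTRY`, and `resolve_model` are hypothetical names, and every path in angle brackets is a placeholder because this page does not state it; only the QwQ draft checkpoint name comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    family: str       # "llama" or "qwen2"; selects the inference model class
    target_path: str  # target LLM location (placeholder where unknown)
    draft_path: str   # trained GLIDE draft checkpoint


# Placeholder paths are illustrative only; substitute the real locations.
MODEL_REGISTRY = {
    "vicuna7b": ModelEntry("llama", "<vicuna-7b-target>", "<vicuna-7b-draft>"),
    "llama8b": ModelEntry("llama", "<llama-3.1-8b-instruct-target>", "<llama-8b-draft>"),
    "qwq": ModelEntry("qwen2", "<qwq-32b-preview-target>", "sail/longspec-QwQ-32B-Preview"),
}


def resolve_model(name: str) -> ModelEntry:
    """Look up a short model name, failing with the list of valid names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"unknown model {name!r}; expected one of: {known}") from None
```

Keeping the family next to the paths lets one lookup decide both which weights to load and whether the Llama or Qwen2 code path applies.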
Theoretical Basis
The inference model assembly follows the same architectural principle as training initialization (a frozen target LLM with an attached draft layer) but adds several inference-time concerns:
- KV cache allocation: Pre-sized contiguous memory buffers for key-value states at each transformer layer
- Attention mode dispatch: The model must support multiple execution modes (prefill, decoding, tree_decoding) via the exec_type parameter
- Flash Attention integration: The inference path uses Flash Attention 2 with KV cache for efficient prefill and incremental decoding
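The first two concerns can be illustrated with a toy sketch. It is not the LongSpec implementation: `kv_cache_bytes` and `PreallocatedKVCache` are hypothetical names, a plain Python list stands in for a contiguous GPU tensor, and the rule that tree candidates stay tentative until an accepted subset is committed is an assumption about how verification interacts with the cache. The size formula counts one key buffer and one value buffer per layer.

```python
EXEC_TYPES = ("prefill", "decoding", "tree_decoding")


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for a pre-sized KV cache (float16 is 2 bytes per element)."""
    # The leading factor 2 covers the separate key and value buffers.
    return (2 * num_layers * batch_size * max_len
            * num_kv_heads * head_dim * bytes_per_elem)


class PreallocatedKVCache:
    """Fixed-capacity cache; `length` counts committed positions."""

    def __init__(self, max_len):
        self.slots = [None] * max_len  # stand-in for a contiguous buffer
        self.length = 0

    def step(self, entries, exec_type):
        """Write new entries according to the execution mode."""
        if exec_type not in EXEC_TYPES:
            raise ValueError(f"unknown exec_type: {exec_type!r}")
        if exec_type == "prefill" and self.length != 0:
            raise RuntimeError("prefill expects an empty cache")
        if exec_type == "decoding" and len(entries) != 1:
            raise ValueError("decoding appends exactly one position")
        end = self.length + len(entries)
        if end > len(self.slots):
            raise RuntimeError("write exceeds the pre-allocated capacity")
        self.slots[self.length:end] = entries
        if exec_type != "tree_decoding":
            self.length = end  # tree candidates stay tentative until commit

    def commit(self, accepted_offsets):
        """Keep only accepted tree candidates, compacted in place.

        `accepted_offsets` must be ascending offsets into the last tree write.
        """
        base = self.length
        for i, offset in enumerate(accepted_offsets):
            self.slots[base + i] = self.slots[base + offset]
        self.length = base + len(accepted_offsets)
```

As a worked figure, Llama 3.1 8B has 32 layers, 8 KV heads, and a head dimension of 128, so `kv_cache_bytes(32, 8, 128, 1)` gives 131072 bytes (128 KiB) per token in float16, and a 32768-token buffer takes 4 GiB. Sizing the buffer once up front avoids reallocating as the sequence grows during decoding.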