Principle:Sail sg LongSpec Inference Model Loading

From Leeroopedia
Knowledge Sources
Domains LLM_Inference, Model_Architecture, Speculative_Decoding
Last Updated 2026-02-14 05:00 GMT

Overview

Principle for loading a trained GLIDE draft model alongside its target LLM for speculative decoding inference, with pre-allocated KV caches and CUDA initialization.

Description

Inference Model Loading is the inference-time counterpart of GLIDE Model Initialization. Training initialization creates a fresh or partially trained draft layer; inference loading assembles a fully trained system ready for generation:

  • Load the target LLM (Llama or Qwen2) from Hugging Face in float16 with automatic device mapping
  • Load the trained GLIDE draft layer from the published checkpoint (e.g., "sail/longspec-QwQ-32B-Preview")
  • Pre-allocate KV caches for both target and draft models based on expected sequence lengths
  • Configure model-specific tokens (pad_token_id, eos_token_id) for proper generation termination
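
A minimal sketch of the target-side loading steps above, assuming the Hugging Face transformers library (with accelerate installed for device_map="auto"). The generic AutoModelForCausalLM class stands in for the repository's own inference-side model classes, and the pad-to-EOS fallback in resolve_special_tokens is an assumption rather than documented LongSpec behavior. Loading the GLIDE draft layer and allocating the KV caches are repository-specific and are left out.

```python
from typing import Optional, Tuple


def resolve_special_tokens(pad_token_id: Optional[int], eos_token_id: int) -> Tuple[int, int]:
    """Return (pad_token_id, eos_token_id), reusing EOS when no pad token is defined.

    The fallback is an assumption made for this sketch.
    """
    if pad_token_id is None:
        pad_token_id = eos_token_id
    return pad_token_id, eos_token_id


def load_target(target_path: str):
    """Load the target LLM and its tokenizer in float16 with automatic device placement."""
    # Imported lazily so the helper above stays usable without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_path)
    model = AutoModelForCausalLM.from_pretrained(
        target_path,
        torch_dtype=torch.float16,  # half precision, as described above
        device_map="auto",          # let accelerate place the weights on available devices
    )
    model.eval()  # inference only; the target stays frozen

    # Configure the tokens that control padding and generation termination.
    pad_id, eos_id = resolve_special_tokens(tokenizer.pad_token_id, tokenizer.eos_token_id)
    model.generation_config.pad_token_id = pad_id
    model.generation_config.eos_token_id = eos_id
    return model, tokenizer
```

In this sketch load_target would be called once at startup and the returned objects reused for every generation call.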

The inference-side models (in longspec/test/) are distinct from training-side models: they include generation methods (tree_spec_generate, spec_generate, vanilla_generate) and KV cache management that the training models do not have.

Usage

Use this principle when setting up GLIDE inference for benchmarking or deployment. The model is loaded once and reused across inference calls. The model is selected by name through a name-to-path registry:

  • "vicuna7b": Llama-based 7B model
  • "llama8b": Llama 3.1 8B Instruct
  • "qwq": Qwen2-based QwQ-32B-Preview (for math reasoning)
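
The registry above could be sketched as a plain dictionary lookup. ModelEntry, MODEL_REGISTRY, and resolve_model are hypothetical names introduced for illustration. The only path taken from this page is the published draft checkpoint for "qwq"; every string in angle brackets is a placeholder, not a real path.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelEntry:
    family: str       # architecture family: "llama" or "qwen2"
    target_path: str  # where the target LLM is loaded from
    draft_path: str   # where the trained GLIDE draft layer is loaded from


# Hypothetical registry; angle-bracket strings are placeholders to fill in.
MODEL_REGISTRY: Dict[str, ModelEntry] = {
    "vicuna7b": ModelEntry("llama", "<target-path-for-vicuna7b>", "<draft-path-for-vicuna7b>"),
    "llama8b": ModelEntry("llama", "<target-path-for-llama8b>", "<draft-path-for-llama8b>"),
    "qwq": ModelEntry("qwen2", "<target-path-for-qwq>", "sail/longspec-QwQ-32B-Preview"),
}


def resolve_model(name: str) -> ModelEntry:
    """Map a short model name to its registry entry, failing loudly on unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"unknown model name {name!r}; expected one of: {known}") from None
```

Keeping the family next to the paths lets the caller pick the Llama or Qwen2 model class without parsing the path string.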

Theoretical Basis

The inference model assembly follows the same architectural principle as training initialization (a frozen target LLM with an attached draft layer), but adds three inference-time concerns:

  1. KV cache allocation: Pre-sized contiguous memory buffers for key-value states at each transformer layer
  2. Attention mode dispatch: The model must support multiple execution modes (prefill, decoding, tree_decoding) via the exec_type parameter
  3. Flash Attention integration: The inference path uses Flash Attention 2 with KV cache for efficient prefill and incremental decoding
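
To make the first two points concrete, here is a dependency-free sketch of a pre-sized per-layer cache and an exec_type dispatch. It is not the LongSpec implementation: the real buffers are GPU tensors consumed by Flash Attention, while this toy uses Python lists. LayerKVCache, update_cache, and kv_cache_bytes are hypothetical names, and the cache semantics assigned to each mode (prefill resets, decoding appends one position, tree_decoding appends several provisional positions that the caller later truncates) are one plausible reading, stated as an assumption.

```python
from typing import List, Sequence

EXEC_TYPES = ("prefill", "decoding", "tree_decoding")


def kv_cache_bytes(num_layers: int, max_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for keys plus values across all layers (float16 by default)."""
    return 2 * num_layers * max_len * num_kv_heads * head_dim * bytes_per_elem


class LayerKVCache:
    """Fixed-capacity key/value buffer for a single transformer layer."""

    def __init__(self, max_len: int) -> None:
        # Allocate every slot up front; nothing grows during generation.
        self.keys: List[object] = [None] * max_len
        self.values: List[object] = [None] * max_len
        self.length = 0  # number of valid cached positions

    def write(self, keys: Sequence[object], values: Sequence[object]) -> None:
        """Copy new positions into the slots right after the valid prefix."""
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        end = self.length + len(keys)
        if end > len(self.keys):
            raise ValueError("KV cache capacity exceeded")
        self.keys[self.length:end] = keys
        self.values[self.length:end] = values
        self.length = end

    def truncate(self, new_length: int) -> None:
        """Drop trailing positions, e.g. draft tokens rejected by the target."""
        if not 0 <= new_length <= self.length:
            raise ValueError("new_length out of range")
        self.length = new_length


def update_cache(cache: LayerKVCache, exec_type: str,
                 keys: Sequence[object], values: Sequence[object]) -> None:
    """Dispatch the cache update on the execution mode (assumed semantics)."""
    if exec_type not in EXEC_TYPES:
        raise ValueError(f"unknown exec_type: {exec_type!r}")
    if exec_type == "prefill":
        cache.length = 0  # a new prompt overwrites any previous contents
    elif exec_type == "decoding" and len(keys) != 1:
        raise ValueError("decoding appends exactly one position")
    # tree_decoding falls through: candidates are written provisionally and
    # the caller truncates the cache once verification decides what to keep.
    cache.write(keys, values)
```

As an illustration of the sizing helper, a model with 32 layers, 8 key-value heads, and head dimension 128 (illustrative values, not taken from this page) needs kv_cache_bytes(32, 32768, 8, 128) = 4 GiB of float16 storage for a 32,768-token cache.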
