Implementation:Sail sg LongSpec Glide Inference Init
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Model_Architecture |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for loading trained GLIDE inference models: the target LLM, its draft layer, and the model registry used for LongBench and AIME evaluation.
Description
The inference-side LlamaGlide and Qwen2Glide classes (in longspec/test/) take the same constructor arguments as their training-side counterparts but add inference-specific methods. The loading process uses a model registry (the model_names dict) that maps short model names to the HuggingFace paths of the target and draft checkpoints.
The initialization includes:
- AutoConfig.from_pretrained() to load target model architecture config
- AutoTokenizer.from_pretrained() to load the tokenizer
- LlamaGlide/Qwen2Glide constructor to load target + draft models
- Model-specific token configuration (e.g., QwQ requires specific pad/eos tokens)
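The constructor step in the list above implies choosing between LlamaGlide and Qwen2Glide for a given target model. One way to make that choice explicit is to dispatch on the `model_type` field that Hugging Face configs expose. The following is a minimal sketch under that assumption: the mapping and the `pick_glide_class_name` helper are hypothetical and not part of LongSpec, and a `SimpleNamespace` stands in for the object returned by `AutoConfig.from_pretrained()` so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# Assumed mapping from a config's `model_type` to the GLIDE wrapper class name.
# Llama-family checkpoints (including Vicuna) report "llama"; Qwen2-family
# checkpoints (including QwQ-32B-Preview) report "qwen2".
GLIDE_CLASS_BY_MODEL_TYPE = {
    "llama": "LlamaGlide",
    "qwen2": "Qwen2Glide",
}


def pick_glide_class_name(config) -> str:
    """Hypothetical helper: name of the GLIDE wrapper matching a config."""
    model_type = getattr(config, "model_type", None)
    try:
        return GLIDE_CLASS_BY_MODEL_TYPE[model_type]
    except KeyError:
        raise ValueError(f"no GLIDE wrapper for model_type={model_type!r}") from None


# Stand-in for AutoConfig.from_pretrained("Qwen/QwQ-32B-Preview").
demo_config = SimpleNamespace(model_type="qwen2")
print(pick_glide_class_name(demo_config))
```

In a real script the returned name would be replaced by the class itself; strings are used here only to keep the sketch free of heavyweight imports.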
Usage
Used at the beginning of inference scripts (inference_long-bench.py, inference_qwq.py) to set up the model before running evaluation.
Code Reference
Source Location
- Repository: LongSpec
- File (Llama): longspec/test/llama_glide.py
  - Lines: L472-490
- File (Qwen2): longspec/test/qwen2_glide.py
  - Lines: L475-493
- File (LongBench loading): longspec/test/inference_long-bench.py
  - Lines: L82-112
- File (QwQ loading): longspec/test/inference_qwq.py
  - Lines: L31-46
Signature
# Model classes (same __init__ as training side):
class LlamaGlide(LlamaForCausalLM):
    def __init__(
        self,
        config: LlamaConfig,
        target_model_path: str,
        glide_path: Optional[str] = None,
    ) -> None:
        """
        Args:
            config: LlamaConfig from AutoConfig.from_pretrained()
            target_model_path: HuggingFace path to target LLM
            glide_path: Path to trained GLIDE draft layer weights
        """

class Qwen2Glide(Qwen2ForCausalLM):
    def __init__(
        self,
        config: Qwen2Config,
        target_model_path: str,
        glide_path: Optional[str] = None,
    ) -> None:
        """Same interface as LlamaGlide for Qwen2 architecture."""

# Model registry (inference_long-bench.py):
model_names = {
    "vicuna7b": {
        "target": "lmsys/vicuna-7b-v1.5",
        "draft": "sail/longspec-vicuna-7b-v1.5",
    },
    "llama8b": {
        "target": "meta-llama/Llama-3.1-8B-Instruct",
        "draft": "sail/longspec-Llama-3.1-8B-Instruct",
    },
    # ... more models
}
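A registry of this shape can be wrapped in a small lookup helper that fails early, with a readable message, when an unknown model name is passed. The sketch below is illustrative only: `resolve_model_paths` is a hypothetical name that does not come from LongSpec, and the dict holds just the two entries shown in the excerpt rather than the full registry.

```python
from typing import Dict, Tuple

# Entries copied from the registry excerpt; the real dict contains more models.
MODEL_NAMES: Dict[str, Dict[str, str]] = {
    "vicuna7b": {
        "target": "lmsys/vicuna-7b-v1.5",
        "draft": "sail/longspec-vicuna-7b-v1.5",
    },
    "llama8b": {
        "target": "meta-llama/Llama-3.1-8B-Instruct",
        "draft": "sail/longspec-Llama-3.1-8B-Instruct",
    },
}


def resolve_model_paths(
    model_name: str,
    registry: Dict[str, Dict[str, str]] = MODEL_NAMES,
) -> Tuple[str, str]:
    """Hypothetical helper: return (target_path, draft_path) for a registry key."""
    try:
        entry = registry[model_name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"unknown model {model_name!r}; known models: {known}") from None
    return entry["target"], entry["draft"]


target, draft = resolve_model_paths("vicuna7b")
```

The returned pair lines up with the `target_model_path` and `glide_path` constructor arguments in the signature.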
Import
from transformers import AutoConfig, AutoTokenizer
from longspec.test.llama_glide import LlamaGlide
from longspec.test.qwen2_glide import Qwen2Glide
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | Key into model_names registry (e.g., "vicuna7b", "qwq") |
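| config | LlamaConfig / Qwen2Config | Yes | Architecture config returned by AutoConfig.from_pretrained() for the target model |
| target_model_path | str | Yes | HuggingFace path to target LLM |
| glide_path | Optional[str] | No (defaults to None) | HuggingFace path or local path to trained GLIDE draft layer; the inference scripts always pass it |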
Outputs
| Name | Type | Description |
|---|---|---|
| model | LlamaGlide / Qwen2Glide | Fully loaded inference model on CUDA with target + draft |
| tokenizer | PreTrainedTokenizer (via AutoTokenizer) | Tokenizer for the target model, with special tokens configured where the model requires it |
Usage Examples
LongBench Model Loading
from transformers import AutoConfig, AutoTokenizer
from longspec.test.llama_glide import LlamaGlide

model_names = {
    "vicuna7b": {
        "target": "lmsys/vicuna-7b-v1.5",
        "draft": "sail/longspec-vicuna-7b-v1.5",
    },
}

# Select model
model_name = "vicuna7b"
target = model_names[model_name]["target"]
draft = model_names[model_name]["draft"]

# Load config and tokenizer
config = AutoConfig.from_pretrained(target)
tokenizer = AutoTokenizer.from_pretrained(target)

# Initialize model
model = LlamaGlide(config, target, draft)
QwQ Model Loading (with Token Configuration)
from transformers import AutoConfig, AutoTokenizer
from longspec.test.qwen2_glide import Qwen2Glide
target = "Qwen/QwQ-32B-Preview"
draft = "sail/longspec-QwQ-32B-Preview"
config = AutoConfig.from_pretrained(target)
config.pad_token_id = 151643 # QwQ-specific pad token
config.eos_token_id = 151645 # QwQ-specific EOS token
tokenizer = AutoTokenizer.from_pretrained(target)
model = Qwen2Glide(config, target, draft)
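When more than one model needs special-token overrides like the QwQ ones above, the IDs can be kept in a table and applied in one place instead of being hard-coded per script. The following is a minimal sketch of that idea, not the way the LongSpec scripts are written: `TOKEN_OVERRIDES`, its `"qwq"` key, and `apply_token_overrides` are hypothetical, and a `SimpleNamespace` stands in for the config returned by `AutoConfig.from_pretrained()`.

```python
from types import SimpleNamespace

# Token IDs copied from the QwQ example; the table layout is an assumption.
TOKEN_OVERRIDES = {
    "qwq": {"pad_token_id": 151643, "eos_token_id": 151645},
}


def apply_token_overrides(config, model_name, overrides=TOKEN_OVERRIDES):
    """Hypothetical helper: set per-model special-token IDs on a config in place.

    Models without an entry in `overrides` are left unchanged.
    """
    for field, value in overrides.get(model_name, {}).items():
        setattr(config, field, value)
    return config


# Stand-in for AutoConfig.from_pretrained("Qwen/QwQ-32B-Preview").
cfg = SimpleNamespace(pad_token_id=None, eos_token_id=None)
apply_token_overrides(cfg, "qwq")
```

Because the helper mutates and returns the same object, it can be called on the config right before the config is passed to the Qwen2Glide constructor.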