Environment:Sail sg LongSpec Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference, GPU_Kernels |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Single-GPU Linux environment with NVIDIA GPU (80GB VRAM recommended), Python 3.8+, PyTorch 2.5.1+, Flash Attention 2.6.3, Triton 3.1.0+, and Liger Kernel for LongSpec speculative decoding inference and evaluation.
Description
This environment provides the inference and evaluation stack for LongSpec speculative decoding. It requires a single high-VRAM GPU (80GB recommended per the README) for loading both the target LLM and the GLIDE draft model simultaneously. The stack uses Flash Attention for efficient prefix attention, a custom Triton kernel for tree-masked attention during verification, and Liger Kernel for fused operations. Models are loaded from HuggingFace Hub, requiring internet access or pre-downloaded weights.
Usage
Use this environment for all speculative decoding inference and benchmark evaluation workflows. This is the mandatory prerequisite for running the Tree_Spec_Generate, Triton_Tree_Attn_Kernel, Tree_Verification_Accept, Glide_Inference_Init, and Benchmark_Data_Loader implementations. Inference runs on a single GPU using `torch.inference_mode()`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows/Mac not tested |
| Python | >= 3.8 | Lower minimum than the training environment |
| Hardware | Single NVIDIA GPU with 80GB VRAM | "It is recommended to test on a single 80GB GPU" per README |
| CUDA | CUDA toolkit compatible with PyTorch 2.5.1+ | Required for flash_attn and triton |
| Disk | Space for target + draft model weights | e.g., QwQ-32B-Preview (~60GB) + draft model (~12GB) |
| Network | Internet access for HuggingFace model download | Or pre-downloaded model weights |
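The disk requirement can be checked before starting a download. The following is a minimal sketch, assuming the approximate sizes quoted in the table (~60GB target, ~12GB draft) and a 10% safety margin chosen here for illustration. The names `required_bytes` and `has_enough_disk` are hypothetical helpers, not part of LongSpec.

```python
import shutil

GB = 10**9  # decimal gigabytes, an assumption about how the sizes above are quoted


def required_bytes(target_gb: float, draft_gb: float, margin: float = 0.10) -> int:
    """Bytes needed for both checkpoints plus a safety margin (assumed 10%)."""
    return int((target_gb + draft_gb) * (1.0 + margin) * GB)


def has_enough_disk(path: str, needed: int) -> bool:
    """True if the filesystem containing `path` has at least `needed` free bytes."""
    return shutil.disk_usage(path).free >= needed


if __name__ == "__main__":
    needed = required_bytes(target_gb=60, draft_gb=12)
    ok = has_enough_disk(".", needed)
    print(f"need ~{needed / GB:.1f} GB free in the current directory; ok = {ok}")
```

Point `path` at the directory that will actually hold the weights (for example the HuggingFace cache location), since that may live on a different filesystem than the working directory.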
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch 2.5.1+)
Python Packages
- `torch` >= 2.5.1
- `transformers` >= 4.46.3
- `flash_attn` == 2.6.3
- `triton` >= 3.1.0
- `liger_kernel` == 0.3.1
- `datasets` == 2.19.1
- `tqdm` == 4.66.5
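After installation, the pins above can be compared against what is actually installed. This is a rough sketch using only the standard library: it assumes the distribution names match the names in the list, and `parse_version` is a deliberately simplified parser (leading numeric release only), not a full PEP 440 implementation. All helper names are illustrative.

```python
from importlib import metadata

# Pins copied from the list above: (distribution name, operator, version).
REQUIREMENTS = [
    ("torch", ">=", "2.5.1"),
    ("transformers", ">=", "4.46.3"),
    ("flash_attn", "==", "2.6.3"),
    ("triton", ">=", "3.1.0"),
    ("liger_kernel", "==", "0.3.1"),
    ("datasets", "==", "2.19.1"),
    ("tqdm", "==", "4.66.5"),
]


def parse_version(text: str) -> tuple:
    """Keep the leading numeric release, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    release = text.split("+")[0]  # drop local build tags such as '+cu124'
    parts = []
    for piece in release.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:  # stop at non-numeric segments such as 'post1' or 'dev0'
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed: str, op: str, required: str) -> bool:
    """Evaluate a '>=' or '==' pin on the simplified version tuples."""
    have, want = parse_version(installed), parse_version(required)
    return have >= want if op == ">=" else have == want


def check_environment() -> list:
    """Return human-readable problems; an empty list means all pins are met."""
    problems = []
    for name, op, required in REQUIREMENTS:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need {op} {required})")
            continue
        if not satisfies(installed, op, required):
            problems.append(f"{name}: found {installed}, need {op} {required}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all pinned packages look fine"]:
        print(line)
```

Because the parser ignores suffixes, a build such as `2.6.3.post1` is treated as equal to `2.6.3`; tighten the comparison if that distinction matters.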
Credentials
- HuggingFace model access: Both target model (e.g., `Qwen/QwQ-32B-Preview`) and draft model (e.g., `sail/longspec-QwQ-32B-Preview`) must be downloadable. May require `HF_TOKEN` for gated models.
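For machines without internet access at inference time, the weights can be fetched into the local cache ahead of time. The sketch below assumes the two example repositories named above and uses `snapshot_download` from `huggingface_hub` (installed as a dependency of `transformers`). `resolve_token` and `predownload` are hypothetical helper names, and whether a token is needed depends on the repositories' access settings.

```python
import os

# Example repositories taken from the credentials note above.
MODEL_REPOS = ("Qwen/QwQ-32B-Preview", "sail/longspec-QwQ-32B-Preview")


def resolve_token(env=None):
    """Return the HF_TOKEN value, or None when it is unset or blank."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


def predownload(repos=MODEL_REPOS, env=None):
    """Fetch each repo into the local HuggingFace cache and return the paths."""
    # Imported lazily so resolve_token works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    token = resolve_token(env)
    return [snapshot_download(repo_id=repo, token=token) for repo in repos]


if __name__ == "__main__":
    print("HF_TOKEN set:", resolve_token() is not None)
```

Calling `predownload()` downloads tens of gigabytes, so it is left out of the `__main__` block on purpose.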
Quick Install
# Clone and install
git clone https://github.com/sail-sg/LongSpec.git
cd LongSpec/longspec/test
# Install inference dependencies
pip install -r requirements.txt
Code Evidence
CUDA device requirement from `triton_tree_attn.py:40-41`:
device = torch.cuda.device_of(q)
num_sms = torch.cuda.get_device_properties(device).multi_processor_count
GPU capability check for kernel tuning from `triton_tree_attn.py:82-111`:
if torch.cuda.get_device_capability() == (8, 0):
    # A100-optimized block sizes
    if D <= 64:
        BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 4, 4
    ...
elif torch.cuda.get_device_capability() == (8, 6):
    # RTX 3090-optimized block sizes
    ...
else:
    BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 1, 4
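Because the check compares the capability tuple for exact equality, only SM 8.0 and SM 8.6 devices reach a tuned branch. The sketch below mirrors just that branch structure so it can be run without a GPU. The per-branch block sizes are elided in the excerpt, so the function returns a label rather than guessing them. `tuning_branch` and `KNOWN_GPUS` are illustrative names, and the capability values are NVIDIA's published figures for those cards.

```python
# Compute capabilities of a few common GPUs (public NVIDIA figures).
KNOWN_GPUS = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}


def tuning_branch(capability):
    """Report which branch of the excerpt a (major, minor) capability would take."""
    if capability == (8, 0):
        return "A100-tuned"
    if capability == (8, 6):
        return "RTX 3090-tuned"
    # Everything else, including newer architectures, takes the final else.
    return "fallback 32x32"


if __name__ == "__main__":
    for name, cap in KNOWN_GPUS.items():
        print(f"{name} (SM {cap[0]}.{cap[1]}): {tuning_branch(cap)}")
```

On a live machine, passing the result of `torch.cuda.get_device_capability()` to `tuning_branch` shows which path the kernel would be expected to take.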
80GB VRAM recommendation from `README.md:102`:
It is recommended to test on a single 80GB GPU; otherwise, unexpected issues
such as insufficient VRAM may occur.
Flash Attention import from `llama_glide.py:11`:
from flash_attn import flash_attn_func, flash_attn_with_kvcache
Triton tree attention import from `llama_glide.py:8`:
from triton_tree_attn import attention as tree_attention
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient VRAM for target + draft models | Use a single 80GB GPU (e.g., A100 80GB). Reduce `max_gen_len` if needed. |
| `ImportError: flash_attn` | Flash Attention not installed | `pip install flash_attn==2.6.3` (requires CUDA toolkit) |
| `AssertionError: Dk in {16, 32, 64, 128}` | Unsupported head dimension in Triton kernel | Ensure model config has standard head_dim (128 for Llama/Qwen2) |
| `AssertionError: H % Hk == 0` | Incompatible GQA head configuration | num_heads must be a multiple of num_kv_heads |
| Model download failure | HuggingFace access denied | Set `HF_TOKEN` environment variable for gated models |
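The two assertion failures in the table can be anticipated from a model's attention configuration before any weights are loaded. This is a minimal sketch of the same two conditions in plain Python; `check_attention_config` is a hypothetical helper, and the example values in the usage note are illustrative rather than read from a specific checkpoint.

```python
# Head dimensions accepted by the kernel assertion quoted in the table above.
SUPPORTED_HEAD_DIMS = {16, 32, 64, 128}


def check_attention_config(num_heads, num_kv_heads, head_dim):
    """Return a list of problems that would trip the two kernel assertions."""
    problems = []
    if head_dim not in SUPPORTED_HEAD_DIMS:
        problems.append(
            f"head_dim={head_dim} not in {sorted(SUPPORTED_HEAD_DIMS)}"
        )
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        problems.append(
            f"num_heads={num_heads} is not a multiple of num_kv_heads={num_kv_heads}"
        )
    return problems


if __name__ == "__main__":
    print(check_attention_config(num_heads=32, num_kv_heads=8, head_dim=128))
    print(check_attention_config(num_heads=30, num_kv_heads=8, head_dim=96))
```

For Llama- and Qwen2-style checkpoints the inputs typically come from `config.json` (`num_attention_heads`, `num_key_value_heads`, and `hidden_size // num_attention_heads` when no explicit `head_dim` is set); treat that mapping as an assumption and confirm it against the model in use.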
Compatibility Notes
- GPU Architecture: The Triton tree attention kernel has tuned configurations for exactly two compute capabilities: SM 8.0 (A100) and SM 8.6 (RTX 3090). Every other GPU, including the RTX 4090 (SM 8.9), falls back to conservative 32x32 block sizes with reduced performance.
- VRAM: Both target and draft models are loaded simultaneously. For QwQ-32B-Preview, this requires approximately 70GB VRAM in float16. GPUs with less VRAM, including consumer cards, are likely to fail with `CUDA out of memory`.
- Batch Size: Inference is single-batch only (`batch_size=1`). The code does not support batched speculative decoding.
- Context Length: Maximum context varies by model (e.g., Vicuna-7B: 16384, LongChat-7B: 32768, Llama-3-8B: 262000).
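The VRAM figure in the notes above can be sanity-checked with simple arithmetic: float16 weights take 2 bytes per parameter. The sketch below assumes a nominal 32e9 parameters for the target (implied by the model name, not measured) and takes the ~12GB draft size from the requirements table. `fp16_weight_gib` and `fits_in_vram` are illustrative helpers, and the estimate ignores KV cache and activation memory unless headroom is passed in.

```python
def fp16_weight_gib(num_params: float) -> float:
    """Approximate weight memory in GiB at 2 bytes per parameter (float16)."""
    return num_params * 2 / 2**30


def fits_in_vram(vram_gib, weight_gib_list, headroom_gib=0.0):
    """True if all weights plus optional headroom fit in the given VRAM."""
    return sum(weight_gib_list) + headroom_gib <= vram_gib


if __name__ == "__main__":
    target = fp16_weight_gib(32e9)  # nominal parameter count (assumption)
    draft = 12.0                    # approximate size from the table above
    print(f"target ~{target:.1f} GiB, total ~{target + draft:.1f} GiB")
    for vram in (24, 48, 80):
        print(f"{vram} GiB card fits: {fits_in_vram(vram, [target, draft])}")
```

Under these assumptions the weights alone come to roughly 70 GiB, which is consistent with the single 80GB GPU recommendation and leaves limited room for long-context KV cache.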