Environment:Sail sg LongSpec Inference Environment

Knowledge Sources	LongSpec, LongSpec README
Domains	Infrastructure, LLM_Inference, GPU_Kernels
Last Updated	2026-02-14 06:00 GMT

Overview

Single-GPU Linux environment with an NVIDIA GPU (80GB VRAM recommended), Python 3.8+, PyTorch 2.5.1+, Flash Attention 2.6.3, Triton 3.1.0+, and Liger Kernel 0.3.1, used for LongSpec speculative decoding inference and evaluation.

Description

This environment provides the inference and evaluation stack for LongSpec speculative decoding. It requires a single high-VRAM GPU (80GB recommended per the README) for loading both the target LLM and the GLIDE draft model simultaneously. The stack uses Flash Attention for efficient prefix attention, a custom Triton kernel for tree-masked attention during verification, and Liger Kernel for fused operations. Models are loaded from HuggingFace Hub, requiring internet access or pre-downloaded weights.

Usage

Use this environment for all speculative decoding inference and benchmark evaluation workflows. This is the mandatory prerequisite for running the Tree_Spec_Generate, Triton_Tree_Attn_Kernel, Tree_Verification_Accept, Glide_Inference_Init, and Benchmark_Data_Loader implementations. Inference runs on a single GPU using `torch.inference_mode()`.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	Windows/Mac not tested
Python	>= 3.8	Lighter requirement than training
Hardware	Single NVIDIA GPU with 80GB VRAM	"It is recommended to test on a single 80GB GPU" per README
CUDA	CUDA toolkit compatible with PyTorch 2.5.1+	Required for flash_attn and triton
Disk	Space for target + draft model weights	e.g., QwQ-32B-Preview (~60GB) + draft model (~12GB)
Network	Internet access for HuggingFace model download	Or pre-downloaded model weights
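
The requirements in the table can be checked before any model is loaded. The following is a minimal sketch, not part of LongSpec: `check_environment` and `detect_vram_bytes` are hypothetical helper names, and the default threshold of 75 GiB is an assumption chosen because an 80GB-class card may report slightly less than 80 GiB through PyTorch.

```python
import platform
import sys

GIB = 1024 ** 3


def check_environment(python_version, system, vram_bytes, min_vram_gib=75):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(python_version[:2]) < (3, 8):
        problems.append("Python >= 3.8 is required")
    if system != "Linux":
        problems.append(f"{system} is untested; Linux is recommended")
    if vram_bytes is None:
        problems.append("no CUDA GPU detected")
    elif vram_bytes < min_vram_gib * GIB:
        problems.append(
            f"GPU has {vram_bytes / GIB:.1f} GiB VRAM; "
            f"at least {min_vram_gib} GiB is recommended"
        )
    return problems


def detect_vram_bytes():
    """Query GPU 0 through PyTorch; returns None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    for issue in check_environment(sys.version_info, platform.system(), detect_vram_bytes()):
        print("WARNING:", issue)
```

Keeping the checks in a pure function that takes plain values makes them testable on machines without a GPU; only `detect_vram_bytes` touches PyTorch.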

Dependencies

System Packages

NVIDIA CUDA Toolkit (compatible with PyTorch 2.5.1+)

Python Packages

`torch` >= 2.5.1
`transformers` >= 4.46.3
`flash_attn` == 2.6.3
`triton` >= 3.1.0
`liger_kernel` == 0.3.1
`datasets` == 2.19.1
`tqdm` == 4.66.5
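
Because several of these packages are pinned exactly, it can help to compare the installed versions against the list above. The sketch below uses only the standard library; `PINS`, `parse_version`, `satisfies`, and `check_pins` are hypothetical names, and the version parser is a deliberate simplification that ignores local and pre-release suffixes (for example, `2.5.1+cu121` is treated as `2.5.1`). It also assumes the distribution names match the names listed here.

```python
from importlib import metadata

# Pins copied from the package list above: (distribution, operator, version).
PINS = [
    ("torch", ">=", "2.5.1"),
    ("transformers", ">=", "4.46.3"),
    ("flash_attn", "==", "2.6.3"),
    ("triton", ">=", "3.1.0"),
    ("liger_kernel", "==", "0.3.1"),
    ("datasets", "==", "2.19.1"),
    ("tqdm", "==", "4.66.5"),
]


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1); suffixes after the numeric part are dropped."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings under '==' or '>='."""
    have, want = parse_version(installed), parse_version(required)
    return have == want if op == "==" else have >= want


def check_pins(pins=PINS):
    """Yield (name, installed_version_or_None, ok) for each pinned distribution."""
    for name, op, required in pins:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            yield name, None, False
            continue
        yield name, installed, satisfies(installed, op, required)


if __name__ == "__main__":
    for name, installed, ok in check_pins():
        print(f"{'ok  ' if ok else 'FAIL'} {name}: {installed or 'not installed'}")
```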

Credentials

HuggingFace model access: Both target model (e.g., `Qwen/QwQ-32B-Preview`) and draft model (e.g., `sail/longspec-QwQ-32B-Preview`) must be downloadable. May require `HF_TOKEN` for gated models.
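
For machines without network access at run time, the weights can be fetched ahead of time. The following is a sketch under the assumption that `huggingface_hub` is installed (it is a dependency of `transformers`); `resolve_hf_token` and `prefetch` are hypothetical helpers, not part of LongSpec.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or None to fall back to anonymous access."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def prefetch(repo_ids, token=None):
    """Download each repository into the local HuggingFace cache and return the paths."""
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return [snapshot_download(repo_id=repo_id, token=token) for repo_id in repo_ids]


# Example call, left commented out because it downloads tens of gigabytes:
# prefetch(
#     ["Qwen/QwQ-32B-Preview", "sail/longspec-QwQ-32B-Preview"],
#     token=resolve_hf_token(),
# )
```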

Quick Install

# Clone and install
git clone https://github.com/sail-sg/LongSpec.git
cd LongSpec/longspec/test

# Install inference dependencies
pip install -r requirements.txt

Code Evidence

CUDA device requirement from `triton_tree_attn.py:40-41`:

device = torch.cuda.device_of(q)
num_sms = torch.cuda.get_device_properties(device).multi_processor_count

GPU capability check for kernel tuning from `triton_tree_attn.py:82-111`:

if torch.cuda.get_device_capability() == (8, 0):
    # A100-optimized block sizes
    if D <= 64:
        BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 4, 4
    ...
elif torch.cuda.get_device_capability() == (8, 6):
    # RTX 3090-optimized block sizes
    ...
else:
    BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 1, 4
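
Because the check above compares the capability tuple for exact equality, only two architectures receive tuned settings. The sketch below mirrors just the branch structure to show which path a given GPU would take; `tuning_branch` is a hypothetical helper, and the elided block sizes are intentionally not reproduced.

```python
def tuning_branch(capability):
    """Name the branch of the capability check that a (major, minor) tuple selects."""
    if capability == (8, 0):
        return "sm80"  # A100-optimized block sizes
    if capability == (8, 6):
        return "sm86"  # RTX 3090-optimized block sizes
    return "fallback"  # BLOCK_M = BLOCK_N = 32, num_stages = 1, num_warps = 4


if __name__ == "__main__":
    examples = [
        ("A100", (8, 0)),
        ("RTX 3090", (8, 6)),
        ("RTX 4090", (8, 9)),
        ("H100", (9, 0)),
    ]
    for name, cap in examples:
        print(f"{name}: {tuning_branch(cap)}")
```

On a live system the same function could be called as `tuning_branch(torch.cuda.get_device_capability())`.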

80GB VRAM recommendation from `README.md:102`:

It is recommended to test on a single 80GB GPU; otherwise, unexpected issues
such as insufficient VRAM may occur.

Flash Attention import from `llama_glide.py:11`:

from flash_attn import flash_attn_func, flash_attn_with_kvcache

Triton tree attention import from `llama_glide.py:8`:

from triton_tree_attn import attention as tree_attention

Common Errors

Error Message	Cause	Solution
`CUDA out of memory`	Insufficient VRAM for target + draft models	Use a single 80GB GPU (e.g., A100 80GB). Reduce `max_gen_len` if needed.
`ImportError: flash_attn`	Flash Attention not installed	`pip install flash_attn==2.6.3` (requires CUDA toolkit)
`AssertionError: Dk in {16, 32, 64, 128}`	Unsupported head dimension in Triton kernel	Ensure model config has standard head_dim (128 for Llama/Qwen2)
`AssertionError: H % Hk == 0`	Incompatible GQA head configuration	num_heads must be a multiple of num_kv_heads
Model download failure	HuggingFace access denied	Set `HF_TOKEN` environment variable for gated models
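
The two assertion failures in the table can be anticipated from the model configuration before the kernel runs. This is a minimal sketch that restates the two conditions as a plain function; `check_attention_config` is a hypothetical helper, and the values in the example (40 query heads, 8 KV heads, head dimension 128) are illustrative assumptions rather than figures taken from the LongSpec source.

```python
SUPPORTED_HEAD_DIMS = {16, 32, 64, 128}


def check_attention_config(num_heads, num_kv_heads, head_dim):
    """Return a list of problems that would trip the kernel's assertions."""
    problems = []
    if head_dim not in SUPPORTED_HEAD_DIMS:
        problems.append(
            f"head_dim={head_dim} is not one of {sorted(SUPPORTED_HEAD_DIMS)}"
        )
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        problems.append(
            f"num_heads={num_heads} is not a multiple of num_kv_heads={num_kv_heads}"
        )
    return problems


if __name__ == "__main__":
    print(check_attention_config(40, 8, 128))  # compatible, so no problems are listed
    print(check_attention_config(32, 5, 96))   # both checks fail

# With transformers installed, the inputs could be read from a model config, e.g.:
# cfg = AutoConfig.from_pretrained("Qwen/QwQ-32B-Preview")
# head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
# check_attention_config(cfg.num_attention_heads, cfg.num_key_value_heads, head_dim)
```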

Compatibility Notes

GPU Architecture: The Triton tree attention kernel has tuned configurations for A100 (SM 8.0) and RTX 3090 (SM 8.6). The check is an exact match on compute capability, so other GPUs, including RTX 4090 (SM 8.9) and H100 (SM 9.0), fall back to conservative 32x32 block sizes with reduced performance.
VRAM: Both target and draft models are loaded simultaneously. For QwQ-32B-Preview, this requires approximately 70GB of VRAM in float16, which leaves little headroom on an 80GB card for the KV cache and activations. GPUs with less VRAM are likely to run out of memory.
Batch Size: Inference is single-batch only (`batch_size=1`). The code does not support batched speculative decoding.
Context Length: Maximum context varies by model (e.g., Vicuna-7B: 16384, LongChat-7B: 32768, Llama-3-8B: 262000).
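
The VRAM and context-length notes can be made concrete with some back-of-the-envelope arithmetic. The sketch below is an estimate only: the parameter count (32.5 billion) and the attention shape (64 layers, 8 KV heads, head dimension 128) are assumed values for a 32B-class target model, not figures from the LongSpec source, and the helpers are hypothetical.

```python
def weight_gb(num_params, bytes_per_param=2):
    """Weight footprint in decimal GB; float16 and bfloat16 use 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Assumed 32.5e9 parameters in float16.
    print(weight_gb(32.5e9))  # 65.0
    # Assumed 64 layers, 8 KV heads, head_dim 128, float16 cache, 32768-token context.
    print(kv_cache_bytes(32768, 64, 8, 128) / 2**30)  # 8.0 (GiB)
```

Under these assumptions the target weights alone take about 65 GB, and a 32768-token context adds roughly 8 GiB of KV cache, which is consistent with the recommendation to use an 80GB GPU.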
