

Environment:Sail sg LongSpec Inference Environment

From Leeroopedia
Revision as of 18:35, 16 February 2026 by Admin (auto-imported from environments/Sail_sg_LongSpec_Inference_Environment.md)
Knowledge Sources
Domains: Infrastructure, LLM_Inference, GPU_Kernels
Last Updated: 2026-02-14 06:00 GMT

Overview

Single-GPU Linux environment with NVIDIA GPU (80GB VRAM recommended), Python 3.8+, PyTorch 2.5.1+, Flash Attention 2.6.3, Triton 3.1.0+, and Liger Kernel for LongSpec speculative decoding inference and evaluation.

Description

This environment provides the inference and evaluation stack for LongSpec speculative decoding. It requires a single high-VRAM GPU (80GB recommended per the README) for loading both the target LLM and the GLIDE draft model simultaneously. The stack uses Flash Attention for efficient prefix attention, a custom Triton kernel for tree-masked attention during verification, and Liger Kernel for fused operations. Models are loaded from HuggingFace Hub, requiring internet access or pre-downloaded weights.

Usage

Use this environment for all speculative decoding inference and benchmark evaluation workflows. It is a prerequisite for running the Tree_Spec_Generate, Triton_Tree_Attn_Kernel, Tree_Verification_Accept, Glide_Inference_Init, and Benchmark_Data_Loader implementations. Inference runs on a single GPU under `torch.inference_mode()`.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu recommended) | Windows/Mac not tested |
| Python | >= 3.8 | Lighter requirement than training |
| Hardware | Single NVIDIA GPU with 80GB VRAM | "It is recommended to test on a single 80GB GPU" per README |
| CUDA | CUDA toolkit compatible with PyTorch 2.5.1+ | Required for flash_attn and triton |
| Disk | Space for target + draft model weights | e.g., QwQ-32B-Preview (~60GB) + draft model (~12GB) |
| Network | Internet access for HuggingFace model download | Or pre-downloaded model weights |

Dependencies

System Packages

  • NVIDIA CUDA Toolkit (compatible with PyTorch 2.5.1+)

Python Packages

  • `torch` >= 2.5.1
  • `transformers` >= 4.46.3
  • `flash_attn` == 2.6.3
  • `triton` >= 3.1.0
  • `liger_kernel` == 0.3.1
  • `datasets` == 2.19.1
  • `tqdm` == 4.66.5
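The pins above can be checked against an existing Python environment before running anything heavy. The following is a minimal sketch, not part of the LongSpec repository: the helper names (`parse_version`, `satisfies`, `check_pins`) are hypothetical, and the version parser is deliberately simplistic (it ignores local suffixes such as `+cu121` and pre-release tags). It also assumes the distribution names match the names listed above, which may not hold on every setup.

```python
from importlib import metadata

# Pins copied from the list above: (distribution name, operator, version).
PINS = [
    ("torch", ">=", "2.5.1"),
    ("transformers", ">=", "4.46.3"),
    ("flash_attn", "==", "2.6.3"),
    ("triton", ">=", "3.1.0"),
    ("liger_kernel", "==", "0.3.1"),
    ("datasets", "==", "2.19.1"),
    ("tqdm", "==", "4.66.5"),
]


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1); suffixes after the numbers are ignored."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings under '==' or '>='."""
    have, want = parse_version(installed), parse_version(required)
    return have == want if op == "==" else have >= want


def check_pins(pins=PINS, get_version=metadata.version):
    """Return a list of problems; an empty list means every pin holds."""
    problems = []
    for name, op, required in pins:
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need {op} {required})")
            continue
        if not satisfies(installed, op, required):
            problems.append(f"{name}: found {installed}, need {op} {required}")
    return problems
```

Calling `check_pins()` with no arguments inspects the current interpreter's installed packages; the `get_version` parameter exists only so the logic can be exercised without those packages present.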

Credentials

  • HuggingFace model access: Both target model (e.g., `Qwen/QwQ-32B-Preview`) and draft model (e.g., `sail/longspec-QwQ-32B-Preview`) must be downloadable. May require `HF_TOKEN` for gated models.
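For machines without network access at inference time, the weights can be fetched into the local HuggingFace cache ahead of time. The sketch below is an assumption about how one might do this, not something the LongSpec repository provides: `prefetch` is a hypothetical helper that wraps `huggingface_hub.snapshot_download` and falls back to the `HF_TOKEN` environment variable when no token is passed.

```python
import os


def prefetch(repo_ids, token=None, download=None):
    """Download each repo into the local cache and return {repo_id: local_path}."""
    if download is None:
        # Imported lazily so the helper can be exercised without the library installed.
        from huggingface_hub import snapshot_download as download
    token = token or os.environ.get("HF_TOKEN")
    return {repo: download(repo_id=repo, token=token) for repo in repo_ids}


# Example call (commented out because it downloads tens of gigabytes):
# prefetch(["Qwen/QwQ-32B-Preview", "sail/longspec-QwQ-32B-Preview"])
```

The `download` parameter is only there to allow substituting a stub; in normal use it is left at its default.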

Quick Install

# Clone and install
git clone https://github.com/sail-sg/LongSpec.git
cd LongSpec/longspec/test

# Install inference dependencies
pip install -r requirements.txt

Code Evidence

CUDA device requirement from `triton_tree_attn.py:40-41`:

device = torch.cuda.device_of(q)
num_sms = torch.cuda.get_device_properties(device).multi_processor_count

GPU capability check for kernel tuning from `triton_tree_attn.py:82-111`:

if torch.cuda.get_device_capability() == (8, 0):
    # A100-optimized block sizes
    if D <= 64:
        BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 4, 4
    ...
elif torch.cuda.get_device_capability() == (8, 6):
    # RTX 3090-optimized block sizes
    ...
else:
    BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 1, 4
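
Because the branches quoted above compare the capability tuple for exact equality, any GPU other than compute capability (8, 0) or (8, 6) takes the fallback path. The sketch below mirrors only that branch structure, with no block sizes filled in; `tuning_tier` and the tier labels are hypothetical names used for illustration, and the capability values are NVIDIA's published figures for each card.

```python
def tuning_tier(capability):
    """Mirror the branch structure quoted above for a (major, minor) tuple."""
    if capability == (8, 0):
        return "a100-tuned"
    if capability == (8, 6):
        return "rtx3090-tuned"
    return "fallback-32x32"


# Published compute capabilities for a few common GPUs.
EXAMPLES = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

for gpu, capability in EXAMPLES.items():
    print(f"{gpu}: {tuning_tier(capability)}")
```

On a real machine, passing `torch.cuda.get_device_capability()` to `tuning_tier` would show which branch the kernel is expected to take.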

80GB VRAM recommendation from `README.md:102`:

It is recommended to test on a single 80GB GPU; otherwise, unexpected issues
such as insufficient VRAM may occur.

Flash Attention import from `llama_glide.py:11`:

from flash_attn import flash_attn_func, flash_attn_with_kvcache

Triton tree attention import from `llama_glide.py:8`:

from triton_tree_attn import attention as tree_attention

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `CUDA out of memory` | Insufficient VRAM for target + draft models | Use a single 80GB GPU (A100). Reduce `max_gen_len` if needed. |
| `ImportError: flash_attn` | Flash Attention not installed | `pip install flash_attn==2.6.3` (requires CUDA toolkit) |
| `AssertionError: Dk in {16, 32, 64, 128}` | Unsupported head dimension in Triton kernel | Ensure model config has standard head_dim (128 for Llama/Qwen2) |
| `AssertionError: H % Hk == 0` | Incompatible GQA head configuration | num_heads must be a multiple of num_kv_heads |
| Model download failure | HuggingFace access denied | Set `HF_TOKEN` environment variable for gated models |
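
The two assertion failures listed above can be anticipated from a model's configuration before any weights are loaded. This is a minimal sketch under stated assumptions: `check_attention_shape` is a hypothetical helper, not a LongSpec function, and the supported head dimensions are taken from the assertion message above.

```python
# Head dimensions accepted by the kernel, per the assertion message above.
SUPPORTED_HEAD_DIMS = {16, 32, 64, 128}


def check_attention_shape(num_heads, num_kv_heads, head_dim):
    """Return a list of problems that would trip the kernel's assertions."""
    problems = []
    if head_dim not in SUPPORTED_HEAD_DIMS:
        problems.append(f"head_dim {head_dim} not in {sorted(SUPPORTED_HEAD_DIMS)}")
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        problems.append(
            f"num_heads {num_heads} is not a multiple of num_kv_heads {num_kv_heads}"
        )
    return problems


# Illustrative values: 32 query heads, 8 KV heads, head_dim 128.
print(check_attention_shape(32, 8, 128))
# A deliberately incompatible shape, to show both checks firing.
print(check_attention_shape(32, 6, 96))
```

With a `transformers` config object for a Llama- or Qwen2-style model, the inputs would typically come from `num_attention_heads`, `num_key_value_heads`, and `hidden_size // num_attention_heads`.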

Compatibility Notes

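  • GPU Architecture: The Triton tree attention kernel has tuned configurations for exactly two compute capabilities: (8, 0), i.e. A100, and (8, 6), i.e. RTX 3090-class GPUs. Because the check is an equality test, every other GPU, including the RTX 4090 (compute capability 8.9), falls back to conservative 32x32 block sizes with reduced performance.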
  • VRAM: Both target and draft models are loaded simultaneously. For QwQ-32B-Preview, this requires approximately 70GB VRAM in float16. Consumer GPUs with less VRAM may fail.
  • Batch Size: Inference is single-batch only (`batch_size=1`). The code does not support batched speculative decoding.
  • Context Length: Maximum context varies by model (e.g., Vicuna-7B: 16384, LongChat-7B: 32768, Llama-3-8B: 262000).
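
The VRAM figure above can be sanity-checked with simple arithmetic: float16 stores each parameter in 2 bytes, so the weights alone take roughly twice the parameter count in bytes. The sketch below assumes a round 32 billion parameters for the target model, which is an approximation rather than an exact count; `weight_gigabytes` is a hypothetical helper, and it covers weights only, not the KV cache or activations.

```python
def weight_gigabytes(num_params, bytes_per_param=2):
    """Weights-only footprint in decimal GB; float16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 32 billion parameters for the target model.
target_gb = weight_gigabytes(32e9)
print(f"target weights in float16: ~{target_gb:.0f} GB")

# Remaining room on an 80GB card if both models take the ~70GB quoted above.
headroom_gb = 80 - 70
print(f"headroom for KV cache and activations: ~{headroom_gb} GB")
```

The small headroom is why long contexts or a large `max_gen_len` can still trigger out-of-memory errors on an 80GB card.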
