Environment:Sail_sg_LongSpec_Training_Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing, Deep_Learning |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Linux environment with 8x NVIDIA GPUs (80GB VRAM recommended), Python 3.12+, PyTorch 2.6.0, DeepSpeed, Flash Attention 2.6.3, Triton 3.2.0, and CUDA toolkit for distributed GLIDE draft model training.
Description
This environment provides the full distributed training stack for LongSpec GLIDE draft models. It requires a multi-GPU setup (8x NVIDIA A100 80GB recommended) running DeepSpeed with ZeRO optimization stages 1-3. The stack includes Flash Attention 2 for efficient attention computation, Triton for custom tree attention kernels, FairScale for model parallelism, PEFT for parameter-efficient fine-tuning, and Liger Kernel for fused cross-entropy loss. Training uses FP16 mixed precision with BF16 support, and requires WandB for experiment tracking.
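The ZeRO stages and mixed-precision flags described above are driven by a DeepSpeed configuration. A minimal sketch of such a config as a Python dict follows; the exact keys and values LongSpec ships are not shown in this page, so the stage choice, precision flags, and offload setting here are illustrative assumptions:

```python
# Illustrative DeepSpeed ZeRO-3 config (assumption: the actual LongSpec configs may differ).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},   # FP16 mixed precision, per the description
    "bf16": {"enabled": False},  # BF16 is also supported on A100-class GPUs
    "zero_optimization": {
        "stage": 3,                               # stages 1-3 are supported
        "offload_optimizer": {"device": "cpu"},   # optional; eases VRAM pressure
    },
}
```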
Usage
Use this environment for all GLIDE draft model training workflows. This is the mandatory prerequisite for running the DeepSpeed_Train_Loop, Qwen2Glide_Init, MultiMappingDataset, Save_Model_Extract, Hydra_YAML_Composition, and Stage_Progression_Config implementations. Training is launched via DeepSpeed on 8 GPUs using the train.sh script.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows/Mac not supported |
| Python | >= 3.12 | Per train/README.md |
| Hardware | 8x NVIDIA GPU with 80GB VRAM | A100 80GB recommended; training config names reference "A100" |
| CUDA | CUDA toolkit compatible with PyTorch 2.6.0 | Required for flash_attn, triton, and DeepSpeed compilation |
| Disk | SSD with sufficient space for model weights and checkpoints | QwQ-32B-Preview weights (~60GB) plus checkpoint storage |
| Network | NCCL-compatible interconnect for multi-GPU | InfiniBand or NVLink recommended for 8-GPU training |
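The table's hard floors can be checked before launch. A hedged preflight sketch, with thresholds copied from the table (GPU count and VRAM must be supplied by the caller, e.g. from `nvidia-smi`):

```python
import sys

# Floors from the System Requirements table.
MIN_PYTHON = (3, 12)
MIN_GPUS = 8
MIN_VRAM_GB = 80

def preflight(python_version=sys.version_info, gpu_count=0, vram_gb=0):
    """Return a list of unmet requirements (empty when the host qualifies)."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(f"python >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]} required")
    if gpu_count < MIN_GPUS:
        problems.append(f"{MIN_GPUS} GPUs required, found {gpu_count}")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"{MIN_VRAM_GB}GB VRAM per GPU recommended, found {vram_gb}GB")
    return problems
```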
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch 2.6.0)
- NCCL (for distributed communication)
- `git-lfs` (for downloading model weights from HuggingFace)
Python Packages
- `torch` == 2.6.0
- `transformers` == 4.51.1
- `deepspeed` (latest compatible)
- `flash_attn` == 2.6.3
- `triton` == 3.2.0
- `fairscale` == 0.4.13
- `accelerate` == 1.0.1
- `apex` == 0.9.10dev
- `bitsandbytes` == 0.45.5
- `datasets` == 2.19.1
- `liger_kernel` == 0.3.1
- `lightseq` == 3.0.1
- `omegaconf` == 2.3.0
- `peft` == 0.13.2
- `wandb` == 0.19.11
- `numpy` == 2.2.6
- `tqdm` == 4.66.5
- `sympy` == 1.13.1
- `regex` == 2024.9.11
- `Requests` == 2.32.3
- `fastchat` == 0.1.0
- `google_auth` == 2.37.0
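Since several of these pins are exact, a quick version audit can catch drift after installation. A sketch, with a subset of pins copied from the list above (`installed` is a name-to-version mapping, e.g. built from `importlib.metadata` in a real check):

```python
# Pins copied from the package list above (subset for illustration).
PINS = {
    "torch": "2.6.0",
    "transformers": "4.51.1",
    "flash_attn": "2.6.3",
    "triton": "3.2.0",
}

def mismatched_pins(installed, pins=PINS):
    """Return (name, found, wanted) tuples for packages missing their exact pin."""
    return [
        (name, installed.get(name), want)
        for name, want in pins.items()
        if installed.get(name) != want
    ]
```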
Credentials
The following credentials must be configured before training:
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking. Authenticate via `wandb login` before launching training.
- HuggingFace model access: Target model weights (e.g., `Qwen/QwQ-32B-Preview`) must be accessible; may require `HF_TOKEN` for gated models.
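A small sketch of a pre-launch credential check matching the two items above (`WANDB_API_KEY` required, `HF_TOKEN` only needed for gated models):

```python
import os

REQUIRED_ENV = ("WANDB_API_KEY",)  # always needed for experiment tracking
OPTIONAL_ENV = ("HF_TOKEN",)       # only for gated HuggingFace models

def missing_credentials(env=os.environ):
    """Return required credential variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]
```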
Quick Install
```shell
# Clone and install
git clone https://github.com/sail-sg/LongSpec.git
cd LongSpec/longspec/train

# Install all training dependencies
pip install -r requirements.txt

# Install DeepSpeed (may need a separate build)
pip install deepspeed

# Authenticate WandB
wandb login YOUR_API_KEY
```
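After installing, an import smoke test can confirm the core stack resolved correctly. A sketch checking importability of the top-level modules named in the install steps above:

```python
import importlib.util

# Top-level modules from the install steps above.
STACK = ("torch", "deepspeed", "flash_attn", "triton", "wandb")

def missing_modules(modules=STACK):
    """Return modules that are not importable in the current environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]
```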
Code Evidence
NCCL configuration from `trainer_base_ds_mul_fs_tp.py:449-452`:
```python
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["WANDB__SERVICE_WAIT"] = "1200"
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```
DeepSpeed distributed initialization from `trainer_base_ds_mul_fs_tp.py:350-353`:
```python
torch.cuda.set_device(cfg.local_rank)
device = str(torch.device("cuda", cfg.local_rank))
deepspeed.init_distributed(dist_backend="nccl", timeout=datetime.timedelta(seconds=9600))
```
TF32 matmul enablement from `trainer_base_ds_mul_fs_tp.py:31`:
```python
torch.backends.cuda.matmul.allow_tf32 = True
```
Training README minimum requirements from `longspec/train/README.md:9-14`:
```text
* python >= 3.12
* pytorch >= 2.6.0
* deepspeed
* wandb
* flash_attn
* Any additional dependencies listed in requirements.txt
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| NCCL timeout / hang | Multi-GPU communication failure | Ensure NCCL is properly installed; check `NCCL_BLOCKING_WAIT=1` is set. The trainer sets a 9600s timeout for DeepSpeed init. |
| `ImportError: flash_attn` | Flash Attention not installed | `pip install flash_attn==2.6.3` (requires CUDA toolkit) |
| `ImportError: triton` | Triton not installed | `pip install triton==3.2.0` |
| ZeRO-3 checkpoint save warning | Attempting to skip DS state save with ZeRO-3 | ZeRO-3 always requires saving checkpoint states since the model is sharded. The trainer auto-corrects this. |
| CUDA OOM during training | Insufficient VRAM for batch size | Reduce `per_gpu_train_batch_size` or increase `gradient_accumulation_steps`. Use ZeRO-3 with optimizer offload. |
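The OOM remedy in the last row trades per-GPU batch size against accumulation steps while keeping the effective batch size constant. The parameter names below mirror the table; the formula is standard data-parallel arithmetic:

```python
def effective_batch_size(per_gpu_train_batch_size, gradient_accumulation_steps, world_size=8):
    """Effective (global) batch size under data parallelism."""
    return per_gpu_train_batch_size * gradient_accumulation_steps * world_size

# Halving the per-GPU batch and doubling accumulation leaves it unchanged:
assert effective_batch_size(2, 8) == effective_batch_size(1, 16) == 128
```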
Compatibility Notes
- GPU Architecture: Training configs reference A100 GPUs. The Triton tree attention kernel has optimized block sizes for A100 (SM 8.0) and RTX 3090 (SM 8.6), with a conservative fallback for other architectures.
- APEX: Required for FusedLAMB optimizer. The code handles both `fused_mixed_precision_lamb` and `fused_lamb` imports with fallback.
- Transformer Engine: Optional FP8 support via `transformer_engine` package (checked via `is_fp8_available()`). Not required for standard training.
- Model Parallelism: FairScale tensor parallelism is supported (`tp_size` config) but defaults to `tp_size=1`. When enabled, model weights must be pre-split into `mp_{rank}-of-{world_size}` subdirectories.
- DeepSpeed Launch: Training must be launched via `deepspeed --include localhost:0,1,2,3,4,5,6,7` for 8-GPU training.
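The pre-split layout in the Model Parallelism note can be sketched as follows; anything beyond the `mp_{rank}-of-{world_size}` pattern quoted above (e.g. zero-padding or path separators) is an assumption:

```python
def tp_shard_dirs(model_dir, tp_size):
    """Subdirectories FairScale tensor parallelism expects when tp_size > 1 (sketch)."""
    return [f"{model_dir}/mp_{rank}-of-{tp_size}" for rank in range(tp_size)]
```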
Related Pages
- Implementation:Sail_sg_LongSpec_DeepSpeed_Train_Loop
- Implementation:Sail_sg_LongSpec_Qwen2Glide_Init
- Implementation:Sail_sg_LongSpec_MultiMappingDataset
- Implementation:Sail_sg_LongSpec_Save_Model_Extract
- Implementation:Sail_sg_LongSpec_Hydra_YAML_Composition
- Implementation:Sail_sg_LongSpec_Stage_Progression_Config