Environment:Sail_sg_LongSpec_Training_Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing, Deep_Learning |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Linux environment with 8x NVIDIA GPUs (80GB VRAM recommended), Python 3.12+, PyTorch 2.6.0, DeepSpeed, Flash Attention 2.6.3, Triton 3.2.0, and CUDA toolkit for distributed GLIDE draft model training.
Description
This environment provides the full distributed training stack for LongSpec GLIDE draft models. It requires a multi-GPU setup (8x NVIDIA A100 80GB recommended) running DeepSpeed with ZeRO optimization stages 1-3. The stack includes Flash Attention 2 for efficient attention computation, Triton for custom tree attention kernels, FairScale for model parallelism, PEFT for parameter-efficient fine-tuning, and Liger Kernel for fused cross-entropy loss. Training uses FP16 mixed precision with BF16 support, and requires WandB for experiment tracking.
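The ZeRO stages and mixed-precision flags described above are driven by a DeepSpeed configuration. A minimal sketch of such a config as a Python dict follows; the exact keys and values LongSpec ships are not shown in this page, so the stage choice, precision flags, and offload setting here are illustrative assumptions:

```python
# Illustrative DeepSpeed ZeRO-3 config (assumption: the actual LongSpec configs may differ).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},   # FP16 mixed precision, per the description
    "bf16": {"enabled": False},  # BF16 is also supported on A100-class GPUs
    "zero_optimization": {
        "stage": 3,                               # stages 1-3 are supported
        "offload_optimizer": {"device": "cpu"},   # optional; eases VRAM pressure
    },
}
```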
Usage
Use this environment for all GLIDE draft model training workflows. This is the mandatory prerequisite for running the DeepSpeed_Train_Loop, Qwen2Glide_Init, MultiMappingDataset, Save_Model_Extract, Hydra_YAML_Composition, and Stage_Progression_Config implementations. Training is launched via DeepSpeed on 8 GPUs using the train.sh script.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows/Mac not supported |
| Python | >= 3.12 | Per train/README.md |
| Hardware | 8x NVIDIA GPU with 80GB VRAM | A100 80GB recommended; training config names reference "A100" |
| CUDA | CUDA toolkit compatible with PyTorch 2.6.0 | Required for flash_attn, triton, and DeepSpeed compilation |
| Disk | SSD with sufficient space for model weights and checkpoints | QwQ-32B-Preview weights (~60GB) plus checkpoint storage |
| Network | NCCL-compatible interconnect for multi-GPU | InfiniBand or NVLink recommended for 8-GPU training |
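The table's hard floors can be checked before launch. A hedged preflight sketch, with thresholds copied from the table (GPU count and VRAM must be supplied by the caller, e.g. from `nvidia-smi`):

```python
import sys

# Floors from the System Requirements table.
MIN_PYTHON = (3, 12)
MIN_GPUS = 8
MIN_VRAM_GB = 80

def preflight(python_version=sys.version_info, gpu_count=0, vram_gb=0):
    """Return a list of unmet requirements (empty when the host qualifies)."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(f"python >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]} required")
    if gpu_count < MIN_GPUS:
        problems.append(f"{MIN_GPUS} GPUs required, found {gpu_count}")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"{MIN_VRAM_GB}GB VRAM per GPU recommended, found {vram_gb}GB")
    return problems
```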
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch 2.6.0)
- NCCL (for distributed communication)
- `git-lfs` (for downloading model weights from HuggingFace)
Python Packages
- `torch` == 2.6.0
- `transformers` == 4.51.1
- `deepspeed` (latest compatible)
- `flash_attn` == 2.6.3
- `triton` == 3.2.0
- `fairscale` == 0.4.13
- `accelerate` == 1.0.1
- `apex` == 0.9.10dev
- `bitsandbytes` == 0.45.5
- `datasets` == 2.19.1
- `liger_kernel` == 0.3.1
- `lightseq` == 3.0.1
- `omegaconf` == 2.3.0
- `peft` == 0.13.2
- `wandb` == 0.19.11
- `numpy` == 2.2.6
- `tqdm` == 4.66.5
- `sympy` == 1.13.1
- `regex` == 2024.9.11
- `Requests` == 2.32.3
- `fastchat` == 0.1.0
- `google_auth` == 2.37.0
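Since several of these pins are exact, a quick version audit can catch drift after installation. A sketch, with a subset of pins copied from the list above (`installed` is a name-to-version mapping, e.g. built from `importlib.metadata` in a real check):

```python
# Pins copied from the package list above (subset for illustration).
PINS = {
    "torch": "2.6.0",
    "transformers": "4.51.1",
    "flash_attn": "2.6.3",
    "triton": "3.2.0",
}

def mismatched_pins(installed, pins=PINS):
    """Return (name, found, wanted) tuples for packages missing their exact pin."""
    return [
        (name, installed.get(name), want)
        for name, want in pins.items()
        if installed.get(name) != want
    ]
```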
Credentials
The following credentials must be configured before training:
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking. Authenticate via `wandb login` before launching training.
- HuggingFace model access: Target model weights (e.g., `Qwen/QwQ-32B-Preview`) must be accessible; may require `HF_TOKEN` for gated models.
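A small sketch of a pre-launch credential check matching the two items above (`WANDB_API_KEY` required, `HF_TOKEN` only needed for gated models):

```python
import os

REQUIRED_ENV = ("WANDB_API_KEY",)  # always needed for experiment tracking
OPTIONAL_ENV = ("HF_TOKEN",)       # only for gated HuggingFace models

def missing_credentials(env=os.environ):
    """Return required credential variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]
```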
Quick Install
```shell
# Clone and install
git clone https://github.com/sail-sg/LongSpec.git
cd LongSpec/longspec/train

# Install all training dependencies
pip install -r requirements.txt

# Install DeepSpeed (may need a separate build)
pip install deepspeed

# Authenticate WandB
wandb login YOUR_API_KEY
```
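After installing, an import smoke test can confirm the core stack resolved correctly. A sketch checking importability of the top-level modules named in the install steps above:

```python
import importlib.util

# Top-level modules from the install steps above.
STACK = ("torch", "deepspeed", "flash_attn", "triton", "wandb")

def missing_modules(modules=STACK):
    """Return modules that are not importable in the current environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]
```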
Code Evidence
NCCL configuration from `trainer_base_ds_mul_fs_tp.py:449-452`:
```python
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["WANDB__SERVICE_WAIT"] = "1200"
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```
DeepSpeed distributed initialization from `trainer_base_ds_mul_fs_tp.py:350-353`:
```python
torch.cuda.set_device(cfg.local_rank)
device = str(torch.device("cuda", cfg.local_rank))
deepspeed.init_distributed(dist_backend="nccl", timeout=datetime.timedelta(seconds=9600))
```
TF32 matmul enablement from `trainer_base_ds_mul_fs_tp.py:31`:
```python
torch.backends.cuda.matmul.allow_tf32 = True
```
Training README minimum requirements from `longspec/train/README.md:9-14`:
```text
* python >= 3.12
* pytorch >= 2.6.0
* deepspeed
* wandb
* flash_attn
* Any additional dependencies listed in requirements.txt
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| NCCL timeout / hang | Multi-GPU communication failure | Ensure NCCL is properly installed; check `NCCL_BLOCKING_WAIT=1` is set. The trainer sets a 9600s timeout for DeepSpeed init. |
| `ImportError: flash_attn` | Flash Attention not installed | `pip install flash_attn==2.6.3` (requires CUDA toolkit) |
| `ImportError: triton` | Triton not installed | `pip install triton==3.2.0` |
| ZeRO-3 checkpoint save warning | Attempting to skip DS state save with ZeRO-3 | ZeRO-3 always requires saving checkpoint states since the model is sharded. The trainer auto-corrects this. |
| CUDA OOM during training | Insufficient VRAM for batch size | Reduce `per_gpu_train_batch_size` or increase `gradient_accumulation_steps`. Use ZeRO-3 with optimizer offload. |
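The OOM remedy in the last row trades per-GPU batch size against accumulation steps while keeping the effective batch size constant. The parameter names below mirror the table; the formula is standard data-parallel arithmetic:

```python
def effective_batch_size(per_gpu_train_batch_size, gradient_accumulation_steps, world_size=8):
    """Effective (global) batch size under data parallelism."""
    return per_gpu_train_batch_size * gradient_accumulation_steps * world_size

# Halving the per-GPU batch and doubling accumulation leaves it unchanged:
assert effective_batch_size(2, 8) == effective_batch_size(1, 16) == 128
```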
Compatibility Notes
- GPU Architecture: Training configs reference A100 GPUs. The Triton tree attention kernel has optimized block sizes for A100 (SM 8.0) and RTX 3090 (SM 8.6), with a conservative fallback for other architectures.
- APEX: Required for FusedLAMB optimizer. The code handles both `fused_mixed_precision_lamb` and `fused_lamb` imports with fallback.
- Transformer Engine: Optional FP8 support via `transformer_engine` package (checked via `is_fp8_available()`). Not required for standard training.
- Model Parallelism: FairScale tensor parallelism is supported (`tp_size` config) but defaults to `tp_size=1`. When enabled, model weights must be pre-split into `mp_{rank}-of-{world_size}` subdirectories.
- DeepSpeed Launch: Training must be launched via `deepspeed --include localhost:0,1,2,3,4,5,6,7` for 8-GPU training.
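The pre-split layout in the Model Parallelism note can be sketched as follows; anything beyond the `mp_{rank}-of-{world_size}` pattern quoted above (e.g. zero-padding or path separators) is an assumption:

```python
def tp_shard_dirs(model_dir, tp_size):
    """Subdirectories FairScale tensor parallelism expects when tp_size > 1 (sketch)."""
    return [f"{model_dir}/mp_{rank}-of-{tp_size}" for rank in range(tp_size)]
```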
Related Pages
- Implementation:Sail_sg_LongSpec_DeepSpeed_Train_Loop
- Implementation:Sail_sg_LongSpec_Qwen2Glide_Init
- Implementation:Sail_sg_LongSpec_MultiMappingDataset
- Implementation:Sail_sg_LongSpec_Save_Model_Extract
- Implementation:Sail_sg_LongSpec_Hydra_YAML_Composition
- Implementation:Sail_sg_LongSpec_Stage_Progression_Config