Environment: Allenai Open Instruct CUDA GPU Training
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
NVIDIA CUDA GPU environment required for all training and GPU-accelerated operations in Open Instruct.
Description
All training workflows (SFT, DPO, GRPO, Reward Modeling) require NVIDIA CUDA GPUs. The repository uses PyTorch with CUDA backend for all tensor operations, distributed training via NCCL, and vLLM for inference during GRPO. Tests are conditionally skipped when CUDA is not available, and platform-specific packages (vLLM, flash-attn, bitsandbytes, liger-kernel) are excluded on macOS.
Usage
Use this environment for any training or GPU-accelerated evaluation task. All training scripts (SFT via finetune.py, DPO via dpo_tune_cache.py, GRPO via grpo_fast.py, Reward Modeling via reward_modeling.py) require CUDA. CPU-only execution is limited to dataset preprocessing, testing non-GPU paths, and utility scripts.
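As a hedged illustration of the CUDA requirement above, a fail-fast check could be written as follows. The helper name `cuda_status` is hypothetical and not part of the repository; the real scripts rely on `torch.cuda` directly.

```python
import importlib.util


def cuda_status() -> str:
    """Report whether GPU training can run here: 'ok', 'no-torch', or 'no-cuda'.

    Hypothetical helper for illustration; finetune.py, dpo_tune_cache.py,
    grpo_fast.py, and reward_modeling.py all assume CUDA is present.
    """
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch  # deferred so the check also works on machines without PyTorch

    return "ok" if torch.cuda.is_available() else "no-cuda"
```

A launcher script could call this before spawning workers and exit with a clear message instead of a late CUDA initialization error.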
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Docker base image uses nvidia/cuda:12.9.0-devel-ubuntu22.04 |
| Hardware | NVIDIA GPU with CUDA support | Minimum 1 GPU; 8x GPU nodes typical for distributed training |
| VRAM | 16GB+ per GPU | A100 (40/80GB) or H100 (80GB) recommended for large models |
| Disk | 50GB+ SSD | For model weights, datasets, and checkpoints |
| Network | InfiniBand or high-speed Ethernet | Required for multi-node training via NCCL |
Dependencies
System Packages
- CUDA Toolkit 12.9
- cuDNN (bundled with CUDA toolkit)
- NVIDIA DOCA OFED drivers (version 2.10.0 for Mellanox networking)
- Mellanox Firmware Tools (MFT version 4.31.0-149)
Python Packages
- `torch` >= 2.9.0, < 2.10 (with CUDA 12.9 backend)
- `deepspeed` >= 0.18.3
- `flash-attn` >= 2.8.3 (Linux x86_64 only)
- `bitsandbytes` >= 0.44.1 (Linux only)
- `liger-kernel` >= 0.5.4 (Linux only)
Credentials
No credentials are required for GPU access itself; the following environment variables control GPU configuration:
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible (default: "0,1,2,3,4,5,6,7")
- `NCCL_CUMEM_ENABLE`: Must be set to "0" for vLLM compatibility
- `NCCL_DEBUG`: Debug level for NCCL (typically "ERROR")
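The NCCL settings above must be in place before NCCL initializes. A minimal sketch, assuming a hypothetical helper name, of applying them without clobbering values the user already exported:

```python
import os


def apply_nccl_defaults(env: dict) -> dict:
    """Set the NCCL variables this environment expects, keeping any values
    already present. Illustrative helper, not repository code."""
    env.setdefault("NCCL_CUMEM_ENABLE", "0")  # required for vLLM compatibility
    env.setdefault("NCCL_DEBUG", "ERROR")     # keep NCCL quiet unless errors occur
    return env


# Typical use: mutate os.environ before importing torch / initializing NCCL.
apply_nccl_defaults(os.environ)
```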
Quick Install
```bash
# Install via uv (recommended)
uv sync

# Or install key GPU packages manually (quote the specifiers so the shell
# does not treat ">=" as output redirection)
pip install "torch>=2.9.0" "deepspeed>=0.18.3" "flash-attn>=2.8.3" "bitsandbytes>=0.44.1" "liger-kernel>=0.5.4"
```
Code Evidence
GPU availability check from `conftest.py:19-20`:
```python
if not torch.cuda.is_available():
    collect_ignore.extend(str(p) for p in pathlib.Path("open_instruct").glob("*_gpu.py"))
```
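The same collection-time filter can be isolated as a pure function for testing without pytest; a sketch, with an illustrative function name:

```python
import pathlib


def gpu_files_to_ignore(root: str, cuda_available: bool) -> list:
    """Return the *_gpu.py files pytest should skip collecting when no CUDA
    device is present. Mirrors the conftest.py pattern; illustrative only."""
    if cuda_available:
        return []
    return sorted(str(p) for p in pathlib.Path(root).glob("*_gpu.py"))
```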
CUDA device setup from `grpo_fast.py:202`:
```python
torch.cuda.set_device(self.local_rank)
self.device = torch.device(self.local_rank)
```
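The rank-to-device mapping shown there can be expressed as a pure function; a hedged sketch (illustrative, not the repository's code) that also covers the CPU fallback:

```python
def device_for_rank(local_rank: int, num_gpus: int) -> str:
    """Map a worker's local rank to a device string.

    Illustrative of the grpo_fast.py pattern, where each worker pins itself
    to the GPU matching its local rank; falls back to CPU when no GPUs exist.
    """
    if num_gpus <= 0:
        return "cpu"
    return f"cuda:{local_rank % num_gpus}"
```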
Platform-conditional dependencies from `pyproject.toml:11,30,34-35`:
```toml
"bitsandbytes>=0.44.1; platform_system != 'Darwin'",
"vllm==0.14.1; platform_system != 'Darwin'",
"flash-attn>=2.8.3; platform_system != 'Darwin' and platform_machine != 'aarch64'",
"liger-kernel>=0.5.4; platform_system != 'Darwin'",
```
Default GPU configuration from `utils.py:1558-1561`:
```python
cuda_visible_devices = [int(x) for x in os.environ.get(
    "CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7"
).split(",")]
```
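The default-and-split logic above can be factored into a standalone helper for testing; the function name here is an assumption, not the repository's:

```python
import os


def visible_gpu_ids(default: str = "0,1,2,3,4,5,6,7") -> list:
    """Parse CUDA_VISIBLE_DEVICES into integer GPU indices, using the same
    8-GPU-node default as utils.py. Illustrative helper."""
    return [int(x) for x in os.environ.get("CUDA_VISIBLE_DEVICES", default).split(",")]
```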
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: 0 active drivers ([])` | DeepSpeed imported on CPU-only machine | Already handled via try/except in utils.py; no action needed |
| `CUDA out of memory` | Insufficient GPU VRAM for model size | Reduce batch size, enable gradient checkpointing, or use DeepSpeed ZeRO-3 |
| Padding-free tests skipped | CUDA not available | Install NVIDIA drivers and CUDA toolkit |
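For the out-of-memory row, the DeepSpeed side of the fix is a ZeRO stage 3 configuration. A minimal sketch follows; the values are placeholders, not the repository's shipped configs:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "none" },
    "overlap_comm": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true }
}
```

Gradient checkpointing is enabled separately on the model side (e.g., via the training script's gradient-checkpointing flag), trading recomputation for activation memory.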
Compatibility Notes
- macOS (Darwin): vLLM, bitsandbytes, flash-attn, and liger-kernel are excluded. Only CPU-based dataset processing and testing is supported.
- ARM Linux (aarch64): flash-attn is not supported; PyTorch is installed from the CUDA 13.0 wheel index instead of 12.9.
- Multi-node: Requires cluster-specific NCCL configuration (InfiniBand for WEKA, FastRack for GCP).