Environment: Allenai Open Instruct CUDA GPU Training
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
NVIDIA CUDA GPU environment required for all training and GPU-accelerated operations in Open Instruct.
Description
All training workflows (SFT, DPO, GRPO, Reward Modeling) require NVIDIA CUDA GPUs. The repository uses PyTorch with CUDA backend for all tensor operations, distributed training via NCCL, and vLLM for inference during GRPO. Tests are conditionally skipped when CUDA is not available, and platform-specific packages (vLLM, flash-attn, bitsandbytes, liger-kernel) are excluded on macOS.
Usage
Use this environment for any training or GPU-accelerated evaluation task. All training scripts (SFT via finetune.py, DPO via dpo_tune_cache.py, GRPO via grpo_fast.py, Reward Modeling via reward_modeling.py) require CUDA. CPU-only execution is limited to dataset preprocessing, testing non-GPU paths, and utility scripts.
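As a hedged illustration of the CUDA requirement above, a fail-fast check could be written as follows. The helper name `cuda_status` is hypothetical and not part of the repository; the real scripts rely on `torch.cuda` directly.

```python
import importlib.util


def cuda_status() -> str:
    """Report whether GPU training can run here: 'ok', 'no-torch', or 'no-cuda'.

    Hypothetical helper for illustration; finetune.py, dpo_tune_cache.py,
    grpo_fast.py, and reward_modeling.py all assume CUDA is present.
    """
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch  # deferred so the check also works on machines without PyTorch

    return "ok" if torch.cuda.is_available() else "no-cuda"
```

A launcher script could call this before spawning workers and exit with a clear message instead of a late CUDA initialization error.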
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Docker base image uses nvidia/cuda:12.9.0-devel-ubuntu22.04 |
| Hardware | NVIDIA GPU with CUDA support | Minimum 1 GPU; 8x GPU nodes typical for distributed training |
| VRAM | 16GB+ per GPU | A100 (40/80GB) or H100 (80GB) recommended for large models |
| Disk | 50GB+ SSD | For model weights, datasets, and checkpoints |
| Network | InfiniBand or high-speed Ethernet | Required for multi-node training via NCCL |
Dependencies
System Packages
- CUDA Toolkit 12.9
- cuDNN (bundled with CUDA toolkit)
- NVIDIA DOCA OFED drivers (version 2.10.0 for Mellanox networking)
- Mellanox Firmware Tools (MFT version 4.31.0-149)
Python Packages
- `torch` >= 2.9.0, < 2.10 (with CUDA 12.9 backend)
- `deepspeed` >= 0.18.3
- `flash-attn` >= 2.8.3 (Linux x86_64 only)
- `bitsandbytes` >= 0.44.1 (Linux only)
- `liger-kernel` >= 0.5.4 (Linux only)
Credentials
No credentials are required for GPU access itself; the following environment variables control GPU configuration:
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible (default: "0,1,2,3,4,5,6,7")
- `NCCL_CUMEM_ENABLE`: Must be set to "0" for vLLM compatibility
- `NCCL_DEBUG`: Debug level for NCCL (typically "ERROR")
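The NCCL settings above must be in place before NCCL initializes. A minimal sketch, assuming a hypothetical helper name, of applying them without clobbering values the user already exported:

```python
import os


def apply_nccl_defaults(env: dict) -> dict:
    """Set the NCCL variables this environment expects, keeping any values
    already present. Illustrative helper, not repository code."""
    env.setdefault("NCCL_CUMEM_ENABLE", "0")  # required for vLLM compatibility
    env.setdefault("NCCL_DEBUG", "ERROR")     # keep NCCL quiet unless errors occur
    return env


# Typical use: mutate os.environ before importing torch / initializing NCCL.
apply_nccl_defaults(os.environ)
```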
Quick Install
```bash
# Install via uv (recommended)
uv sync

# Or install key GPU packages manually (quote the specifiers so the shell
# does not treat ">=" as output redirection)
pip install "torch>=2.9.0" "deepspeed>=0.18.3" "flash-attn>=2.8.3" "bitsandbytes>=0.44.1" "liger-kernel>=0.5.4"
```
Code Evidence
GPU availability check from `conftest.py:19-20`:
```python
if not torch.cuda.is_available():
    collect_ignore.extend(str(p) for p in pathlib.Path("open_instruct").glob("*_gpu.py"))
```
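The same collection-time filter can be isolated as a pure function for testing without pytest; a sketch, with an illustrative function name:

```python
import pathlib


def gpu_files_to_ignore(root: str, cuda_available: bool) -> list:
    """Return the *_gpu.py files pytest should skip collecting when no CUDA
    device is present. Mirrors the conftest.py pattern; illustrative only."""
    if cuda_available:
        return []
    return sorted(str(p) for p in pathlib.Path(root).glob("*_gpu.py"))
```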
CUDA device setup from `grpo_fast.py:202`:
```python
torch.cuda.set_device(self.local_rank)
self.device = torch.device(self.local_rank)
```
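The rank-to-device mapping shown there can be expressed as a pure function; a hedged sketch (illustrative, not the repository's code) that also covers the CPU fallback:

```python
def device_for_rank(local_rank: int, num_gpus: int) -> str:
    """Map a worker's local rank to a device string.

    Illustrative of the grpo_fast.py pattern, where each worker pins itself
    to the GPU matching its local rank; falls back to CPU when no GPUs exist.
    """
    if num_gpus <= 0:
        return "cpu"
    return f"cuda:{local_rank % num_gpus}"
```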
Platform-conditional dependencies from `pyproject.toml:11,30,34-35`:
```toml
"bitsandbytes>=0.44.1; platform_system != 'Darwin'",
"vllm==0.14.1; platform_system != 'Darwin'",
"flash-attn>=2.8.3; platform_system != 'Darwin' and platform_machine != 'aarch64'",
"liger-kernel>=0.5.4; platform_system != 'Darwin'",
```
Default GPU configuration from `utils.py:1558-1561`:
```python
cuda_visible_devices = [int(x) for x in os.environ.get(
    "CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7"
).split(",")]
```
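The default-and-split logic above can be factored into a standalone helper for testing; the function name here is an assumption, not the repository's:

```python
import os


def visible_gpu_ids(default: str = "0,1,2,3,4,5,6,7") -> list:
    """Parse CUDA_VISIBLE_DEVICES into integer GPU indices, using the same
    8-GPU-node default as utils.py. Illustrative helper."""
    return [int(x) for x in os.environ.get("CUDA_VISIBLE_DEVICES", default).split(",")]
```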
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: 0 active drivers ([])` | DeepSpeed imported on CPU-only machine | Already handled via try/except in utils.py; no action needed |
| `CUDA out of memory` | Insufficient GPU VRAM for model size | Reduce batch size, enable gradient checkpointing, or use DeepSpeed ZeRO-3 |
| Padding-free tests skipped | CUDA not available | Install NVIDIA drivers and CUDA toolkit |
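For the out-of-memory row, the DeepSpeed side of the fix is a ZeRO stage 3 configuration. A minimal sketch follows; the values are placeholders, not the repository's shipped configs:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "none" },
    "overlap_comm": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true }
}
```

Gradient checkpointing is enabled separately on the model side (e.g., via the training script's gradient-checkpointing flag), trading recomputation for activation memory.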
Compatibility Notes
- macOS (Darwin): vLLM, bitsandbytes, flash-attn, and liger-kernel are excluded. Only CPU-based dataset processing and testing is supported.
- ARM Linux (aarch64): flash-attn is not supported; PyTorch is installed from the CUDA 13.0 wheel index instead of 12.9.
- Multi-node: Requires cluster-specific NCCL configuration (InfiniBand for WEKA, FastRack for GCP).