Environment:Allenai Open Instruct CUDA GPU Training

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

An NVIDIA CUDA GPU environment is required for all training and GPU-accelerated operations in Open Instruct.

Description

All training workflows (SFT, DPO, GRPO, Reward Modeling) require NVIDIA CUDA GPUs. The repository uses PyTorch with CUDA backend for all tensor operations, distributed training via NCCL, and vLLM for inference during GRPO. Tests are conditionally skipped when CUDA is not available, and platform-specific packages (vLLM, flash-attn, bitsandbytes, liger-kernel) are excluded on macOS.

Usage

Use this environment for any training or GPU-accelerated evaluation task. All training scripts (SFT via finetune.py, DPO via dpo_tune_cache.py, GRPO via grpo_fast.py, Reward Modeling via reward_modeling.py) require CUDA. CPU-only execution is limited to dataset preprocessing, testing non-GPU paths, and utility scripts.
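Before launching any of these scripts, a quick preflight check can fail fast on a CPU-only machine instead of crashing mid-setup. This is a minimal sketch: `preflight_check` and its `min_gpus` parameter are illustrative helpers, not part of the Open Instruct API.

```python
import torch


def preflight_check(min_gpus: int = 1) -> None:
    """Fail fast before launching a CUDA-only training script.

    Hypothetical helper: mirrors the rule stated above that SFT, DPO,
    GRPO, and Reward Modeling all require CUDA, while only dataset
    preprocessing and utility scripts run on CPU.
    """
    if not torch.cuda.is_available():
        raise RuntimeError(
            "CUDA is required for training (SFT/DPO/GRPO/RM); "
            "CPU-only execution is limited to preprocessing and utilities."
        )
    n = torch.cuda.device_count()
    if n < min_gpus:
        raise RuntimeError(f"Need at least {min_gpus} GPU(s), found {n}")
    print(f"OK: {n} CUDA device(s) visible")
```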

System Requirements

  • OS: Ubuntu 22.04 LTS (the Docker base image is nvidia/cuda:12.9.0-devel-ubuntu22.04)
  • Hardware: NVIDIA GPU with CUDA support; minimum 1 GPU, with 8x GPU nodes typical for distributed training
  • VRAM: 16GB+ per GPU; A100 (40/80GB) or H100 (80GB) recommended for large models
  • Disk: 50GB+ SSD for model weights, datasets, and checkpoints
  • Network: InfiniBand or high-speed Ethernet, required for multi-node training via NCCL

Dependencies

System Packages

  • CUDA Toolkit 12.9
  • cuDNN (bundled with CUDA toolkit)
  • NVIDIA DOCA OFED drivers (version 2.10.0 for Mellanox networking)
  • Mellanox Firmware Tools (MFT version 4.31.0-149)

Python Packages

  • `torch` >= 2.9.0, < 2.10 (with CUDA 12.9 backend)
  • `deepspeed` >= 0.18.3
  • `flash-attn` >= 2.8.3 (Linux x86_64 only)
  • `bitsandbytes` >= 0.44.1 (Linux only)
  • `liger-kernel` >= 0.5.4 (Linux only)

Credentials

The following environment variables are relevant for GPU configuration:

  • `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible (default: "0,1,2,3,4,5,6,7")
  • `NCCL_CUMEM_ENABLE`: Must be set to "0" for vLLM compatibility
  • `NCCL_DEBUG`: Debug level for NCCL (typically "ERROR")
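These variables can be pinned programmatically before torch or vLLM is imported, so a launcher script and its workers see consistent settings. A hedged sketch, assuming the defaults quoted above; `configure_gpu_env` is a hypothetical helper, not part of the repository:

```python
import os


def configure_gpu_env(devices: str = "0,1,2,3,4,5,6,7") -> dict:
    """Set the GPU-related environment variables described above.

    Uses setdefault so explicit user settings are never clobbered.
    """
    env = {
        "CUDA_VISIBLE_DEVICES": devices,
        "NCCL_CUMEM_ENABLE": "0",  # required for vLLM compatibility
        "NCCL_DEBUG": "ERROR",     # typical debug level per this page
    }
    for key, value in env.items():
        os.environ.setdefault(key, value)
    return env
```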

Quick Install

# Install via uv (recommended)
uv sync

# Or install key GPU packages manually (quote the specifiers so the
# shell does not treat ">" as output redirection)
pip install "torch>=2.9.0" "deepspeed>=0.18.3" "flash-attn>=2.8.3" "bitsandbytes>=0.44.1" "liger-kernel>=0.5.4"

Code Evidence

GPU availability check from `conftest.py:19-20`:

if not torch.cuda.is_available():
    collect_ignore.extend(str(p) for p in pathlib.Path("open_instruct").glob("*_gpu.py"))

CUDA device setup from `grpo_fast.py:202`:

torch.cuda.set_device(self.local_rank)
self.device = torch.device(self.local_rank)

Platform-conditional dependencies from `pyproject.toml:11,30,34-35`:

"bitsandbytes>=0.44.1; platform_system != 'Darwin'",
"vllm==0.14.1; platform_system != 'Darwin'",
"flash-attn>=2.8.3; platform_system != 'Darwin' and platform_machine != 'aarch64'",
"liger-kernel>=0.5.4; platform_system != 'Darwin'",

Default GPU configuration from `utils.py:1558-1561`:

cuda_visible_devices = [int(x) for x in os.environ.get(
    "CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7"
).split(",")]
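The same parsing pattern can be exercised standalone. A minimal sketch; `visible_devices` is a hypothetical wrapper name, not the function in utils.py:

```python
import os


def visible_devices(default: str = "0,1,2,3,4,5,6,7") -> list[int]:
    # Mirrors the utils.py pattern quoted above: read CUDA_VISIBLE_DEVICES,
    # fall back to all eight GPUs, and return integer device indices.
    return [int(x) for x in os.environ.get("CUDA_VISIBLE_DEVICES", default).split(",")]
```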

Common Errors

  • `RuntimeError: 0 active drivers ([])`: DeepSpeed imported on a CPU-only machine. Already handled via try/except in utils.py; no action needed.
  • `CUDA out of memory`: insufficient GPU VRAM for the model size. Reduce batch size, enable gradient checkpointing, or use DeepSpeed ZeRO-3.
  • Padding-free tests skipped: CUDA not available. Install NVIDIA drivers and the CUDA toolkit.

Compatibility Notes

  • macOS (Darwin): vLLM, bitsandbytes, flash-attn, and liger-kernel are excluded. Only CPU-based dataset processing and testing are supported.
  • ARM Linux (aarch64): flash-attn is not supported. Uses PyTorch with CUDA 13.0 index instead of 12.9.
  • Multi-node: Requires cluster-specific NCCL configuration (InfiniBand for WEKA, FastRack for GCP).
