Environment:Huggingface Open r1 CUDA Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning
Last Updated: 2026-02-08 00:00 GMT

Overview

Linux environment with CUDA 12.4, Python 3.10.9 or newer, PyTorch 2.6.0, and a reference node of 8x NVIDIA H100 (80GB) GPUs for training and inference.

Description

This environment provides the GPU-accelerated context required for all Open-R1 training and inference workflows. It is built around PyTorch 2.6.0 with CUDA 12.4 and includes the full HuggingFace ecosystem (Transformers, TRL, Accelerate, DeepSpeed). The hardware baseline assumes a node of 8x NVIDIA H100 (80GB) GPUs, though smaller configurations are possible with batch size and gradient accumulation adjustments. Mixed precision training uses bf16 throughout. The environment supports distributed training strategies including FSDP, DDP, and DeepSpeed ZeRO stages 2 and 3.
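
The batch size and gradient accumulation adjustment mentioned above can be sketched as simple arithmetic. This is a minimal illustration, not code from the Open-R1 repository: the helper name `grad_accum_for` and the reference values (per-device batch of 16, one accumulation step) are assumptions chosen for the example. It keeps the global batch (GPUs x per-device batch x accumulation steps) roughly constant when fewer GPUs are available.

```python
import math


def grad_accum_for(num_gpus: int, per_device_batch: int, target_global_batch: int) -> int:
    """Return the accumulation steps needed to reach at least the target global batch."""
    if num_gpus < 1 or per_device_batch < 1 or target_global_batch < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_step = num_gpus * per_device_batch
    # Round up so the effective batch never falls below the target.
    return math.ceil(target_global_batch / samples_per_step)


# Hypothetical reference setup: 8 GPUs x per-device batch 16 x 1 accumulation step.
reference_global_batch = 8 * 16 * 1

print(grad_accum_for(2, 16, reference_global_batch))  # 4
print(grad_accum_for(1, 4, reference_global_batch))   # 32
```

Because the result is rounded up, the effective batch can exceed the target slightly when the numbers do not divide evenly (for example, 3 GPUs with a per-device batch of 16 give 144 rather than 128).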

Usage

Use this environment for any Model Training (SFT or GRPO), Pass Rate Filtering, or Model Loading workflow that requires GPU acceleration. It is the mandatory prerequisite for running the SFTTrainer, GRPOTrainer, and vLLM-based pass rate computation implementations.

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu recommended) | README shows uv-based installation on Linux
Hardware | NVIDIA GPU with CUDA support | Minimum 1 GPU; reference config is 8x H100 80GB
CUDA | 12.4 | Segmentation faults occur with mismatched CUDA versions; verify with nvcc --version
Python | >= 3.10.9 | Specified in setup.py python_requires
Disk | 50GB+ SSD | For model weights, datasets, and checkpoints
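
The CUDA and Python rows above can be checked before installing anything. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and the parser assumes that `nvcc --version` prints a line containing `release <major>.<minor>`, which is the usual format but is not specified by this page.

```python
import re
import shutil
import subprocess
import sys
from typing import Optional, Tuple

EXPECTED_CUDA = "12.4"
MIN_PYTHON = (3, 10, 9)


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the 'major.minor' release from nvcc --version output, if present."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Return the toolkit release reported by nvcc, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


def python_ok(version: Tuple[int, int, int] = tuple(sys.version_info[:3])) -> bool:
    """Check the interpreter against the python_requires floor."""
    return tuple(version) >= MIN_PYTHON


if __name__ == "__main__":
    found = installed_cuda_release()
    print(f"CUDA toolkit: {found} (expected {EXPECTED_CUDA})")
    print(f"Python meets >= 3.10.9: {python_ok()}")
```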

Dependencies

System Packages

  • cuda-toolkit = 12.4
  • git-lfs (for model/dataset push to Hub)

Python Packages (Core)

  • torch == 2.6.0
  • transformers == 4.52.3
  • trl[vllm] == 0.18.0
  • accelerate == 1.4.0
  • deepspeed == 0.16.8
  • datasets >= 3.2.0
  • bitsandbytes >= 0.43.0
  • peft >= 0.14.0
  • einops >= 0.8.0
  • liger-kernel >= 0.5.10
  • safetensors >= 0.3.3
  • sentencepiece >= 0.1.99
  • huggingface-hub[cli,hf_xet] >= 0.30.2, < 1.0
  • wandb >= 0.19.1
  • math-verify == 0.5.2
  • latex2sympy2_extended >= 1.0.6
  • packaging >= 23.0
  • hf_transfer >= 0.1.4
  • langdetect
  • async-lru >= 2.0.5
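
To confirm that the exact pins in this list are what ended up in the environment, a small check with the standard library's `importlib.metadata` may help. This is an illustrative sketch, not part of Open-R1: the function names are hypothetical, only the `==` pins are covered, and a local version suffix such as `+cu124` is ignored on the assumption that PyTorch wheels may report one.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Exact pins copied from the core package list above.
PINS = {
    "torch": "2.6.0",
    "transformers": "4.52.3",
    "trl": "0.18.0",
    "accelerate": "1.4.0",
    "deepspeed": "0.16.8",
    "math-verify": "0.5.2",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched package to (wanted, found); found is None when not installed."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Drop a local version label such as '+cu124' before comparing.
        normalized = found.split("+")[0] if found else None
        if normalized != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

Passing a custom `lookup` makes the comparison logic testable without installing the packages.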

Python Packages (Optional Extras)

  • vllm == 0.8.5.post1 (must be installed separately, before the Open-R1 package itself)
  • flash-attn (install with --no-build-isolation)
  • lighteval (for evaluation; pinned to specific git commit)
  • e2b-code-interpreter >= 1.0.5 (for code reward execution)
  • morphcloud == 0.1.67 (for MorphCloud code execution)

Credentials

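The following credentials are required:

  • HF_TOKEN: HuggingFace API token for model/dataset access and Hub push. Export it as an environment variable, or authenticate with huggingface-cli login, which stores the token locally.
  • WANDB_API_KEY: Weights & Biases API key for experiment logging. Export it as an environment variable, or authenticate with wandb login.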

Optional credentials (depending on workflow):

  • WANDB_ENTITY: W&B entity name (can be set via training config).
  • WANDB_PROJECT: W&B project name (can be set via training config).
  • WANDB_RUN_GROUP: W&B run group (can be set via training config).
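
A job script can fail fast when credentials are absent from the environment. The sketch below is an assumption-labeled example (the helper `missing_vars` is hypothetical). It inspects environment variables only, so a token stored on disk by `huggingface-cli login` or `wandb login` will not be detected by it.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED = ("HF_TOKEN", "WANDB_API_KEY")
OPTIONAL = ("WANDB_ENTITY", "WANDB_PROJECT", "WANDB_RUN_GROUP")


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in the given environment mapping."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    required_missing = missing_vars(REQUIRED)
    if required_missing:
        print("Missing required variables:", ", ".join(required_missing))
    optional_missing = missing_vars(OPTIONAL)
    if optional_missing:
        print("Unset optional variables:", ", ".join(optional_missing))
```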

Quick Install

# Create virtual environment
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip

# Install vLLM first (bundles PyTorch 2.6.0)
uv pip install vllm==0.8.5.post1

# Install FlashAttention
uv pip install setuptools && uv pip install flash-attn --no-build-isolation

# Install Open-R1 with all development extras
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"

# Login to HuggingFace and W&B
huggingface-cli login
wandb login

Code Evidence

CUDA 12.4 requirement from README.md:51-52:

Libraries rely on CUDA 12.4. If you see errors related to segmentation faults,
double check the version your system is running with `nvcc --version`.

Python version constraint from setup.py:137:

python_requires=">=3.10.9",

PyTorch version pin from setup.py:70:

"torch==2.6.0",

vLLM version constraint from README.md:72-76:

uv pip install vllm==0.8.5.post1
This will also install PyTorch v2.6.0 and it is very important to use this
version since the vLLM binaries are compiled for it.

Hardware assumption from README.md:104:

The training commands below are configured for a node of 8 x H100s (80GB).
For different hardware and topologies, you may need to tune the batch size
and number of gradient accumulation steps.

Common Errors

Error Message | Cause | Solution
Segmentation fault during import | CUDA version mismatch | Verify CUDA 12.4 is installed: nvcc --version
ImportError: No module named 'flash_attn' | FlashAttention not installed | uv pip install setuptools && uv pip install flash-attn --no-build-isolation
RuntimeError: CUDA out of memory | Insufficient GPU VRAM | Reduce per_device_train_batch_size or enable gradient checkpointing (--gradient_checkpointing)
stale open_r1.egg-info warning | Leftover build artifacts after update | The setup.py auto-removes this directory; safe to ignore
uv cache warnings | Default link mode incompatible on cluster | Add export UV_LINK_MODE=copy to .bashrc

Compatibility Notes

  • PyTorch version: Must be exactly 2.6.0. The vLLM 0.8.5.post1 binaries are compiled against this version; mismatched PyTorch will cause runtime failures.
  • CUDA version: Must be 12.4. Other CUDA versions can cause segmentation faults.
  • bf16 precision: All accelerate configs use mixed_precision: bf16. This requires Ampere-generation (e.g. A100) or newer GPUs.
  • 8-GPU default: All provided accelerate configs (fsdp.yaml, zero2.yaml, zero3.yaml) default to num_processes: 8. Adjust for your hardware.
  • FSDP activation checkpointing: Currently disabled in fsdp.yaml pending a Transformers fix (PR #36610).
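
The bf16 and GPU-count notes above can be checked on a given machine with a short script. This is a sketch rather than anything shipped with Open-R1: the function names are hypothetical, and it assumes that "Ampere or newer" corresponds to a CUDA compute capability major version of 8 or higher. PyTorch is imported lazily so the capability rule can be exercised without a GPU.

```python
from typing import List, Tuple

AMPERE_MAJOR = 8  # assumption: Ampere GPUs report compute capability 8.x


def capability_supports_bf16(capability: Tuple[int, int]) -> bool:
    """Apply the 'Ampere or newer' rule to a (major, minor) compute capability."""
    major, _minor = capability
    return major >= AMPERE_MAJOR


def local_gpu_report() -> List[Tuple[str, bool]]:
    """Return (device name, bf16-capable) per visible GPU; empty if none or no torch."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    report = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        report.append((torch.cuda.get_device_name(index), capability_supports_bf16(capability)))
    return report


if __name__ == "__main__":
    gpus = local_gpu_report()
    print(f"Visible GPUs: {len(gpus)}")
    for name, bf16_ok in gpus:
        print(f"{name}: bf16 supported = {bf16_ok}")
```

The number of entries returned is a reasonable starting value for num_processes when the default of 8 does not match the hardware.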
