Environment:Huggingface Open r1 CUDA Environment

Knowledge Sources	Open R1 PyTorch CUDA
Domains	Infrastructure, Deep_Learning
Last Updated	2026-02-08 00:00 GMT

Overview

Linux environment with CUDA 12.4, Python 3.10.9+, PyTorch 2.6.0, and a reference node of 8x NVIDIA H100 (80GB) GPUs for training and inference.

Description

This environment provides the GPU-accelerated context required for all Open-R1 training and inference workflows. It is built around PyTorch 2.6.0 with CUDA 12.4 and includes the full HuggingFace ecosystem (Transformers, TRL, Accelerate, DeepSpeed). The hardware baseline assumes a node of 8x NVIDIA H100 (80GB) GPUs, though smaller configurations are possible with batch size and gradient accumulation adjustments. Mixed precision training uses bf16 throughout. The environment supports distributed training strategies including FSDP, DDP, and DeepSpeed ZeRO stages 2 and 3.
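The batch size and gradient accumulation adjustment for smaller configurations comes down to keeping the effective global batch size constant when the GPU count changes. The following is a minimal sketch of that arithmetic, assuming the usual relation global batch = GPUs x per-device batch x accumulation steps. The function name `grad_accum_steps` and the numbers in the example are illustrative and are not taken from the Open-R1 recipes.

```python
def grad_accum_steps(target_global_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Return the accumulation steps that preserve the target global batch size.

    Hypothetical helper: assumes
    global batch = num_gpus * per_device_batch * accumulation steps.
    """
    samples_per_step = num_gpus * per_device_batch
    if samples_per_step <= 0:
        raise ValueError("num_gpus and per_device_batch must be positive")
    if target_global_batch % samples_per_step != 0:
        raise ValueError(
            "target global batch is not a multiple of num_gpus * per_device_batch"
        )
    return target_global_batch // samples_per_step


# Illustrative numbers only: a recipe tuned for 8 GPUs x 16 samples x 1 step.
reference_global_batch = 8 * 16 * 1

# Moving to 2 GPUs with the same per-device batch needs 4 accumulation steps.
steps_on_2_gpus = grad_accum_steps(reference_global_batch, num_gpus=2, per_device_batch=16)
print(steps_on_2_gpus)  # 4
```

If the per-device batch also has to shrink to fit in memory, pass the smaller value and the accumulation steps grow accordingly.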

Usage

Use this environment for any Model Training (SFT or GRPO), Pass Rate Filtering, or Model Loading workflow that requires GPU acceleration. It is the mandatory prerequisite for running the SFTTrainer, GRPOTrainer, and vLLM-based pass rate computation implementations.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	README shows uv-based installation on Linux
Hardware	NVIDIA GPU with CUDA support	Minimum 1 GPU; reference config is 8x H100 80GB
CUDA	12.4	Segmentation faults occur with mismatched CUDA versions; verify with `nvcc --version`
Python	>= 3.10.9	Specified in `setup.py` `python_requires`
Disk	50GB+ SSD	For model weights, datasets, and checkpoints
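
The Python and CUDA rows above can be checked before installing anything. This is a rough sketch rather than part of Open-R1: the helper names are hypothetical, and it assumes the `nvcc --version` output contains a `release X.Y` token, as current CUDA toolkits print.

```python
import re
import sys

REQUIRED_PYTHON = (3, 10, 9)   # from python_requires=">=3.10.9"
REQUIRED_CUDA = "12.4"         # from the README note on CUDA 12.4


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter is at least Python 3.10.9."""
    return tuple(version_info[:3]) >= REQUIRED_PYTHON


def parse_nvcc_release(nvcc_output: str):
    """Extract 'X.Y' from the 'release X.Y' token in nvcc output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_ok(nvcc_output: str) -> bool:
    """True if nvcc reports the CUDA release this environment expects."""
    return parse_nvcc_release(nvcc_output) == REQUIRED_CUDA


# Example with a typical last line of `nvcc --version` output.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(python_ok(), cuda_ok(sample))
```

In practice the `nvcc_output` string would come from running `nvcc --version`, for example with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.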

Dependencies

System Packages

cuda-toolkit = 12.4
git-lfs (for model/dataset push to Hub)

Python Packages (Core)

torch == 2.6.0
transformers == 4.52.3
trl[vllm] == 0.18.0
accelerate == 1.4.0
deepspeed == 0.16.8
datasets >= 3.2.0
bitsandbytes >= 0.43.0
peft >= 0.14.0
einops >= 0.8.0
liger-kernel >= 0.5.10
safetensors >= 0.3.3
sentencepiece >= 0.1.99
huggingface-hub[cli,hf_xet] >= 0.30.2, < 1.0
wandb >= 0.19.1
math-verify == 0.5.2
latex2sympy2_extended >= 1.0.6
packaging >= 23.0
hf_transfer >= 0.1.4
langdetect
async-lru >= 2.0.5
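
Because several of the core packages are pinned exactly, it can help to confirm the pins after installation. The sketch below uses only the standard library's `importlib.metadata`; the function names are hypothetical, only the exact (`==`) pins from the list above are checked, and a local version suffix such as `+cu124` is ignored on the assumption that some PyTorch wheels report one.

```python
from importlib import metadata

# Exact pins copied from the core package list above.
EXACT_PINS = {
    "torch": "2.6.0",
    "transformers": "4.52.3",
    "trl": "0.18.0",
    "accelerate": "1.4.0",
    "deepspeed": "0.16.8",
    "math-verify": "0.5.2",
}


def installed_version(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(pins=EXACT_PINS, lookup=installed_version) -> dict:
    """Map package name -> (wanted, found) for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Drop a local version label such as "+cu124" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_pin_mismatches().items():
    print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every exact pin matches; the range constraints (`>=`) in the list are not covered by this sketch.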

Python Packages (Optional Extras)

vllm == 0.8.5.post1 (must be installed separately, before installing the Open-R1 package itself)
flash-attn (install with --no-build-isolation)
lighteval (for evaluation; pinned to specific git commit)
e2b-code-interpreter >= 1.0.5 (for code reward execution)
morphcloud == 0.1.67 (for MorphCloud code execution)

Credentials

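The following credentials are required, either as environment variables or through the corresponding login command:

HF_TOKEN: HuggingFace API token for model/dataset access and Hub push (alternatively, authenticate with `huggingface-cli login`, which stores the token locally).
WANDB_API_KEY: Weights & Biases API key for experiment logging (alternatively, authenticate with `wandb login`, which stores the key locally).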

Optional credentials (depending on workflow):

WANDB_ENTITY: W&B entity name (can be set via training config).
WANDB_PROJECT: W&B project name (can be set via training config).
WANDB_RUN_GROUP: W&B run group (can be set via training config).
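
A job launcher can fail early when credentials are absent from the environment. This is a small sketch with a hypothetical helper name; it only inspects environment variables, so it will report `HF_TOKEN` or `WANDB_API_KEY` as missing even when a token was stored through the login commands instead.

```python
import os

REQUIRED_VARS = ("HF_TOKEN", "WANDB_API_KEY")
OPTIONAL_VARS = ("WANDB_ENTITY", "WANDB_PROJECT", "WANDB_RUN_GROUP")


def missing_vars(names, env=None) -> list:
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing_required = missing_vars(REQUIRED_VARS)
missing_optional = missing_vars(OPTIONAL_VARS)

if missing_required:
    print("Missing required credentials:", ", ".join(missing_required))
if missing_optional:
    print("Unset optional variables:", ", ".join(missing_optional))
```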

Quick Install

# Create virtual environment
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip

# Install vLLM first (this also installs PyTorch 2.6.0 as a dependency)
uv pip install vllm==0.8.5.post1

# Install FlashAttention
uv pip install setuptools && uv pip install flash-attn --no-build-isolation

# Install Open-R1 with all development extras
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"

# Login to HuggingFace and W&B
huggingface-cli login
wandb login

Code Evidence

CUDA 12.4 requirement from README.md:51-52:

Libraries rely on CUDA 12.4. If you see errors related to segmentation faults,
double check the version your system is running with `nvcc --version`.

Python version constraint from setup.py:137:

python_requires=">=3.10.9",

PyTorch version pin from setup.py:70:

"torch==2.6.0",

vLLM version constraint from README.md:72-76:

uv pip install vllm==0.8.5.post1
This will also install PyTorch v2.6.0 and it is very important to use this
version since the vLLM binaries are compiled for it.

Hardware assumption from README.md:104:

The training commands below are configured for a node of 8 x H100s (80GB).
For different hardware and topologies, you may need to tune the batch size
and number of gradient accumulation steps.

Common Errors

Error Message	Cause	Solution
Segmentation fault during import	CUDA version mismatch	Verify CUDA 12.4 is installed: `nvcc --version`
`ImportError: No module named 'flash_attn'`	FlashAttention not installed	`uv pip install setuptools && uv pip install flash-attn --no-build-isolation`
`RuntimeError: CUDA out of memory`	Insufficient GPU VRAM	Reduce `per_device_train_batch_size` or enable gradient checkpointing (`--gradient_checkpointing`)
Stale `open_r1.egg-info` warning	Leftover build artifacts after update	The `setup.py` auto-removes this directory; safe to ignore
`uv` cache warnings	Default link mode incompatible on cluster	Add `export UV_LINK_MODE=copy` to `.bashrc`

Compatibility Notes

PyTorch version: Must be exactly 2.6.0. The vLLM 0.8.5.post1 binaries are compiled against this version; mismatched PyTorch will cause runtime failures.
CUDA version: Must be 12.4. Other CUDA versions can cause segmentation faults.
bf16 precision: All accelerate configs use mixed_precision: bf16. This requires Ampere (A100) or newer GPUs.
8-GPU default: All provided accelerate configs (fsdp.yaml, zero2.yaml, zero3.yaml) default to num_processes: 8. Adjust for your hardware.
FSDP activation checkpointing: Currently disabled in fsdp.yaml pending a Transformers fix (PR #36610).
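
The bf16 note above can be checked on a given machine. The sketch below assumes that "Ampere or newer" corresponds to CUDA compute capability 8.0 and above (A100 reports 8.0, H100 reports 9.0); the helper names are hypothetical, and the PyTorch import is deferred so the capability check itself runs without a GPU.

```python
def supports_bf16(compute_capability) -> bool:
    """True for compute capability 8.0+ (assumed here to mean Ampere or newer)."""
    major, _minor = compute_capability
    return major >= 8


def local_gpu_report():
    """List (index, name, bf16_ok) per visible GPU; None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i), supports_bf16(torch.cuda.get_device_capability(i)))
        for i in range(torch.cuda.device_count())
    ]


report = local_gpu_report()
if report is None:
    print("PyTorch is not installed; cannot inspect GPUs")
else:
    # The number of visible GPUs is also the value to use for num_processes.
    print(f"{len(report)} GPU(s) visible")
    for index, name, bf16_ok in report:
        print(index, name, "bf16 ok" if bf16_ok else "bf16 not supported")
```

The GPU count printed here is the number to substitute for the default `num_processes: 8` when running on a smaller node.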
