Environment:Huggingface Open r1 CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Linux environment with CUDA 12.4, Python 3.10.9+, PyTorch 2.6.0, and 8x NVIDIA H100 (80GB) GPUs for training and inference.
Description
This environment provides the GPU-accelerated context required for all Open-R1 training and inference workflows. It is built around PyTorch 2.6.0 with CUDA 12.4 and includes the full HuggingFace ecosystem (Transformers, TRL, Accelerate, DeepSpeed). The hardware baseline assumes a node of 8x NVIDIA H100 (80GB) GPUs, though smaller configurations are possible with batch size and gradient accumulation adjustments. Mixed precision training uses bf16 throughout. The environment supports distributed training strategies including FSDP, DDP, and DeepSpeed ZeRO stages 2 and 3.
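The batch size and gradient accumulation adjustment mentioned above can be sketched as simple arithmetic: keep the global batch size (GPUs x per-device batch x accumulation steps) roughly constant when moving to a smaller node. The function name and the example numbers below are hypothetical and are not taken from the Open-R1 configs.

```python
import math


def scale_grad_accum(ref_gpus, ref_per_device_batch, ref_grad_accum,
                     new_gpus, new_per_device_batch):
    """Return (accumulation steps, effective global batch) for new hardware.

    The step count is rounded up, so the effective batch can be slightly
    larger than the reference when the sizes do not divide evenly.
    """
    global_batch = ref_gpus * ref_per_device_batch * ref_grad_accum
    per_step = new_gpus * new_per_device_batch
    steps = math.ceil(global_batch / per_step)
    return steps, steps * per_step


# Hypothetical reference run: 8 GPUs, per-device batch 16, no accumulation.
# Target: 2 GPUs that only fit a per-device batch of 8.
steps, effective = scale_grad_accum(8, 16, 1, 2, 8)
print(steps, effective)  # 8 128
```

The resulting step count would typically be passed as `gradient_accumulation_steps`, alongside a reduced `per_device_train_batch_size`.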
Usage
Use this environment for any Model Training (SFT or GRPO), Pass Rate Filtering, or Model Loading workflow that requires GPU acceleration. It is the mandatory prerequisite for running the SFTTrainer, GRPOTrainer, and vLLM-based pass rate computation implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | README shows uv-based installation on Linux |
| Hardware | NVIDIA GPU with CUDA support | Minimum 1 GPU; reference config is 8x H100 80GB |
| CUDA | 12.4 | Segmentation faults occur with mismatched CUDA versions; verify with `nvcc --version` |
| Python | >= 3.10.9 | Specified in `setup.py` `python_requires` |
| Disk | 50GB+ SSD | For model weights, datasets, and checkpoints |
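A minimal pre-flight sketch for the Python and CUDA rows of the table, assuming `nvcc` is on the `PATH` and prints its usual `release X.Y` line. The helper names are illustrative and are not part of Open-R1.

```python
import re
import subprocess
import sys

REQUIRED_PYTHON = (3, 10, 9)  # from python_requires in setup.py
REQUIRED_CUDA = "12.4"


def python_ok(version_info=sys.version_info):
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:3]) >= REQUIRED_PYTHON


def parse_nvcc_release(text):
    """Extract 'major.minor' from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_cuda():
    """Return the CUDA toolkit version reported by nvcc, or None if unavailable."""
    try:
        result = subprocess.run(["nvcc", "--version"],
                                capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


print("Python OK:", python_ok())
cuda = installed_cuda()
print("CUDA:", cuda, "(expected", REQUIRED_CUDA + ")")
```

Running this before installation gives an early signal for the version mismatch that the table associates with segmentation faults.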
Dependencies
System Packages
- `cuda-toolkit` = 12.4
- `git-lfs` (for model/dataset push to Hub)
Python Packages (Core)
- `torch` == 2.6.0
- `transformers` == 4.52.3
- `trl[vllm]` == 0.18.0
- `accelerate` == 1.4.0
- `deepspeed` == 0.16.8
- `datasets` >= 3.2.0
- `bitsandbytes` >= 0.43.0
- `peft` >= 0.14.0
- `einops` >= 0.8.0
- `liger-kernel` >= 0.5.10
- `safetensors` >= 0.3.3
- `sentencepiece` >= 0.1.99
- `huggingface-hub[cli,hf_xet]` >= 0.30.2, < 1.0
- `wandb` >= 0.19.1
- `math-verify` == 0.5.2
- `latex2sympy2_extended` >= 1.0.6
- `packaging` >= 23.0
- `hf_transfer` >= 0.1.4
- `langdetect`
- `async-lru` >= 2.0.5
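To confirm that an existing environment matches the exact pins listed above, a short check based on the standard-library `importlib.metadata` can help. This is a sketch: only a subset of the `==` pins is included, and the function names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the exact (==) pins listed above.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.52.3",
    "trl": "0.18.0",
    "accelerate": "1.4.0",
    "deepspeed": "0.16.8",
    "math-verify": "0.5.2",
}


def _installed(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=_installed):
    """Map package name -> (expected, found) for every pin that is not satisfied."""
    problems = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


for name, (expected, found) in find_mismatches().items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty output means every pinned package in the subset is installed at the expected version.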
Python Packages (Optional Extras)
- `vllm` == 0.8.5.post1 (must be installed separately before the package)
- `flash-attn` (install with `--no-build-isolation`)
- `lighteval` (for evaluation; pinned to specific git commit)
- `e2b-code-interpreter` >= 1.0.5 (for code reward execution)
- `morphcloud` == 0.1.67 (for MorphCloud code execution)
Credentials
The following environment variables must be set:
- `HF_TOKEN`: HuggingFace API token for model/dataset access and Hub push (set via `huggingface-cli login`).
- `WANDB_API_KEY`: Weights & Biases API key for experiment logging (set via `wandb login`).
Optional credentials (depending on workflow):
- `WANDB_ENTITY`: W&B entity name (can be set via training config).
- `WANDB_PROJECT`: W&B project name (can be set via training config).
- `WANDB_RUN_GROUP`: W&B run group (can be set via training config).
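A launch script could verify the variables above before starting a long job. The sketch below only inspects environment variables; it assumes nothing about tokens cached by the login commands, which may make an unset variable acceptable in practice. The helper name is hypothetical.

```python
import os

REQUIRED_VARS = ("HF_TOKEN", "WANDB_API_KEY")
OPTIONAL_VARS = ("WANDB_ENTITY", "WANDB_PROJECT", "WANDB_RUN_GROUP")


def missing_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_vars(REQUIRED_VARS)
if missing:
    print("Missing required variables:", ", ".join(missing))

unset_optional = missing_vars(OPTIONAL_VARS)
if unset_optional:
    print("Optional variables not set:", ", ".join(unset_optional))
```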
Quick Install
# Create virtual environment
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
# Install vLLM first (bundles PyTorch 2.6.0)
uv pip install vllm==0.8.5.post1
# Install FlashAttention
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
# Install Open-R1 with all development extras
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
# Login to HuggingFace and W&B
huggingface-cli login
wandb login
Code Evidence
CUDA 12.4 requirement from README.md:51-52:
Libraries rely on CUDA 12.4. If you see errors related to segmentation faults,
double check the version your system is running with `nvcc --version`.
Python version constraint from setup.py:137:
python_requires=">=3.10.9",
PyTorch version pin from setup.py:70:
"torch==2.6.0",
vLLM version constraint from README.md:72-76:
uv pip install vllm==0.8.5.post1
This will also install PyTorch v2.6.0 and it is very important to use this
version since the vLLM binaries are compiled for it.
Hardware assumption from README.md:104:
The training commands below are configured for a node of 8 x H100s (80GB).
For different hardware and topologies, you may need to tune the batch size
and number of gradient accumulation steps.
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Segmentation fault during import | CUDA version mismatch | Verify CUDA 12.4 is installed: `nvcc --version` |
| `ImportError: No module named 'flash_attn'` | FlashAttention not installed | `uv pip install setuptools && uv pip install flash-attn --no-build-isolation` |
| `RuntimeError: CUDA out of memory` | Insufficient GPU VRAM | Reduce `per_device_train_batch_size` or enable gradient checkpointing (`--gradient_checkpointing`) |
| stale `open_r1.egg-info` warning | Leftover build artifacts after update | The `setup.py` auto-removes this directory; safe to ignore |
| `uv` cache warnings | Default link mode incompatible on cluster | Add `export UV_LINK_MODE=copy` to `.bashrc` |
Compatibility Notes
- PyTorch version: Must be exactly 2.6.0. The vLLM 0.8.5.post1 binaries are compiled against this version; mismatched PyTorch will cause runtime failures.
- CUDA version: Must be 12.4. Other CUDA versions can cause segmentation faults.
- bf16 precision: All accelerate configs use `mixed_precision: bf16`. This requires Ampere (A100) or newer GPUs.
- 8-GPU default: All provided accelerate configs (`fsdp.yaml`, `zero2.yaml`, `zero3.yaml`) default to `num_processes: 8`. Adjust for your hardware.
- FSDP activation checkpointing: Currently disabled in `fsdp.yaml` pending a Transformers fix (PR #36610).
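The bf16 note can be checked on a given machine with a sketch like the following. It relies on the fact that Ampere and newer NVIDIA GPUs report a CUDA compute capability of 8.0 or higher, and it uses `torch.cuda.get_device_capability` when PyTorch and a GPU are available. The helper names are illustrative.

```python
def capability_supports_bf16(major, minor=0):
    """Ampere and newer GPUs report compute capability 8.0 or higher."""
    return (major, minor) >= (8, 0)


def check_local_gpus():
    """Map GPU index -> bf16 suitability, or None if no CUDA GPU is usable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        index: capability_supports_bf16(*torch.cuda.get_device_capability(index))
        for index in range(torch.cuda.device_count())
    }


print(check_local_gpus())
```

The number of entries in the returned mapping is also a reasonable starting point when adjusting `num_processes` for nodes with fewer than 8 GPUs.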