Environment: OpenRLHF vLLM Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Inference, Distributed_Training |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
vLLM (> 0.8.5 required; pinned at 0.15.0 by default) with Ray integration for high-throughput generation in PPO and online RL workflows.
Description
This environment provides the vLLM inference engine required by OpenRLHF's PPO and online RL training workflows. vLLM handles generation via PagedAttention for efficient memory usage and supports tensor parallelism, sleep mode for memory conservation, and CUDA IPC for fast weight synchronization. The engine runs as a Ray actor and integrates with the training loop for on-policy and off-policy generation.
Usage
Use this environment for PPO Training, Math-GRPO Training, Rejection Sampling, and Iterative DPO workflows that require online generation. vLLM is not needed for offline training workflows like SFT, RM, DPO, or KD.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU | NVIDIA CUDA GPU | Required for vLLM inference |
| GPU Memory | Sufficient for model + KV cache | Configurable via `--vllm_gpu_memory_utilization` (default 0.95) |
| Network | NCCL-capable interconnect | For weight sync between trainer and vLLM engines |
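To build intuition for the GPU-memory row above: vLLM claims `--vllm_gpu_memory_utilization` of total VRAM, and the KV cache roughly gets whatever remains after the model weights. The helper below is a simplified back-of-the-envelope sketch, not vLLM's actual memory profiler:

```python
def kv_cache_budget_gib(total_vram_gib: float,
                        model_weights_gib: float,
                        gpu_memory_utilization: float = 0.95) -> float:
    """Rough KV-cache budget: vLLM claims `gpu_memory_utilization` of
    total VRAM; the KV cache gets what is left after model weights.
    (Ignores activation and CUDA-graph overhead, which vLLM also profiles.)"""
    claimed = total_vram_gib * gpu_memory_utilization
    return max(claimed - model_weights_gib, 0.0)

# Example: 80 GiB GPU, ~14 GiB of weights (e.g. a 7B model in bf16)
budget = kv_cache_budget_gib(80.0, 14.0, 0.95)  # ≈ 62 GiB for KV cache
```

If generation OOMs, lowering the utilization fraction shrinks `claimed`, which is exactly the fix listed in Common Errors below.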
Dependencies
Python Packages
- `vllm` == 0.15.0 (default extra) or `vllm` > 0.15.0 (latest extra)
- `ray` == 2.48.0 (required for Ray actor backend)
- `packaging` (for version comparisons)
Credentials
The following environment variables are configured automatically by the vLLM engine:
- `CUDA_VISIBLE_DEVICES`: GPU device assignment for non-Ray executor backends
- `VLLM_RAY_PER_WORKER_GPUS`: Number of GPUs per vLLM worker
- `VLLM_RAY_BUNDLE_INDICES`: Comma-separated bundle indices for Ray placement
- `VLLM_ALLOW_INSECURE_SERIALIZATION`: Set to "1" for vLLM >= 0.9.0
- `VLLM_ENABLE_V1_MULTIPROCESSING`: Set to "0" for full determinism mode
- `VLLM_USE_V1`: Set to "1" to use V1 engine
- `RAY_ADDRESS`: Auto-detected from Ray global worker if not set
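A hedged sketch of how these variables could be assembled before engine start. The variable names come from the list above, but the grouping into one helper function is illustrative; OpenRLHF sets them at several points internally rather than in a single place:

```python
import os

def build_vllm_env(vllm_version: tuple,
                   full_determinism: bool,
                   bundle_indices: list,
                   gpus_per_worker: float) -> dict:
    """Collect the vLLM-related environment variables into one dict.
    Illustrative only: mirrors the settings documented above."""
    env = {
        "VLLM_RAY_PER_WORKER_GPUS": str(gpus_per_worker),
        "VLLM_RAY_BUNDLE_INDICES": ",".join(str(i) for i in bundle_indices),
        "VLLM_USE_V1": "1",
    }
    if vllm_version >= (0, 9, 0):
        # vLLM >= 0.9.0 requires this flag for weight-sync serialization
        env["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
    if full_determinism:
        # Full determinism mode disables V1 multiprocessing
        env["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    return env

env = build_vllm_env((0, 15, 0), full_determinism=True,
                     bundle_indices=[0, 1], gpus_per_worker=0.2)
os.environ.update(env)
```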
Quick Install
```bash
# Install with the default vLLM version
pip install "openrlhf[vllm]"

# Or install with the latest vLLM
pip install "openrlhf[vllm_latest]"

# Or install vLLM directly
pip install vllm==0.15.0
```
Code Evidence
Minimum version assertion from `openrlhf/trainer/ray/vllm_engine.py:86-88`:
```python
assert version.parse(vllm.__version__) > version.parse(
    "0.8.5"
), "Streaming VLLM version must be greater than 0.8.5"
```
Logprobs mode version requirement from `openrlhf/trainer/ray/vllm_engine.py:254-256`:
```python
assert version.parse(vllm.__version__) > version.parse(
    "0.10.0"
), "vLLM > 0.10.0 is required for logprobs_mode"
```
Version-dependent serialization from `openrlhf/trainer/ray/vllm_engine.py:90-91`:
```python
if version.parse(vllm.__version__) >= version.parse("0.9.0"):
    os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
```
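The same `packaging`-based gating pattern can be reused to guard any version-dependent feature. A small sketch (the thresholds mirror the assertions quoted above; the helper name is ours, not OpenRLHF's):

```python
from packaging import version

def check_min_vllm(installed: str, minimum: str, feature: str) -> None:
    """Raise if the installed vLLM is not strictly newer than `minimum`,
    mirroring the strict-inequality assertions in vllm_engine.py."""
    if version.parse(installed) <= version.parse(minimum):
        raise RuntimeError(f"vLLM > {minimum} is required for {feature}")

check_min_vllm("0.15.0", "0.8.5", "streaming generation")
check_min_vllm("0.15.0", "0.10.0", "logprobs_mode")
```

Note that the checks are strict (`>`), so a version exactly equal to the minimum still fails.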
Device env configuration hack from `openrlhf/trainer/ray/vllm_engine.py:67-73`:
```python
if backend == "ray":
    # a hack to make the script work.
    # stop ray from manipulating *_VISIBLE_DEVICES
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    os.environ.pop("ROCR_VISIBLE_DEVICES", None)
    os.environ.pop("HIP_VISIBLE_DEVICES", None)
```
Distributed executor backend selection from `openrlhf/trainer/ray/vllm_engine.py:200-202`:
```python
distributed_executor_backend = "uni" if tensor_parallel_size == 1 else "ray"
use_hybrid_engine = shared_pg is not None
num_gpus = int(tensor_parallel_size == 1)
```
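Restated as a standalone function for readability. This is an illustrative paraphrase of the selection logic above, with the 0.2-GPU hybrid-engine fraction taken from the Compatibility Notes below rather than from the quoted lines:

```python
def select_executor(tensor_parallel_size: int, shared_pg,
                    hybrid_gpu_fraction: float = 0.2):
    """Single-GPU engines use the 'uni' executor; multi-GPU engines use
    'ray'. A colocated (hybrid) engine claims only a fraction of a GPU
    in the Ray placement group so it can share the device with training."""
    backend = "uni" if tensor_parallel_size == 1 else "ray"
    use_hybrid_engine = shared_pg is not None
    num_gpus = int(tensor_parallel_size == 1)
    if use_hybrid_engine and tensor_parallel_size == 1:
        num_gpus = hybrid_gpu_fraction  # 0.2 per the compatibility notes
    return backend, num_gpus
```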
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Streaming VLLM version must be greater than 0.8.5` | vLLM version too old | `pip install "vllm>=0.9.0"` (quote the specifier so the shell does not treat `>` as a redirect) |
| `vLLM > 0.10.0 is required for logprobs_mode` | Old vLLM with logprobs feature | Upgrade to `vllm > 0.10.0` |
| GPU memory allocation failure | vLLM KV cache exceeds VRAM | Reduce `--vllm_gpu_memory_utilization` (default 0.95) |
| `Agent module must contain AgentExecutor class` | Custom agent missing required class | Ensure agent Python file has `AgentExecutor` inheriting from `AgentExecutorBase` |
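For the last error, the custom agent file must define a class literally named `AgentExecutor` that derives from OpenRLHF's `AgentExecutorBase`. A minimal skeleton follows; the base class and the `step` signature are stand-ins defined locally for illustration, so consult the OpenRLHF agent API for the real interface:

```python
# Stand-in for OpenRLHF's AgentExecutorBase, defined locally so this
# sketch is self-contained; the real base class lives inside OpenRLHF.
class AgentExecutorBase:
    async def step(self, observation, action):
        raise NotImplementedError

# The loader checks for a class named exactly `AgentExecutor`.
class AgentExecutor(AgentExecutorBase):
    async def step(self, observation, action):
        # Hypothetical single-step environment: constant reward, done.
        return {"rewards": 0.0, "done": True, "observation": observation}
```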
Compatibility Notes
- Tensor Parallelism: Single GPU uses "uni" backend; multi-GPU uses "ray" backend. GPU count in placement group adjusts automatically (0.2 for hybrid engine, 1 otherwise).
- Hybrid Engine: When `--colocate_all_models` is set, vLLM shares GPU resources with training via placement groups and sleep mode.
- Weight Sync: Supports NCCL backend (default) with optional CUDA IPC for colocated non-async training.
- AMD/ROCm: Code removes `ROCR_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` for Ray backend, suggesting partial AMD awareness.