Environment: OpenRLHF vLLM Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Inference, Distributed_Training |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
vLLM (> 0.8.5 required; pinned at 0.15.0 by default) with Ray integration for high-throughput generation in PPO and online RL workflows.
Description
This environment provides the vLLM inference engine required by OpenRLHF's PPO and online RL training workflows. vLLM handles generation via PagedAttention for efficient memory usage and supports tensor parallelism, sleep mode for memory conservation, and CUDA IPC for fast weight synchronization. The engine runs as a Ray actor and integrates with the training loop for on-policy and off-policy generation.
Usage
Use this environment for PPO Training, Math-GRPO Training, Rejection Sampling, and Iterative DPO workflows that require online generation. vLLM is not needed for offline training workflows like SFT, RM, DPO, or KD.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU | NVIDIA CUDA GPU | Required for vLLM inference |
| GPU Memory | Sufficient for model + KV cache | Configurable via `--vllm_gpu_memory_utilization` (default 0.95) |
| Network | NCCL-capable interconnect | For weight sync between trainer and vLLM engines |
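To build intuition for the GPU-memory row above: vLLM claims `--vllm_gpu_memory_utilization` of total VRAM, and the KV cache roughly gets whatever remains after the model weights. The helper below is a simplified back-of-the-envelope sketch, not vLLM's actual memory profiler:

```python
def kv_cache_budget_gib(total_vram_gib: float,
                        model_weights_gib: float,
                        gpu_memory_utilization: float = 0.95) -> float:
    """Rough KV-cache budget: vLLM claims `gpu_memory_utilization` of
    total VRAM; the KV cache gets what is left after model weights.
    (Ignores activation and CUDA-graph overhead, which vLLM also profiles.)"""
    claimed = total_vram_gib * gpu_memory_utilization
    return max(claimed - model_weights_gib, 0.0)

# Example: 80 GiB GPU, ~14 GiB of weights (e.g. a 7B model in bf16)
budget = kv_cache_budget_gib(80.0, 14.0, 0.95)  # ≈ 62 GiB for KV cache
```

If generation OOMs, lowering the utilization fraction shrinks `claimed`, which is exactly the fix listed in Common Errors below.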
Dependencies
Python Packages
- `vllm` == 0.15.0 (default extra) or `vllm` > 0.15.0 (latest extra)
- `ray` == 2.48.0 (required for Ray actor backend)
- `packaging` (for version comparisons)
Credentials
The following environment variables are configured automatically by the vLLM engine:
- `CUDA_VISIBLE_DEVICES`: GPU device assignment for non-Ray executor backends
- `VLLM_RAY_PER_WORKER_GPUS`: Number of GPUs per vLLM worker
- `VLLM_RAY_BUNDLE_INDICES`: Comma-separated bundle indices for Ray placement
- `VLLM_ALLOW_INSECURE_SERIALIZATION`: Set to "1" for vLLM >= 0.9.0
- `VLLM_ENABLE_V1_MULTIPROCESSING`: Set to "0" for full determinism mode
- `VLLM_USE_V1`: Set to "1" to use V1 engine
- `RAY_ADDRESS`: Auto-detected from Ray global worker if not set
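A hedged sketch of how these variables could be assembled before engine start. The variable names come from the list above, but the grouping into one helper function is illustrative; OpenRLHF sets them at several points internally rather than in a single place:

```python
import os

def build_vllm_env(vllm_version: tuple,
                   full_determinism: bool,
                   bundle_indices: list,
                   gpus_per_worker: float) -> dict:
    """Collect the vLLM-related environment variables into one dict.
    Illustrative only: mirrors the settings documented above."""
    env = {
        "VLLM_RAY_PER_WORKER_GPUS": str(gpus_per_worker),
        "VLLM_RAY_BUNDLE_INDICES": ",".join(str(i) for i in bundle_indices),
        "VLLM_USE_V1": "1",
    }
    if vllm_version >= (0, 9, 0):
        # vLLM >= 0.9.0 requires this flag for weight-sync serialization
        env["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
    if full_determinism:
        # Full determinism mode disables V1 multiprocessing
        env["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    return env

env = build_vllm_env((0, 15, 0), full_determinism=True,
                     bundle_indices=[0, 1], gpus_per_worker=0.2)
os.environ.update(env)
```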
Quick Install
```bash
# Install with the default vLLM version
pip install "openrlhf[vllm]"

# Or install with the latest vLLM
pip install "openrlhf[vllm_latest]"

# Or install vLLM directly
pip install vllm==0.15.0
```
Code Evidence
Minimum version assertion from `openrlhf/trainer/ray/vllm_engine.py:86-88`:
```python
assert version.parse(vllm.__version__) > version.parse(
    "0.8.5"
), "Streaming VLLM version must be greater than 0.8.5"
```
Logprobs mode version requirement from `openrlhf/trainer/ray/vllm_engine.py:254-256`:
```python
assert version.parse(vllm.__version__) > version.parse(
    "0.10.0"
), "vLLM > 0.10.0 is required for logprobs_mode"
```
Version-dependent serialization from `openrlhf/trainer/ray/vllm_engine.py:90-91`:
```python
if version.parse(vllm.__version__) >= version.parse("0.9.0"):
    os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
```
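The same `packaging`-based gating pattern can be reused to guard any version-dependent feature. A small sketch (the thresholds mirror the assertions quoted above; the helper name is ours, not OpenRLHF's):

```python
from packaging import version

def check_min_vllm(installed: str, minimum: str, feature: str) -> None:
    """Raise if the installed vLLM is not strictly newer than `minimum`,
    mirroring the strict-inequality assertions in vllm_engine.py."""
    if version.parse(installed) <= version.parse(minimum):
        raise RuntimeError(f"vLLM > {minimum} is required for {feature}")

check_min_vllm("0.15.0", "0.8.5", "streaming generation")
check_min_vllm("0.15.0", "0.10.0", "logprobs_mode")
```

Note that the checks are strict (`>`), so a version exactly equal to the minimum still fails.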
Device env configuration hack from `openrlhf/trainer/ray/vllm_engine.py:67-73`:
```python
if backend == "ray":
    # a hack to make the script work.
    # stop ray from manipulating *_VISIBLE_DEVICES
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    os.environ.pop("ROCR_VISIBLE_DEVICES", None)
    os.environ.pop("HIP_VISIBLE_DEVICES", None)
```
Distributed executor backend selection from `openrlhf/trainer/ray/vllm_engine.py:200-202`:
```python
distributed_executor_backend = "uni" if tensor_parallel_size == 1 else "ray"
use_hybrid_engine = shared_pg is not None
num_gpus = int(tensor_parallel_size == 1)
```
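Restated as a standalone function for readability. This is an illustrative paraphrase of the selection logic above, with the 0.2-GPU hybrid-engine fraction taken from the Compatibility Notes below rather than from the quoted lines:

```python
def select_executor(tensor_parallel_size: int, shared_pg,
                    hybrid_gpu_fraction: float = 0.2):
    """Single-GPU engines use the 'uni' executor; multi-GPU engines use
    'ray'. A colocated (hybrid) engine claims only a fraction of a GPU
    in the Ray placement group so it can share the device with training."""
    backend = "uni" if tensor_parallel_size == 1 else "ray"
    use_hybrid_engine = shared_pg is not None
    num_gpus = int(tensor_parallel_size == 1)
    if use_hybrid_engine and tensor_parallel_size == 1:
        num_gpus = hybrid_gpu_fraction  # 0.2 per the compatibility notes
    return backend, num_gpus
```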
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Streaming VLLM version must be greater than 0.8.5` | vLLM version too old | `pip install "vllm>=0.9.0"` (quote the specifier so the shell does not treat `>` as a redirect) |
| `vLLM > 0.10.0 is required for logprobs_mode` | Old vLLM with logprobs feature | Upgrade to `vllm > 0.10.0` |
| GPU memory allocation failure | vLLM KV cache exceeds VRAM | Reduce `--vllm_gpu_memory_utilization` (default 0.95) |
| `Agent module must contain AgentExecutor class` | Custom agent missing required class | Ensure agent Python file has `AgentExecutor` inheriting from `AgentExecutorBase` |
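For the last error, the custom agent file must define a class literally named `AgentExecutor` that derives from OpenRLHF's `AgentExecutorBase`. A minimal skeleton follows; the base class and the `step` signature are stand-ins defined locally for illustration, so consult the OpenRLHF agent API for the real interface:

```python
# Stand-in for OpenRLHF's AgentExecutorBase, defined locally so this
# sketch is self-contained; the real base class lives inside OpenRLHF.
class AgentExecutorBase:
    async def step(self, observation, action):
        raise NotImplementedError

# The loader checks for a class named exactly `AgentExecutor`.
class AgentExecutor(AgentExecutorBase):
    async def step(self, observation, action):
        # Hypothetical single-step environment: constant reward, done.
        return {"rewards": 0.0, "done": True, "observation": observation}
```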
Compatibility Notes
- Tensor Parallelism: Single GPU uses "uni" backend; multi-GPU uses "ray" backend. GPU count in placement group adjusts automatically (0.2 for hybrid engine, 1 otherwise).
- Hybrid Engine: When `--colocate_all_models` is set, vLLM shares GPU resources with training via placement groups and sleep mode.
- Weight Sync: Supports NCCL backend (default) with optional CUDA IPC for colocated non-async training.
- AMD/ROCm: Code removes `ROCR_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` for Ray backend, suggesting partial AMD awareness.