
Environment: OpenRLHF vLLM Environment

From Leeroopedia


Domains: Infrastructure, Inference, Distributed_Training
Last Updated: 2026-02-07 10:00 GMT

Overview

vLLM > 0.8.5 (pinned at 0.15.0 in the default extra), integrated via Ray for high-throughput generation in PPO and online RL workflows.

Description

This environment provides the vLLM inference engine required by OpenRLHF's PPO and online RL training workflows. vLLM handles generation via PagedAttention for efficient memory usage and supports tensor parallelism, sleep mode for memory conservation, and CUDA IPC for fast weight synchronization. The engine runs as a Ray actor and integrates with the training loop for on-policy and off-policy generation.

Usage

Use this environment for PPO Training, Math-GRPO Training, Rejection Sampling, and Iterative DPO workflows that require online generation. vLLM is not needed for offline training workflows like SFT, RM, DPO, or KD.

System Requirements

  • GPU: NVIDIA CUDA GPU (required for vLLM inference)
  • GPU memory: enough for model weights plus KV cache; configurable via `--vllm_gpu_memory_utilization` (default 0.95)
  • Network: NCCL-capable interconnect for weight sync between the trainer and vLLM engines
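As a rough illustration of how the utilization flag bounds memory, the sketch below computes the VRAM fraction vLLM will claim for weights plus KV cache. The helper name is hypothetical, and vLLM's real accounting is more involved (it also profiles activation memory):

```python
def vllm_memory_budget_gb(total_vram_gb: float, utilization: float = 0.95) -> float:
    """VRAM budget vLLM targets for model weights + KV cache.

    Mirrors the meaning of --vllm_gpu_memory_utilization (default 0.95);
    illustrative only, not vLLM's actual accounting.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_vram_gb * utilization

# On an 80 GB GPU with the default setting, vLLM targets ~76 GB;
# lowering the flag leaves more headroom for other processes.
print(vllm_memory_budget_gb(80.0))        # 76.0
print(vllm_memory_budget_gb(24.0, 0.9))   # 21.6
```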

Dependencies

Python Packages

  • `vllm` == 0.15.0 (default extra) or `vllm` > 0.15.0 (latest extra)
  • `ray` == 2.48.0 (required for Ray actor backend)
  • `packaging` (for version comparisons)
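The `packaging` dependency exists for version gates like the ones quoted under Code Evidence. The pattern can be sketched with a simplified stand-in parser (hypothetical helpers; handles plain `X.Y.Z` strings only, unlike `packaging.version.parse`, which also understands pre-release and dev suffixes):

```python
def parse_version(v: str) -> tuple:
    # Simplified stand-in for packaging.version.parse:
    # plain "MAJOR.MINOR.PATCH" strings only.
    return tuple(int(part) for part in v.split("."))

MIN_VLLM = parse_version("0.8.5")

def check_vllm(installed: str) -> bool:
    # Strict inequality, matching the assertion in vllm_engine.py.
    return parse_version(installed) > MIN_VLLM

print(check_vllm("0.15.0"))  # True
print(check_vllm("0.8.5"))   # False
```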

Credentials

The following environment variables are configured automatically by the vLLM engine:

  • `CUDA_VISIBLE_DEVICES`: GPU device assignment for non-Ray executor backends
  • `VLLM_RAY_PER_WORKER_GPUS`: Number of GPUs per vLLM worker
  • `VLLM_RAY_BUNDLE_INDICES`: Comma-separated bundle indices for Ray placement
  • `VLLM_ALLOW_INSECURE_SERIALIZATION`: Set to "1" for vLLM >= 0.9.0
  • `VLLM_ENABLE_V1_MULTIPROCESSING`: Set to "0" for full determinism mode
  • `VLLM_USE_V1`: Set to "1" to use V1 engine
  • `RAY_ADDRESS`: Auto-detected from Ray global worker if not set
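A minimal sketch of how such variables might be set, assuming a hypothetical helper; the variable names are the real ones listed above, but OpenRLHF performs this configuration inside its vLLM engine wrapper, not through this function:

```python
import os

def configure_vllm_env(vllm_version: tuple, full_determinism: bool = False) -> None:
    """Illustrative helper mirroring the env vars listed above."""
    if vllm_version >= (0, 9, 0):
        # Needed for custom weight-sync serialization on vLLM >= 0.9.0.
        os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
    # Opt into the V1 engine.
    os.environ["VLLM_USE_V1"] = "1"
    if full_determinism:
        # V1 multiprocessing is disabled for fully deterministic runs.
        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

configure_vllm_env((0, 15, 0), full_determinism=True)
```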

Quick Install

# Install with default vLLM version
pip install openrlhf[vllm]

# Or install with latest vLLM
pip install openrlhf[vllm_latest]

# Or install directly
pip install vllm==0.15.0

Code Evidence

Minimum version assertion from `openrlhf/trainer/ray/vllm_engine.py:86-88`:

assert version.parse(vllm.__version__) > version.parse(
    "0.8.5"
), "Streaming VLLM version must be greater than 0.8.5"

Logprobs mode version requirement from `openrlhf/trainer/ray/vllm_engine.py:254-256`:

assert version.parse(vllm.__version__) > version.parse(
    "0.10.0"
), "vLLM > 0.10.0 is required for logprobs_mode"

Version-dependent serialization from `openrlhf/trainer/ray/vllm_engine.py:90-91`:

if version.parse(vllm.__version__) >= version.parse("0.9.0"):
    os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"

Device env configuration hack from `openrlhf/trainer/ray/vllm_engine.py:67-73`:

if backend == "ray":
    # a hack to make the script work.
    # stop ray from manipulating *_VISIBLE_DEVICES
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    os.environ.pop("ROCR_VISIBLE_DEVICES", None)
    os.environ.pop("HIP_VISIBLE_DEVICES", None)

Distributed executor backend selection from `openrlhf/trainer/ray/vllm_engine.py:200-202`:

distributed_executor_backend = "uni" if tensor_parallel_size == 1 else "ray"
use_hybrid_engine = shared_pg is not None
num_gpus = int(tensor_parallel_size == 1)
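The same selection logic, restated as a standalone function for illustration (the function name is hypothetical; the body mirrors the snippet above):

```python
def executor_config(tensor_parallel_size, shared_pg):
    # Single-GPU engines use the "uni" executor; multi-GPU engines
    # shard across Ray workers.
    backend = "uni" if tensor_parallel_size == 1 else "ray"
    # A shared placement group signals hybrid-engine (colocated) mode.
    use_hybrid_engine = shared_pg is not None
    # With tensor parallelism the actor itself claims 0 GPUs, since the
    # Ray workers hold them; the 0.2 fraction for hybrid engines is
    # applied elsewhere in the placement-group setup.
    num_gpus = int(tensor_parallel_size == 1)
    return backend, use_hybrid_engine, num_gpus

print(executor_config(1, None))       # ('uni', False, 1)
print(executor_config(4, object()))   # ('ray', True, 0)
```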

Common Errors

  • `Streaming VLLM version must be greater than 0.8.5`: installed vLLM is too old; upgrade, e.g. `pip install "vllm>=0.9.0"`
  • `vLLM > 0.10.0 is required for logprobs_mode`: `logprobs_mode` used with an old vLLM; upgrade to `vllm > 0.10.0`
  • GPU memory allocation failure: vLLM KV cache exceeds available VRAM; reduce `--vllm_gpu_memory_utilization` (default 0.95)
  • `Agent module must contain AgentExecutor class`: custom agent module is missing the required class; ensure the agent Python file defines an `AgentExecutor` class inheriting from `AgentExecutorBase`

Compatibility Notes

  • Tensor Parallelism: Single GPU uses "uni" backend; multi-GPU uses "ray" backend. GPU count in placement group adjusts automatically (0.2 for hybrid engine, 1 otherwise).
  • Hybrid Engine: When `--colocate_all_models` is set, vLLM shares GPU resources with training via placement groups and sleep mode.
  • Weight Sync: Supports NCCL backend (default) with optional CUDA IPC for colocated non-async training.
  • AMD/ROCm: Code removes `ROCR_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` for Ray backend, suggesting partial AMD awareness.
