
Environment: OpenRLHF DeepSpeed Environment

From Leeroopedia


Knowledge Sources

Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-07 10:00 GMT

Overview

DeepSpeed == 0.18.5 with ZeRO stages 0-3, tensor parallelism, and optional CPU offloading for distributed RLHF training.

Description

This environment provides the DeepSpeed distributed training framework configuration required by all non-Ray OpenRLHF training workflows (SFT, DPO, RM, KD, KTO, PRM). DeepSpeed manages ZeRO optimization stages (0-3), optimizer state partitioning, gradient accumulation, and optional CPU offloading. Tensor parallelism requires DeepSpeed >= 0.16.4, and state offloading on ZeRO stages other than 3 requires DeepSpeed > 0.17.5.
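Both version gates described above can be checked up front, before a job is launched. A minimal sketch using `packaging.version` (the same parser the OpenRLHF source uses for the offloading gate); the function name and signature are illustrative, not OpenRLHF's API:

```python
from packaging import version

def check_deepspeed_constraints(ds_version, tensor_parallel_size=1,
                                zero_stage=3, state_offload=False):
    """Mirror the two DeepSpeed version gates described above.

    Returns a list of human-readable problems; an empty list means the
    configuration is compatible with the given DeepSpeed version.
    """
    problems = []
    v = version.parse(ds_version)
    if tensor_parallel_size > 1 and v < version.parse("0.16.4"):
        problems.append("tensor parallelism requires deepspeed >= 0.16.4")
    if state_offload and zero_stage != 3 and v <= version.parse("0.17.5"):
        problems.append("state offloading off ZeRO-3 requires deepspeed > 0.17.5")
    return problems
```

For example, `check_deepspeed_constraints("0.16.2", tensor_parallel_size=2)` flags the tensor-parallel gate, while the pinned `"0.18.5"` passes both.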

Usage

Use this environment for all OpenRLHF training workflows that use the `DeepspeedStrategy`. This includes SFT, DPO, Reward Model, Knowledge Distillation, KTO, and PRM training. The PPO trainer also uses DeepSpeed for its actor and critic training components within Ray actors.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| GPU | NVIDIA CUDA GPU | Required for FusedAdam optimizer and NCCL backend |
| CPU RAM | Proportional to model size | Required when `--adam_offload` or `--offload` is enabled |
| Disk | SSD recommended | For checkpoint saving with ZeRO-aware serialization |
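As a rule of thumb for sizing host RAM: with mixed-precision Adam, offloading keeps two fp32 optimizer states plus fp32 master weights per parameter, roughly 12 bytes per parameter on the CPU when `--adam_offload` is enabled. A back-of-envelope sketch, not an exact accounting (fragmentation, gradients, and activation buffers add overhead):

```python
def offload_ram_bytes(num_params, adam_offload=True, param_offload=False):
    """Rough CPU-RAM estimate for DeepSpeed offloading.

    Assumes mixed precision: fp32 momentum + fp32 variance + fp32
    master weights (12 bytes/param) for the offloaded optimizer, and
    fp16/bf16 copies (2 bytes/param) if parameters are also offloaded.
    """
    total = 0
    if adam_offload:
        total += num_params * 12  # 4B momentum + 4B variance + 4B master
    if param_offload:
        total += num_params * 2   # fp16/bf16 parameter shards
    return total

# A 7B-parameter model with --adam_offload needs on the order of 78 GiB:
print(offload_ram_bytes(7_000_000_000) / 2**30)
```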

Dependencies

Python Packages

  • `deepspeed` == 0.18.5 (pinned in requirements.txt)
  • `torch` with distributed support
  • `peft` (for PEFT model state dict saving)
  • `torchdata` (for StatefulDataLoader)

Credentials

No additional credentials beyond the base CUDA GPU environment.

Quick Install

pip install deepspeed==0.18.5

Code Evidence

DeepSpeed version requirement for tensor parallelism from `openrlhf/utils/deepspeed/deepspeed.py:72-73`:

if self.ds_tensor_parallel_size > 1:
    assert deepspeed.version >= "0.16.4", "DeepSpeed version must be >= 0.16.4 for tensor parallel training"

State offloading version constraint from `openrlhf/utils/deepspeed/deepspeed_utils.py:151-153`:

if zero_stage != 3 and version.parse(deepspeed.__version__) <= version.parse("0.17.5"):
    raise NotImplementedError(
        "Only Zero stage 3 is currently supported when using DeepSpeed version 0.17.5 or lower"
    )

DeepCompile disabled for inference from `openrlhf/utils/deepspeed/deepspeed_utils.py:77-79`:

# At least for 0.16.6, DeepCompile hasn't support pure inference mode
# https://github.com/deepspeedai/DeepSpeed/pull/7225
deepcompile = False

ZeRO configuration with offloading from `openrlhf/utils/deepspeed/deepspeed_utils.py:20-43`:

device = "cpu" if offload else "none"
zero_opt_dict = {
    "stage": stage,
    "offload_param": {"device": device},
    "offload_optimizer": {
        "device": "cpu" if adam_offload else "none",
        "pin_memory": True,
    },
    ...
}
if stage == 3:
    zero_opt_dict["reduce_scatter"] = True
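Putting the evidence above together, the ZeRO portion of a DeepSpeed config can be assembled as follows. This is a simplified sketch of the pattern: the field names follow the DeepSpeed config schema, but real OpenRLHF configs carry additional keys elided here:

```python
def build_zero_config(stage, offload=False, adam_offload=False):
    """Build the `zero_optimization` section of a DeepSpeed config dict."""
    zero_opt_dict = {
        "stage": stage,
        "offload_param": {"device": "cpu" if offload else "none"},
        "offload_optimizer": {
            "device": "cpu" if adam_offload else "none",
            "pin_memory": True,
        },
    }
    if stage == 3:
        # reduce_scatter is used for gradient communication under ZeRO-3
        zero_opt_dict["reduce_scatter"] = True
    return {"zero_optimization": zero_opt_dict}
```

For example, `build_zero_config(3, adam_offload=True)` yields a stage-3 config with optimizer states pinned in CPU memory.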

Optimizer selection from `openrlhf/utils/deepspeed/deepspeed.py:138`:

AdamOptimizer = DeepSpeedCPUAdam if self.adam_offload else FusedAdam

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `DeepSpeed version must be >= 0.16.4 for tensor parallel training` | Old DeepSpeed with `--ds_tensor_parallel_size > 1` | Upgrade to `deepspeed >= 0.16.4` |
| `Only Zero stage 3 is currently supported when using DeepSpeed version 0.17.5 or lower` | State offloading on a ZeRO stage other than 3 with old DeepSpeed | Upgrade to `deepspeed > 0.17.5` or use `--zero_stage 3` |
| DeepCompile inference error | DeepCompile is not supported for pure inference | Automatically disabled; no action needed |

Compatibility Notes

  • ZeRO Stage 3: Enables `reduce_scatter` and requires `GatheredParameters` context for parameter access.
  • CPU Offloading: `--adam_offload` moves optimizer states to CPU; `--offload` moves parameters to CPU. When adam_offload is active, additional state offloading is skipped automatically.
  • DeepCompile: Disabled for inference mode as of DeepSpeed 0.16.6. Only usable during training.
  • Tensor Parallelism: Requires DeepSpeed >= 0.16.4 and bf16 dtype.
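The ZeRO Stage 3 note above matters in practice: under stage 3, parameters are sharded across ranks, so any code that reads full weights (e.g. checkpoint saving) must wrap the access in `deepspeed.zero.GatheredParameters`. A hedged sketch of the dispatch pattern; the `contextlib.nullcontext` fallback for other stages is an illustration, not OpenRLHF's exact code:

```python
import contextlib

def param_access_context(params, zero_stage):
    """Context manager under which full parameter values are readable."""
    if zero_stage == 3:
        import deepspeed  # imported lazily; only needed for ZeRO-3
        # Gather the sharded parameters onto rank 0 for the duration
        # of the context, then re-partition on exit.
        return deepspeed.zero.GatheredParameters(params, modifier_rank=0)
    # Stages 0-2 keep full parameters on every rank; no gathering needed.
    return contextlib.nullcontext()
```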
