OpenRLHF DeepSpeed Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
DeepSpeed == 0.18.5 with ZeRO stages 0-3, tensor parallelism, and optional CPU offloading for distributed RLHF training.
Description
This environment provides the DeepSpeed distributed training framework configuration required by all non-Ray OpenRLHF training workflows (SFT, DPO, RM, KD, KTO, PRM). DeepSpeed manages ZeRO optimization stages (0-3), optimizer state partitioning, gradient accumulation, and optional CPU offloading. Tensor parallelism requires DeepSpeed >= 0.16.4, and state offloading on ZeRO stages other than 3 requires DeepSpeed > 0.17.5.
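The two version constraints above can be checked up front before launching a job. A minimal sketch, assuming the installed version is passed in as a string; the function name and its parameters are illustrative, not part of OpenRLHF:

```python
from packaging import version


def check_deepspeed_version(ds_version: str,
                            tensor_parallel_size: int = 1,
                            zero_stage: int = 3,
                            offload_states: bool = False) -> None:
    """Raise early if the installed DeepSpeed cannot support the requested features."""
    v = version.parse(ds_version)
    # Tensor parallelism requires DeepSpeed >= 0.16.4.
    if tensor_parallel_size > 1 and v < version.parse("0.16.4"):
        raise RuntimeError("DeepSpeed >= 0.16.4 required for tensor parallel training")
    # State offloading on ZeRO stages other than 3 requires DeepSpeed > 0.17.5.
    if offload_states and zero_stage != 3 and v <= version.parse("0.17.5"):
        raise RuntimeError("DeepSpeed > 0.17.5 required for state offloading on ZeRO stages other than 3")
```

Using `packaging.version` for the comparison avoids the pitfalls of lexicographic string comparison (e.g. `"0.9.0" > "0.16.4"` is true as strings).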
Usage
Use this environment for all OpenRLHF training workflows that use the `DeepspeedStrategy`. This includes SFT, DPO, Reward Model, Knowledge Distillation, KTO, and PRM training. The PPO trainer also uses DeepSpeed for its actor and critic training components within Ray actors.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU | NVIDIA CUDA GPU | Required for FusedAdam optimizer and NCCL backend |
| CPU RAM | Proportional to model size | Required when `--adam_offload` or `--offload` enabled |
| Disk | SSD recommended | For checkpoint saving with ZeRO-aware serialization |
Dependencies
Python Packages
- `deepspeed` == 0.18.5 (pinned in requirements.txt)
- `torch` with distributed support
- `peft` (for PEFT model state dict saving)
- `torchdata` (for StatefulDataLoader)
Credentials
No additional credentials beyond the base CUDA GPU environment.
Quick Install
```shell
pip install deepspeed==0.18.5
```
Code Evidence
DeepSpeed version requirement for tensor parallelism from `openrlhf/utils/deepspeed/deepspeed.py:72-73`:
```python
if self.ds_tensor_parallel_size > 1:
    assert deepspeed.version >= "0.16.4", "DeepSpeed version must be >= 0.16.4 for tensor parallel training"
```
State offloading version constraint from `openrlhf/utils/deepspeed/deepspeed_utils.py:151-153`:
```python
if zero_stage != 3 and version.parse(deepspeed.__version__) <= version.parse("0.17.5"):
    raise NotImplementedError(
        "Only Zero stage 3 is currently supported when using DeepSpeed version 0.17.5 or lower"
    )
```
DeepCompile disabled for inference from `openrlhf/utils/deepspeed/deepspeed_utils.py:77-79`:
```python
# At least for 0.16.6, DeepCompile hasn't support pure inference mode
# https://github.com/deepspeedai/DeepSpeed/pull/7225
deepcompile = False
```
ZeRO configuration with offloading from `openrlhf/utils/deepspeed/deepspeed_utils.py:20-43`:
```python
device = "cpu" if offload else "none"
zero_opt_dict = {
    "stage": stage,
    "offload_param": {"device": device},
    "offload_optimizer": {
        "device": "cpu" if adam_offload else "none",
        "pin_memory": True,
    },
    ...
}
if stage == 3:
    zero_opt_dict["reduce_scatter"] = True
```
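Putting the pieces together, a training-side DeepSpeed config in this style can be assembled as a plain dict. This is a sketch, not the exact OpenRLHF implementation: `get_train_ds_config` and its parameters are illustrative names, while the field names follow DeepSpeed's public JSON config schema:

```python
def get_train_ds_config(stage: int = 3,
                        offload: bool = False,
                        adam_offload: bool = False,
                        bf16: bool = True,
                        grad_accum: int = 1,
                        micro_batch: int = 1) -> dict:
    # Parameter and optimizer offload targets mirror the snippet above:
    # "cpu" when offloading is requested, "none" otherwise.
    zero_opt_dict = {
        "stage": stage,
        "offload_param": {"device": "cpu" if offload else "none"},
        "offload_optimizer": {
            "device": "cpu" if adam_offload else "none",
            "pin_memory": True,
        },
    }
    if stage == 3:
        # reduce_scatter is the standard gradient-reduction path for ZeRO-3.
        zero_opt_dict["reduce_scatter"] = True
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "bf16": {"enabled": bf16},
        "zero_optimization": zero_opt_dict,
    }
```

The resulting dict is what would be handed to `deepspeed.initialize(config=...)`.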
Optimizer selection from `openrlhf/utils/deepspeed/deepspeed.py:138`:
```python
AdamOptimizer = DeepSpeedCPUAdam if self.adam_offload else FusedAdam
```
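The selection rule can be expressed without importing DeepSpeed itself; here the optimizer classes are stood in by name (`DeepSpeedCPUAdam` and `FusedAdam` are the real DeepSpeed classes, while the helper function is purely illustrative):

```python
def select_adam_optimizer(adam_offload: bool) -> str:
    # When optimizer states are offloaded to CPU, the CPU-resident Adam
    # kernel must be used; otherwise the fused CUDA Adam is the faster choice.
    return "DeepSpeedCPUAdam" if adam_offload else "FusedAdam"
```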
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `DeepSpeed version must be >= 0.16.4 for tensor parallel training` | Old DeepSpeed with `--ds_tensor_parallel_size > 1` | Upgrade to `deepspeed >= 0.16.4` |
| `Only Zero stage 3 is currently supported when using DeepSpeed version 0.17.5 or lower` | State offloading on ZeRO stage != 3 with old DeepSpeed | Upgrade to `deepspeed > 0.17.5` or use `--zero_stage 3` |
| DeepCompile inference error | DeepCompile not supported for pure inference | Automatically disabled; no action needed |
Compatibility Notes
- ZeRO Stage 3: Enables `reduce_scatter` and requires `GatheredParameters` context for parameter access.
- CPU Offloading: `--adam_offload` moves optimizer states to CPU; `--offload` moves parameters to CPU. When `--adam_offload` is active, additional state offloading is skipped automatically.
- DeepCompile: Disabled for inference mode as of DeepSpeed 0.16.6. Only usable during training.
- Tensor Parallelism: Requires DeepSpeed >= 0.16.4 and bf16 dtype.
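The offloading interplay in the notes above reduces to a small decision rule. A sketch under the assumption that the flags map directly to booleans; the function and key names are illustrative:

```python
def plan_offloading(zero_stage: int, offload: bool, adam_offload: bool) -> dict:
    plan = {
        "params_on_cpu": offload,                 # --offload
        "optimizer_states_on_cpu": adam_offload,  # --adam_offload
        # Additional state offloading is redundant when adam_offload
        # already keeps the optimizer states on CPU, so it is skipped.
        "extra_state_offload": not adam_offload,
    }
    if zero_stage == 3:
        # ZeRO-3 enables reduce_scatter for gradient reduction.
        plan["reduce_scatter"] = True
    return plan
```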