Environment: Alibaba ROLL DeepSpeed Training Environment
| Knowledge Sources | |
|---|---|
| Field | Value |
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
Microsoft DeepSpeed training backend environment with ZeRO optimization (stages 0-3), CPU offloading, and gradient checkpointing for memory-efficient distributed LLM training.
Description
This environment provides the DeepSpeed distributed training backend for ROLL. DeepSpeed's ZeRO (Zero Redundancy Optimizer) partitions optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across data parallel ranks to reduce per-GPU memory. The framework includes custom patches for offload state management that allow GPU-to-CPU state migration during non-training phases. The offload implementation currently supports optimizer parameters only (not gradients), and requires careful cleanup of extra references to avoid memory leaks.
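The per-GPU savings from each ZeRO stage can be sketched with a rough memory estimate. The sketch below assumes a mixed-precision Adam setup (fp16 parameters and gradients at 2 bytes each, plus 12 bytes/param of fp32 optimizer state: master copy, momentum, and variance); the function and its constants are illustrative, not ROLL or DeepSpeed code.

```python
def zero_memory_per_gpu(num_params: float, world_size: int, stage: int) -> dict:
    """Rough per-GPU memory (bytes) under a given ZeRO stage.

    Illustrative only: assumes fp16 params/grads (2 B each) and
    fp32 Adam states (master + momentum + variance = 12 B/param).
    """
    params = 2.0 * num_params      # fp16 parameters
    grads = 2.0 * num_params       # fp16 gradients
    optim = 12.0 * num_params      # fp32 optimizer states
    if stage >= 1:
        optim /= world_size        # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= world_size        # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= world_size       # ZeRO-3: also shard parameters
    return {"params": params, "grads": grads, "optimizer": optim}
```

For a 7B-parameter model on 8 GPUs, this puts ZeRO-1's dominant term (the 84 GB of optimizer state) at ~10.5 GB per GPU, while ZeRO-3 also divides the parameter and gradient copies.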
Usage
Use this environment when training with the deepspeed strategy backend. DeepSpeed is the most broadly compatible training backend, supporting NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU. Choose the ZeRO stage based on model size and available GPU memory.
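For the most memory-constrained case, a DeepSpeed config combining ZeRO-3 with CPU offloading looks roughly like the sketch below. This mirrors what a file like `deepspeed_zero3_cpuoffload.yaml` likely configures; the specific values (batch size, bf16, clipping) are assumptions for illustration, not ROLL's shipped defaults.

```python
# Sketch of a ZeRO-3 + CPU-offload DeepSpeed config (standard DeepSpeed
# config schema; values here are illustrative assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # Move optimizer states and parameters to host memory,
        # trading PCIe traffic for GPU VRAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}
```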
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA, AMD, or Ascend GPU | Cross-platform support |
| VRAM | Depends on ZeRO stage | ZeRO-3 + CPU offload uses least VRAM |
Dependencies
Python Packages
- `deepspeed` == 0.16.4
- `torch` >= 2.6.0
- All common dependencies from `requirements_common.txt`
Credentials
No additional credentials required beyond the base CUDA/ROCm/NPU environment.
Quick Install
pip install deepspeed==0.16.4
Code Evidence
Optimizer state offloading from `roll/distributed/strategy/deepspeed_strategy.py:456`:
# TODO: The offload option may be integrated into the pipeline config in the future.
is_offload_optimizer_states_in_train_step = data.meta_info.get(
"is_offload_optimizer_states_in_train_step", True
)
Offload limitation note from `roll/third_party/deepspeed/offload_states_patch.py:183`:
# NOTE: Only supports offloading optimizer parameters (not gradients)
KV cache control from `roll/distributed/strategy/deepspeed_strategy.py:184,228`:
# Training: set use_cache=False to save memory
use_cache=False,
# Inference: set use_cache=True for faster generation
use_cache=True,
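The pattern behind these two snippets is a phase-dependent toggle: the KV cache only pays off during autoregressive decoding, so it is disabled under training to save memory. A minimal sketch of that selection (hypothetical helper, not ROLL's code):

```python
def forward_kwargs(phase: str) -> dict:
    """Select use_cache per phase, as in the snippets above.

    Training never reuses past key/values, so caching them only
    costs memory; generation reuses them every decode step.
    """
    if phase == "train":
        return {"use_cache": False}  # save activation memory
    return {"use_cache": True}       # faster incremental decoding
```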
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` during training | ZeRO stage too low for model size | Raise the ZeRO stage (e.g. 2 -> 3) or enable CPU offloading |
| Slow training with ZeRO-3 | Parameter gathering overhead | Use ZeRO-2 if model fits, or enable offload only for optimizer |
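Before raising the ZeRO stage, gradient checkpointing (mentioned in the Overview) is a common first fix for OOM, since it trades recomputation for activation memory. The sketch below uses the Hugging Face-style `gradient_checkpointing_enable()` API as an assumption about how ROLL's model objects expose it:

```python
def enable_memory_savers(model):
    """Sketch: turn on gradient checkpointing and disable the KV cache.

    Assumes an HF-style model exposing gradient_checkpointing_enable()
    and a .config.use_cache attribute; ROLL may route this through its
    own pipeline config instead.
    """
    if hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()  # recompute activations in backward
    # Checkpointing and KV caching conflict during training, so the
    # cache is disabled alongside it (see the use_cache snippets above).
    model.config.use_cache = False
    return model
```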
Compatibility Notes
- Cross-platform: Works on NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU.
- ZeRO Stages: Pre-configured YAML files: `deepspeed_zero.yaml`, `deepspeed_zero2.yaml`, `deepspeed_zero3.yaml`, `deepspeed_zero3_cpuoffload.yaml`.
- LoRA: Compatible with LoRA fine-tuning; check compatibility settings.
- Offload: Optimizer state offloading is enabled by default in `train_step` (see the `is_offload_optimizer_states_in_train_step` flag above).
- Diffusion: Used as the primary backend for Reward Flow Diffusion pipeline.