Environment: Alibaba ROLL DeepSpeed Training Environment
| Knowledge Sources | |
|---|---|
| Field | Value |
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
Microsoft DeepSpeed training backend environment with ZeRO optimization (stages 0-3), CPU offloading, and gradient checkpointing for memory-efficient distributed LLM training.
Description
This environment provides the DeepSpeed distributed training backend for ROLL. DeepSpeed's ZeRO (Zero Redundancy Optimizer) partitions optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across data parallel ranks to reduce per-GPU memory. The framework includes custom patches for offload state management that allow GPU-to-CPU state migration during non-training phases. The offload implementation currently supports optimizer parameters only (not gradients), and requires careful cleanup of extra references to avoid memory leaks.
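The per-GPU savings from each ZeRO stage can be sketched with a rough memory estimate. The sketch below assumes a mixed-precision Adam setup (fp16 parameters and gradients at 2 bytes each, plus 12 bytes/param of fp32 optimizer state: master copy, momentum, and variance); the function and its constants are illustrative, not ROLL or DeepSpeed code.

```python
def zero_memory_per_gpu(num_params: float, world_size: int, stage: int) -> dict:
    """Rough per-GPU memory (bytes) under a given ZeRO stage.

    Illustrative only: assumes fp16 params/grads (2 B each) and
    fp32 Adam states (master + momentum + variance = 12 B/param).
    """
    params = 2.0 * num_params      # fp16 parameters
    grads = 2.0 * num_params       # fp16 gradients
    optim = 12.0 * num_params      # fp32 optimizer states
    if stage >= 1:
        optim /= world_size        # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= world_size        # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= world_size       # ZeRO-3: also shard parameters
    return {"params": params, "grads": grads, "optimizer": optim}
```

For a 7B-parameter model on 8 GPUs, this puts ZeRO-1's dominant term (the 84 GB of optimizer state) at ~10.5 GB per GPU, while ZeRO-3 also divides the parameter and gradient copies.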
Usage
Use this environment when training with the deepspeed strategy backend. DeepSpeed is the most broadly compatible training backend, supporting NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU. Choose the ZeRO stage based on model size and available GPU memory.
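For the most memory-constrained case, a DeepSpeed config combining ZeRO-3 with CPU offloading looks roughly like the sketch below. This mirrors what a file like `deepspeed_zero3_cpuoffload.yaml` likely configures; the specific values (batch size, bf16, clipping) are assumptions for illustration, not ROLL's shipped defaults.

```python
# Sketch of a ZeRO-3 + CPU-offload DeepSpeed config (standard DeepSpeed
# config schema; values here are illustrative assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # Move optimizer states and parameters to host memory,
        # trading PCIe traffic for GPU VRAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}
```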
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA, AMD, or Ascend GPU | Cross-platform support |
| VRAM | Depends on ZeRO stage | ZeRO-3 + CPU offload uses least VRAM |
Dependencies
Python Packages
- `deepspeed` == 0.16.4
- `torch` >= 2.6.0
- All common dependencies from `requirements_common.txt`
Credentials
No additional credentials required beyond the base CUDA/ROCm/NPU environment.
Quick Install
pip install deepspeed==0.16.4
Code Evidence
Optimizer state offloading from `roll/distributed/strategy/deepspeed_strategy.py:456`:
# TODO: The offload option may be integrated into the pipeline config in the future.
is_offload_optimizer_states_in_train_step = data.meta_info.get(
"is_offload_optimizer_states_in_train_step", True
)
Offload limitation note from `roll/third_party/deepspeed/offload_states_patch.py:183`:
# NOTE: Only supports offloading optimizer parameters (not gradients)
KV cache control from `roll/distributed/strategy/deepspeed_strategy.py:184,228`:
# Training: set use_cache=False to save memory
use_cache=False,
# Inference: set use_cache=True for faster generation
use_cache=True,
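The pattern behind these two snippets is a phase-dependent toggle: the KV cache only pays off during autoregressive decoding, so it is disabled under training to save memory. A minimal sketch of that selection (hypothetical helper, not ROLL's code):

```python
def forward_kwargs(phase: str) -> dict:
    """Select use_cache per phase, as in the snippets above.

    Training never reuses past key/values, so caching them only
    costs memory; generation reuses them every decode step.
    """
    if phase == "train":
        return {"use_cache": False}  # save activation memory
    return {"use_cache": True}       # faster incremental decoding
```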
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` during training | ZeRO stage too low for model size | Raise the ZeRO stage (e.g. 2 -> 3) or enable CPU offloading |
| Slow training with ZeRO-3 | Parameter gathering overhead | Use ZeRO-2 if model fits, or enable offload only for optimizer |
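Before raising the ZeRO stage, gradient checkpointing (mentioned in the Overview) is a common first fix for OOM, since it trades recomputation for activation memory. The sketch below uses the Hugging Face-style `gradient_checkpointing_enable()` API as an assumption about how ROLL's model objects expose it:

```python
def enable_memory_savers(model):
    """Sketch: turn on gradient checkpointing and disable the KV cache.

    Assumes an HF-style model exposing gradient_checkpointing_enable()
    and a .config.use_cache attribute; ROLL may route this through its
    own pipeline config instead.
    """
    if hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()  # recompute activations in backward
    # Checkpointing and KV caching conflict during training, so the
    # cache is disabled alongside it (see the use_cache snippets above).
    model.config.use_cache = False
    return model
```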
Compatibility Notes
- Cross-platform: Works on NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU.
- ZeRO Stages: Pre-configured YAML files: `deepspeed_zero.yaml`, `deepspeed_zero2.yaml`, `deepspeed_zero3.yaml`, `deepspeed_zero3_cpuoffload.yaml`.
- LoRA: Compatible with LoRA fine-tuning; check compatibility settings.
- Offload: Optimizer state offloading is enabled by default in `train_step` (see the `is_offload_optimizer_states_in_train_step` flag above).
- Diffusion: Used as the primary backend for Reward Flow Diffusion pipeline.