
Environment:Alibaba ROLL DeepSpeed Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-07 19:00 GMT

Overview

Microsoft DeepSpeed training backend environment with ZeRO optimization (stages 0-3), CPU offloading, and gradient checkpointing for memory-efficient distributed LLM training.

Description

This environment provides the DeepSpeed distributed training backend for ROLL. DeepSpeed's ZeRO (Zero Redundancy Optimizer) partitions optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across data parallel ranks to reduce per-GPU memory. The framework includes custom patches for offload state management that allow GPU-to-CPU state migration during non-training phases. The offload implementation currently supports optimizer parameters only (not gradients), and requires careful cleanup of extra references to avoid memory leaks.
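The stage-by-stage partitioning described above maps directly onto DeepSpeed's `zero_optimization` config block. The sketch below is illustrative only: the key names (`zero_optimization`, `stage`, `offload_optimizer`) are standard DeepSpeed config fields, but the helper function and the surrounding values are not taken from ROLL's actual YAML files.

```python
# Minimal sketch of how a ZeRO stage is expressed in a DeepSpeed config dict.
# Stage 1 partitions optimizer states, stage 2 additionally partitions
# gradients, and stage 3 additionally partitions the parameters themselves.

def make_zero_config(stage, offload_optimizer=False):
    """Build a minimal DeepSpeed config dict for a given ZeRO stage."""
    zero = {"stage": stage}
    if offload_optimizer:
        # Push optimizer states to CPU to reduce per-GPU memory further.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": zero,
    }

cfg = make_zero_config(stage=3, offload_optimizer=True)
```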

Usage

Use this environment when training with the deepspeed strategy backend. DeepSpeed is the most broadly compatible training backend, supporting NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU. Choose the ZeRO stage based on model size and available GPU memory.

System Requirements

Category | Requirement                | Notes
---------|----------------------------|--------------------------------------
Hardware | NVIDIA, AMD, or Ascend GPU | Cross-platform support
VRAM     | Depends on ZeRO stage      | ZeRO-3 + CPU offload uses the least VRAM

Dependencies

Python Packages

  • `deepspeed` == 0.16.4
  • `torch` >= 2.6.0
  • All common dependencies from `requirements_common.txt`

Credentials

No additional credentials required beyond the base CUDA/ROCm/NPU environment.

Quick Install

pip install deepspeed==0.16.4

Code Evidence

Optimizer state offloading from `roll/distributed/strategy/deepspeed_strategy.py:456`:

# TODO: The offload option may be integrated into the pipeline config in the future.
is_offload_optimizer_states_in_train_step = data.meta_info.get(
    "is_offload_optimizer_states_in_train_step", True
)

Offload limitation note from `roll/third_party/deepspeed/offload_states_patch.py:183`:

# NOTE: Only supports offloading optimizer parameters (not gradients)

KV cache control from `roll/distributed/strategy/deepspeed_strategy.py:184,228`:

# Training: set use_cache=False to save memory
use_cache=False,

# Inference: set use_cache=True for faster generation
use_cache=True,
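The two snippets above apply the same Hugging Face-style `use_cache` flag in opposite phases. A minimal sketch of that decision (the `forward_kwargs` helper is hypothetical, not part of ROLL):

```python
# Sketch: choose the use_cache flag by phase.
# Under teacher forcing, KV-cache buffers are never reused, so training
# disables them to save memory; autoregressive generation reuses them
# to avoid recomputing past keys and values.

def forward_kwargs(phase):
    """Return model-call keyword args appropriate to the phase."""
    if phase == "train":
        return {"use_cache": False}
    return {"use_cache": True}
```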

Common Errors

Error Message                        | Cause                             | Solution
-------------------------------------|-----------------------------------|------------------------------------------------------------
`CUDA out of memory` during training | ZeRO stage too low for model size | Raise the ZeRO stage (2 -> 3) or enable CPU offloading
Slow training with ZeRO-3            | Parameter gathering overhead      | Use ZeRO-2 if the model fits, or offload only the optimizer states
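To reason about which stage avoids the OOM above, the ZeRO paper's standard accounting is useful: mixed-precision Adam costs roughly 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master copy plus Adam moments; each stage partitions one more of those terms across the data-parallel group. This back-of-the-envelope helper follows that breakdown (it ignores activations and fragmentation, so treat it as a rough lower bound):

```python
def zero_bytes_per_param(stage, world_size):
    """Rough per-GPU bytes per parameter for mixed-precision Adam,
    using the ZeRO paper's 2 + 2 + 12 bytes/param breakdown:
    fp16 params (2), fp16 grads (2), fp32 master + Adam states (12)."""
    n = world_size
    if stage == 0:
        return 2 + 2 + 12          # everything replicated on every rank
    if stage == 1:
        return 2 + 2 + 12 / n      # optimizer states partitioned
    if stage == 2:
        return 2 + (2 + 12) / n    # gradients partitioned too
    return (2 + 2 + 12) / n        # stage 3: parameters as well
```

For example, on 8 GPUs a stage-3 setup needs about 2 bytes/param per GPU versus 16 bytes/param without ZeRO.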

Compatibility Notes

  • Cross-platform: Works on NVIDIA CUDA, AMD ROCm, and Huawei Ascend NPU.
  • ZeRO Stages: Pre-configured YAML files: `deepspeed_zero.yaml`, `deepspeed_zero2.yaml`, `deepspeed_zero3.yaml`, `deepspeed_zero3_cpuoffload.yaml`.
  • LoRA: Compatible with LoRA fine-tuning; check compatibility settings.
  • Offload: Optimizer state offloading enabled by default in train_step.
  • Diffusion: Used as the primary backend for Reward Flow Diffusion pipeline.
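The YAML names listed above suggest a simple selection rule by memory budget. The following heuristic is a sketch: the file names come from the list above, but the decision function itself is invented for illustration and is not ROLL code.

```python
# Sketch: pick a pre-configured ZeRO YAML by memory budget.
# Prefer ZeRO-2 when the model fits (less gather overhead); fall back
# to ZeRO-3, adding CPU offload only when VRAM is still insufficient.

def pick_deepspeed_config(model_fits_zero2, need_cpu_offload):
    if model_fits_zero2:
        return "deepspeed_zero2.yaml"
    if need_cpu_offload:
        return "deepspeed_zero3_cpuoffload.yaml"
    return "deepspeed_zero3.yaml"
```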
