
Environment:Microsoft DeepSpeedExamples RLHF Training Environment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, RLHF, Infrastructure
Last Updated 2026-02-07 13:00 GMT

Overview

Linux environment with PyTorch >= 1.12.0, DeepSpeed >= 0.9.0, and HuggingFace Transformers >= 4.31.0 for multi-GPU RLHF training of language models up to 175B parameters.

Description

This environment provides the full stack required to run the DeepSpeed-Chat three-step RLHF pipeline: Supervised Fine-Tuning (SFT), Reward Model Training, and RLHF fine-tuning with PPO. It supports single-GPU training for small models (OPT-1.3B on A6000), single-node multi-GPU for medium models (OPT-13B on 8xA100-40GB), and multi-node distributed training for large models (OPT-66B on 64xA100-80GB). The environment uses DeepSpeed ZeRO optimization (stages 0-3) and optionally the Hybrid Engine for accelerated generation during RLHF.

Usage

Use this environment for any workflow involving the DeepSpeed-Chat RLHF training pipeline, including supervised fine-tuning (Step 1), reward model training (Step 2), and PPO-based RLHF fine-tuning (Step 3). It is the mandatory prerequisite for the Create_Prompt_Dataset, Create_HF_Model, Create_Critic_Model, DeepSpeedRLHFEngine, DeepSpeedPPOTrainer, and Prompt_Eval implementations.
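The three steps above are normally driven from DeepSpeed-Chat's top-level launcher. As a sketch only, the helper below assembles the launcher's argv; the flag names (`--step`, `--actor-model`, `--reward-model`, `--deployment-type`) follow the DeepSpeed-Chat `train.py` entry point, but verify them against your checkout before relying on them.

```python
# Hedged sketch: build the command line for DeepSpeed-Chat's train.py.
# Flag names are assumptions taken from the upstream launcher; check
# `python train.py --help` in applications/DeepSpeed-Chat to confirm.

def build_train_command(steps, actor_model, reward_model, deployment):
    """Assemble the argv list for the DeepSpeed-Chat top-level launcher."""
    cmd = ["python", "train.py"]
    cmd += ["--step"] + [str(s) for s in steps]          # e.g. 1 2 3 for the full pipeline
    cmd += ["--actor-model", actor_model]                # SFT / PPO actor checkpoint
    cmd += ["--reward-model", reward_model]              # Step-2 reward model checkpoint
    cmd += ["--deployment-type", deployment]             # single_gpu | single_node | multi_node
    return cmd

# Example: full three-step pipeline for OPT-1.3B on a single GPU.
cmd = build_train_command([1, 2, 3], "facebook/opt-1.3b",
                          "facebook/opt-350m", "single_gpu")
print(" ".join(cmd))
```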

System Requirements

  • OS: Linux (Ubuntu 20.04+ recommended); NCCL backend required for distributed training
  • Hardware (Single GPU): NVIDIA A6000 (48GB VRAM); trains OPT-1.3B in ~2.2 hours
  • Hardware (Single Node): 8x NVIDIA A100-40GB; trains OPT-13B in ~13.6 hours
  • Hardware (Multi-Node): 8 DGX nodes with 8x A100-80GB each; trains OPT-66B in under 9 hours
  • CPU: multi-core, required for data preprocessing and distributed coordination
  • Disk: SSD recommended for dataset caching and checkpoint storage
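The hardware sizing above can be sanity-checked with the ZeRO model-state arithmetic: mixed-precision Adam holds roughly 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and each ZeRO stage shards a larger share of that across GPUs. The estimator below is a rough sketch of that arithmetic only; it ignores activations, communication buffers, and fragmentation, which add substantially on top.

```python
# Rough per-GPU model-state memory for mixed-precision Adam under ZeRO.
# Byte counts follow the ZeRO paper's accounting; this is an estimate,
# not a measurement, and excludes activations and buffers.

def zero_memory_per_gpu_gb(n_params, n_gpus, stage):
    """Estimate model-state GiB per GPU for ZeRO stage 0-3."""
    P, G, O = 2.0, 2.0, 12.0  # bytes/param: fp16 params, fp16 grads, fp32 Adam states
    if stage == 0:
        per_param = P + G + O                       # everything replicated
    elif stage == 1:
        per_param = P + G + O / n_gpus              # optimizer states sharded
    elif stage == 2:
        per_param = P + (G + O) / n_gpus            # grads + optimizer sharded
    else:
        per_param = (P + G + O) / n_gpus            # stage 3: everything sharded
    return n_params * per_param / 1024**3

# OPT-13B on 8 GPUs with ZeRO-3: about 24 GiB of model states per GPU,
# which is consistent with the 8x A100-40GB row above once activations
# are added.
print(round(zero_memory_per_gpu_gb(13e9, 8, 3), 1))
```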

Dependencies

System Packages

  • CUDA Toolkit (11.x or 12.x)
  • NCCL (for multi-GPU communication)
  • `deepspeed` launcher or `torch.distributed.launch` (deprecated in recent PyTorch; `torchrun` is the preferred replacement)

Python Packages

  • `torch` >= 1.12.0
  • `deepspeed` >= 0.9.0
  • `transformers` >= 4.31.0, != 4.33.2
  • `datasets` >= 2.8.0
  • `sentencepiece` >= 0.1.97
  • `protobuf` == 3.20.3
  • `accelerate` >= 0.15.0
  • `tensorboard`
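The `transformers` pin above combines a lower bound with an exclusion (>= 4.31.0 but != 4.33.2). A minimal stdlib-only sketch of that gate is shown below; for real projects, `packaging.specifiers.SpecifierSet` handles the full PEP 440 grammar and should be preferred.

```python
# Minimal version gate for the transformers pin ">=4.31.0,!=4.33.2",
# using only the standard library. Assumes plain X.Y.Z version strings;
# packaging.specifiers is the robust choice for anything else.

def parse(v):
    """Turn 'X.Y.Z' into a comparable tuple of ints."""
    return tuple(int(x) for x in v.split("."))

def transformers_ok(version):
    """True if the version satisfies >=4.31.0 and is not the broken 4.33.2."""
    return parse(version) >= (4, 31, 0) and parse(version) != (4, 33, 2)

print(transformers_ok("4.31.0"))  # True
print(transformers_ok("4.33.2"))  # False: known-bad release
print(transformers_ok("4.30.9"))  # False: below the lower bound
```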

Credentials

No specific API credentials required. Model weights are loaded from HuggingFace Hub using public model identifiers (e.g., `facebook/opt-1.3b`, `meta-llama/Llama-2-7b-hf`). If using gated models like Llama-2, a `HF_TOKEN` environment variable may be required for download access.
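For gated checkpoints, the token is typically read from the environment and passed through to `from_pretrained`. The sketch below shows that pattern; recent `transformers` versions also pick up `HF_TOKEN` automatically, but passing it explicitly keeps the dependency visible.

```python
import os

# Hedged sketch: fetch the HuggingFace access token from the environment.
# Public models (e.g. facebook/opt-1.3b) work with token=None; gated ones
# such as meta-llama/Llama-2-7b-hf require a valid token.

def hf_token():
    """Return the HuggingFace token from HF_TOKEN, or None for public models."""
    return os.environ.get("HF_TOKEN")

# Usage (requires network access and transformers; shown for illustration):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=hf_token())
```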

Quick Install

# Install all required packages
pip install "torch>=1.12.0" "deepspeed>=0.9.0" "transformers>=4.31.0,!=4.33.2" \
    "datasets>=2.8.0" "sentencepiece>=0.1.97" "protobuf==3.20.3" \
    "accelerate>=0.15.0" tensorboard

# Install DeepSpeed-Chat package
cd applications/DeepSpeed-Chat && pip install .

Code Evidence

Requirements from `applications/DeepSpeed-Chat/requirements.txt`:

datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
torch>=1.12.0
deepspeed>=0.9.0
transformers>=4.31.0,!=4.33.2
tensorboard

Device detection from `training/cifar/cifar10_deepspeed.py:10`:

from deepspeed.accelerator import get_accelerator

ZeRO-3 configuration from `dschat/utils/ds_utils.py:40-50`:

"zero_optimization": {
    "stage": 3,
    "offload_param": {"device": offload_device},
    "offload_optimizer": {"device": offload_device},
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 3e7,
    "stage3_prefetch_bucket_size": 3e7,
}
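The `zero_optimization` fragment above is only one section of the config dict that DeepSpeed expects. As a sketch, the builder below embeds it in a complete training config, loosely following the shape of `get_train_ds_config` in `dschat/utils/ds_utils.py`; the batch-size, fp16, and clipping values here are illustrative defaults, not the repo's.

```python
# Hedged sketch: wrap the ZeRO-3 fragment in a full DeepSpeed config.
# Field names match DeepSpeed's config schema; the numeric defaults are
# assumptions for illustration.

def make_ds_config(stage=3, offload=False, batch_size=8, micro_batch=4):
    """Build a DeepSpeed training config dict with ZeRO and optional CPU offload."""
    offload_device = "cpu" if offload else "none"
    return {
        "train_batch_size": batch_size,
        "train_micro_batch_size_per_gpu": micro_batch,
        "steps_per_print": 10,
        "zero_optimization": {
            "stage": stage,
            "offload_param": {"device": offload_device},
            "offload_optimizer": {"device": offload_device},
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 3e7,
            "stage3_prefetch_bucket_size": 3e7,
        },
        "fp16": {"enabled": True, "loss_scale_window": 100},
        "gradient_clipping": 1.0,
    }

cfg = make_ds_config(stage=3, offload=True)
print(cfg["zero_optimization"]["offload_param"]["device"])  # cpu
```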

Common Errors

  • `CUDA out of memory`: model too large for available VRAM. Enable ZeRO Stage 3, gradient checkpointing, or use LoRA.
  • `transformers 4.33.2 is incompatible`: known bug in transformers 4.33.2. Install transformers >= 4.31.0 but != 4.33.2.
  • `NCCL error: unhandled system error`: multi-GPU communication failure. Verify the NCCL installation and the network configuration between nodes.
  • `RuntimeError: Expected all tensors on same device`: device mismatch in the distributed setup. Ensure `get_accelerator().set_device(local_rank)` is called before model creation.

Compatibility Notes

  • Single GPU: Supports OPT up to 1.3B (full fine-tuning) or up to 6.7B (with LoRA)
  • Multi-GPU: Required for models larger than 6.7B parameters
  • Llama-2 70B: Supported with ZeRO-Offload but NOT with Hybrid Engine
  • BLOOM models: not fully tested; support relies on community contributions
  • Windows: Not officially supported; use WSL2 or Linux
