Environment:Microsoft DeepSpeedExamples RLHF Training Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, RLHF, Infrastructure |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Linux environment with PyTorch >= 1.12.0, DeepSpeed >= 0.9.0, and HuggingFace Transformers >= 4.31.0 for multi-GPU RLHF training of language models up to 175B parameters.
Description
This environment provides the full stack required to run the DeepSpeed-Chat three-step RLHF pipeline: Supervised Fine-Tuning (SFT), Reward Model Training, and RLHF fine-tuning with PPO. It supports single-GPU training for small models (OPT-1.3B on A6000), single-node multi-GPU for medium models (OPT-13B on 8xA100-40GB), and multi-node distributed training for large models (OPT-66B on 64xA100-80GB). The environment uses DeepSpeed ZeRO optimization (stages 0-3) and optionally the Hybrid Engine for accelerated generation during RLHF.
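The three steps above can be run individually, or end to end via the repo's one-click launcher. A sketch, assuming the launcher script name and flags from the public DeepSpeed-Chat README (`train.py` in the 0.9-era releases, later renamed `e2e_rlhf.py`):

```shell
# Run all three RLHF steps (SFT -> reward model -> PPO) back to back.
# Model identifiers and --deployment-type values follow the README;
# adjust the path to your checkout of DeepSpeedExamples.
cd applications/DeepSpeed-Chat
python train.py \
    --actor-model facebook/opt-1.3b \
    --reward-model facebook/opt-350m \
    --deployment-type single_gpu    # or: single_node, multi_node
```

The `--deployment-type` flag selects the hardware tier described in System Requirements below.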
Usage
Use this environment for any workflow involving the DeepSpeed-Chat RLHF training pipeline, including supervised fine-tuning (Step 1), reward model training (Step 2), and PPO-based RLHF fine-tuning (Step 3). It is the mandatory prerequisite for the Create_Prompt_Dataset, Create_HF_Model, Create_Critic_Model, DeepSpeedRLHFEngine, DeepSpeedPPOTrainer, and Prompt_Eval implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ubuntu 20.04+ recommended; NCCL backend required for distributed training |
| Hardware (Single GPU) | NVIDIA A6000 (48GB VRAM) | Trains OPT-1.3B in ~2.2 hours |
| Hardware (Single Node) | 8x NVIDIA A100-40GB | Trains OPT-13B in ~13.6 hours |
| Hardware (Multi-Node) | 8 DGX nodes with 8x A100-80GB | Trains OPT-66B in <9 hours |
| CPU | Multi-core | Required for data preprocessing and distributed coordination |
| Disk | SSD recommended | For dataset caching and checkpoint storage |
Dependencies
System Packages
- CUDA Toolkit (11.x or 12.x)
- NCCL (for multi-GPU communication)
- `deepspeed` launcher or `torchrun` (successor to the deprecated `torch.distributed.launch`)
Python Packages
- `torch` >= 1.12.0
- `deepspeed` >= 0.9.0
- `transformers` >= 4.31.0, != 4.33.2
- `datasets` >= 2.8.0
- `sentencepiece` >= 0.1.97
- `protobuf` == 3.20.3
- `accelerate` >= 0.15.0
- `tensorboard`
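A quick way to sanity-check the pinned dependencies before launching a run; this uses only the standard library, and the package names match the requirements list above:

```python
# Report installed versions of the required packages and flag the
# known-incompatible transformers 4.33.2 release.
from importlib import metadata

BAD = {"transformers": {"4.33.2"}}  # releases excluded by requirements.txt

def dep_status(pkg: str) -> str:
    try:
        v = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return f"{pkg}: MISSING"
    if v in BAD.get(pkg, set()):
        return f"{pkg}: {v} (known-incompatible)"
    return f"{pkg}: {v}"

for pkg in ["torch", "deepspeed", "transformers", "datasets",
            "sentencepiece", "protobuf", "accelerate", "tensorboard"]:
    print(dep_status(pkg))
```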
Credentials
No specific API credentials required. Model weights are loaded from HuggingFace Hub using public model identifiers (e.g., `facebook/opt-1.3b`, `meta-llama/Llama-2-7b-hf`). If using gated models like Llama-2, a `HF_TOKEN` environment variable may be required for download access.
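An illustrative token lookup for the pattern described above: public checkpoints need no credentials, while gated ones (approximated here by the `meta-llama/` namespace, an assumption for this sketch) require `HF_TOKEN`:

```python
import os
from typing import Optional

def hf_token_for(model_id: str) -> Optional[str]:
    """Return the HF token for gated models, None for public ones."""
    # Assumption: only the meta-llama/ namespace is treated as gated here;
    # other organizations also publish gated models on the Hub.
    if not model_id.startswith("meta-llama/"):
        return None
    token = os.environ.get("HF_TOKEN")
    if token is None:
        raise RuntimeError(f"{model_id} is gated on the Hub; export HF_TOKEN first")
    return token
```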
Quick Install
```shell
# Install all required packages
pip install "torch>=1.12.0" "deepspeed>=0.9.0" "transformers>=4.31.0,!=4.33.2" \
    "datasets>=2.8.0" "sentencepiece>=0.1.97" "protobuf==3.20.3" \
    "accelerate>=0.15.0" tensorboard

# Install the DeepSpeed-Chat package
cd applications/DeepSpeed-Chat && pip install .
```
Code Evidence
Requirements from `applications/DeepSpeed-Chat/requirements.txt`:
```
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
torch>=1.12.0
deepspeed>=0.9.0
transformers>=4.31.0,!=4.33.2
tensorboard
```
Device detection from `training/cifar/cifar10_deepspeed.py:10`:
```python
from deepspeed.accelerator import get_accelerator
```
ZeRO-3 configuration from `dschat/utils/ds_utils.py:40-50`:
"zero_optimization": {
"stage": 3,
"offload_param": {"device": offload_device},
"offload_optimizer": {"device": offload_device},
"stage3_param_persistence_threshold": 1e4,
"stage3_max_live_parameters": 3e7,
"stage3_prefetch_bucket_size": 3e7,
}
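A minimal sketch of assembling a full ZeRO-3 DeepSpeed config around the fragment quoted above. Only the `zero_optimization` block is sourced from `ds_utils.py`; the function name and the surrounding keys (`train_batch_size`, `fp16`) are illustrative assumptions:

```python
# Build a ZeRO-3 config dict; pass offload=True to push parameters and
# optimizer state to CPU (ZeRO-Offload), e.g. for Llama-2 70B.
def make_zero3_config(offload: bool = False) -> dict:
    offload_device = "cpu" if offload else "none"
    return {
        "train_batch_size": 8,        # illustrative value
        "fp16": {"enabled": True},    # illustrative value
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": offload_device},
            "offload_optimizer": {"device": offload_device},
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 3e7,
            "stage3_prefetch_bucket_size": 3e7,
        },
    }
```

The resulting dict is what you would hand to `deepspeed.initialize(config=...)`.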
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model too large for available VRAM | Enable ZeRO Stage 3, gradient checkpointing, or use LoRA |
| `transformers 4.33.2 is incompatible` | Known bug in transformers 4.33.2 | Install transformers >= 4.31.0 but != 4.33.2 |
| `NCCL error: unhandled system error` | Multi-GPU communication failure | Verify NCCL installation and network configuration between nodes |
| `RuntimeError: Expected all tensors on same device` | Device mismatch in distributed setup | Ensure `get_accelerator().set_device(local_rank)` is called before model creation |
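For the NCCL failure above, the standard NCCL environment variables help narrow down the cause; the interface name below is an assumption and must match your actual NIC:

```shell
# Surface NCCL initialization and transport logs on the next run.
export NCCL_DEBUG=INFO
# Pin NCCL to a specific network interface (assumption: eth0; check `ip addr`).
export NCCL_SOCKET_IFNAME=eth0
```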
Compatibility Notes
- Single GPU: Supports OPT up to 1.3B (full fine-tuning) or up to 6.7B (with LoRA)
- Multi-GPU: Required for models larger than 6.7B parameters
- Llama-2 70B: Supported with ZeRO-Offload but NOT with Hybrid Engine
- BLOOM models: Community support needed; not fully tested
- Windows: Not officially supported; use WSL2 or Linux
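The single-GPU thresholds above can be encoded as a small helper for pre-flight checks; the cutoffs (1.3B full fine-tuning, 6.7B with LoRA) come from this page, and the function itself is not part of DeepSpeed-Chat:

```python
# Pick a single-A6000 training strategy from the model size in billions
# of parameters, following the compatibility notes on this page.
def single_gpu_strategy(params_billion: float) -> str:
    if params_billion <= 1.3:
        return "full fine-tuning"
    if params_billion <= 6.7:
        return "LoRA"
    return "multi-GPU required"
```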
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Create_Prompt_Dataset
- Implementation:Microsoft_DeepSpeedExamples_Create_HF_Model
- Implementation:Microsoft_DeepSpeedExamples_Create_Critic_Model
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeedRLHFEngine
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeedPPOTrainer
- Implementation:Microsoft_DeepSpeedExamples_Prompt_Eval