Workflow:Microsoft DeepSpeedExamples RLHF Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, Fine_Tuning, Distributed_Training |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
End-to-end Reinforcement Learning from Human Feedback (RLHF) training pipeline for aligning Large Language Models, following the InstructGPT methodology with three sequential stages: Supervised Fine-Tuning, Reward Model Training, and PPO-based RLHF.
Description
This workflow implements the complete DeepSpeed-Chat RLHF pipeline for training instruction-following language models. It follows the three-step approach introduced in OpenAI's InstructGPT paper.
Goal: A fully aligned language model that follows human instructions, trained through preference-based reinforcement learning.
Scope: Covers the entire pipeline from raw pretrained model to instruction-aligned model, including data preparation, supervised fine-tuning (SFT), reward model training, and Proximal Policy Optimization (PPO) based RLHF.
Strategy: Uses DeepSpeed ZeRO optimization (stages 0-3) to enable training models from 1.3B to 175B parameters. Supports Hybrid Engine for accelerating generation during RLHF, LoRA for parameter-efficient training, and hybrid ZeRO configurations where different models use different ZeRO stages.
Usage
Execute this workflow when you need to align a pretrained language model (such as OPT, LLaMA-2, or BLOOM) to follow human instructions and produce helpful, harmless responses. This is appropriate when you have access to instruction-following datasets and human preference data, and need to produce a chat-capable model from a base pretrained model.
Execution Steps
Step 1: Data Preparation
Prepare and partition datasets for all three training phases. The pipeline uses a unified data system that supports 15+ dataset sources (Dahoas, HH-RLHF, Stanford, OpenAI, etc.) and splits each dataset across the three phases using a configurable ratio (default 2:4:4, which assigns 20% to Phase 1, 40% to Phase 2, and 40% to Phase 3).
Key considerations:
- Each phase requires a different data format: Phase 1 uses instruction-response pairs for causal LM training, Phase 2 uses chosen/rejected response pairs for reward modeling, Phase 3 uses prompts only for generation
- The tokenizer must be configured with proper end-of-conversation tokens
- Data is cached using SHA256 hashes of configuration for efficient reuse across runs
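The split and caching behavior described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names `split_boundaries` and `cache_key`, and the exact fields that feed the hash, are assumptions chosen for the example.

```python
import hashlib


def split_boundaries(data_split: str, num_examples: int) -> list[int]:
    """Turn a ratio string such as "2,4,4" into cumulative index boundaries.

    Phase k (1-based) owns examples boundaries[k-1]:boundaries[k].
    """
    weights = [float(w) for w in data_split.split(",")]
    total = sum(weights)
    boundaries = [0]
    running = 0.0
    for w in weights:
        running += w
        boundaries.append(int(round(running / total * num_examples)))
    boundaries[-1] = num_examples  # guard against rounding drift
    return boundaries


def cache_key(dataset_name: str, data_split: str, phase: int,
              seed: int, tokenizer_name: str, max_seq_len: int) -> str:
    """Hash the data configuration so identical settings reuse one cache file.

    The set of fields is hypothetical; any change to them yields a new key.
    """
    raw = "_".join([dataset_name, data_split, f"phase{phase}",
                    f"seed{seed}", tokenizer_name, f"seqlen{max_seq_len}"])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


bounds = split_boundaries("2,4,4", 1000)
key = cache_key("Dahoas/rm-static", "2,4,4", 1, 1234, "facebook/opt-1.3b", 512)
```

With 1,000 examples and the default ratio, Phase 1 receives examples 0 to 199, Phase 2 receives 200 to 599, and Phase 3 receives 600 to 999.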
Step 2: Supervised Fine-Tuning (SFT)
Fine-tune the base pretrained model on instruction-following data using standard causal language modeling loss. This produces an actor model that can follow instructions but is not yet optimized for quality or safety.
What happens:
- Load pretrained model (e.g., OPT-1.3B, LLaMA-2-7B) with optional 4-bit quantization
- Configure DeepSpeed ZeRO optimization for memory-efficient distributed training
- Train on instruction-response pairs with cosine learning rate scheduling
- Optionally apply LoRA for parameter-efficient fine-tuning
- Evaluate using validation perplexity
- Save the fine-tuned model checkpoint
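Two of the quantities above are easy to state precisely: validation perplexity is the exponential of the mean per-token negative log-likelihood, and a cosine schedule decays the learning rate along half a cosine wave. The sketch below uses only the standard library; the linear warmup and the function names are assumptions for illustration, not the pipeline's scheduler.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


def cosine_lr(step: int, total_steps: int, base_lr: float,
              warmup_steps: int = 0) -> float:
    """Cosine decay from base_lr to 0, with optional linear warmup (assumed)."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# A model that assigns probability 1/4 to every target token has perplexity 4.
ppl = perplexity([math.log(4.0)] * 3)
schedule = [cosine_lr(s, 100, 1e-3) for s in (0, 50, 100)]
```

Lower perplexity on the validation split indicates that the fine-tuned model assigns higher probability to the reference responses.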
Step 3: Reward Model Training
Train a reward model that scores response quality by learning from human preference pairs (chosen vs. rejected responses). This model provides the reward signal for the subsequent RLHF step.
What happens:
- Initialize a reward model by adding a linear value head to a pretrained language model
- Train on paired preference data using a Bradley-Terry ranking loss
- The model learns to assign higher scores to preferred responses
- Evaluate using pairwise accuracy (the fraction of pairs where the chosen score exceeds the rejected score)
- Save the reward model checkpoint for use in Step 4
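The ranking loss and the accuracy metric above can be written down directly. The sketch assumes one scalar score per response; an actual implementation may aggregate token-level scores differently. The function names are hypothetical.

```python
import math


def pairwise_ranking_loss(chosen: list[float], rejected: list[float]) -> float:
    """Mean of -log(sigmoid(chosen - rejected)) over all preference pairs."""
    losses = []
    for c, r in zip(chosen, rejected):
        margin = c - r
        # -log(sigmoid(m)) = softplus(-m), computed in a numerically stable way.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def pairwise_accuracy(chosen: list[float], rejected: list[float]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = sum(1 for c, r in zip(chosen, rejected) if c > r)
    return correct / len(chosen)


chosen_scores = [2.0, 0.0, 1.0]
rejected_scores = [1.0, 1.0, 0.0]
loss = pairwise_ranking_loss(chosen_scores, rejected_scores)
acc = pairwise_accuracy(chosen_scores, rejected_scores)
```

When the two scores are equal the per-pair loss is log 2, and it shrinks toward zero as the chosen score pulls ahead of the rejected one.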
Step 4: RLHF Engine Initialization
Initialize the four-model RLHF engine that manages the actor, critic, reward, and reference models simultaneously. Each model can use a different ZeRO optimization stage for optimal memory management.
What happens:
- Load the SFT model as the actor (trainable) and reference model (frozen copy for KL divergence)
- Load the reward model from Step 3 (frozen, provides reward signals)
- Initialize the critic model with a value head (trainable, estimates per-token values from which advantages are computed)
- Configure hybrid ZeRO stages per model (e.g., ZeRO-3 for actor, ZeRO-0 for reward)
- Optionally enable Hybrid Engine for accelerated generation during training
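One way to picture the per-model configuration is a small helper that builds a separate DeepSpeed config dictionary for each of the four models. The helper `make_ds_config`, the batch size, and the chosen stages are assumptions for illustration; the dictionary keys themselves (`zero_optimization`, `train_micro_batch_size_per_gpu`, `fp16`, `gradient_clipping`, `hybrid_engine`) are standard DeepSpeed config fields.

```python
def make_ds_config(zero_stage: int, micro_batch_size: int, trainable: bool,
                   enable_hybrid_engine: bool = False) -> dict:
    """Build a minimal DeepSpeed config dict for one model (hypothetical helper)."""
    config = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "zero_optimization": {"stage": zero_stage},
        "fp16": {"enabled": True},
    }
    if trainable:
        # Only models that receive gradient updates need clipping.
        config["gradient_clipping"] = 1.0
    if enable_hybrid_engine:
        # Hybrid Engine speeds up generation, so it only applies to the actor.
        config["hybrid_engine"] = {"enabled": True}
    return config


engine_configs = {
    "actor": make_ds_config(3, 4, trainable=True, enable_hybrid_engine=True),
    "reference": make_ds_config(3, 4, trainable=False),
    "critic": make_ds_config(3, 4, trainable=True),
    "reward": make_ds_config(0, 4, trainable=False),
}
```

Frozen models hold no optimizer state or gradients, so ZeRO stages 1 and 2 give them nothing to partition; in practice the useful choices for the reference and reward models are stage 0 (replicate) or stage 3 (shard parameters).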
Step 5: PPO Training Loop
Execute the Proximal Policy Optimization training loop that iteratively generates responses, computes rewards, and updates the actor and critic models.
What happens:
- Actor generates completions for training prompts using the current policy
- Reward model scores the generated responses
- Combine the reward signal with a per-token KL divergence penalty against the reference model (weight 0.1) to prevent policy drift
- Compute advantages from the penalized rewards using Generalized Advantage Estimation (GAE) with gamma=1.0 and lambda=0.95
- Update actor using PPO clipped objective (clip range 0.2)
- Update critic using value function MSE loss with clipped updates
- Optionally mix in unsupervised language modeling loss for stability
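The arithmetic behind one PPO update can be sketched for a single generated sequence with plain Python lists. The hyperparameters 0.1, 1.0, 0.95, and 0.2 (actor clip) come from the steps above; the reward clipping value of 5.0, the critic clip range of 0.2, and all function names are assumptions made for the example.

```python
import math


def kl_penalized_rewards(log_probs, ref_log_probs, reward_score,
                         kl_coef=0.1, clip_reward=5.0):
    """Per-token KL penalty, with the (clipped) sequence reward added at the last token."""
    rewards = [-kl_coef * (lp - rlp) for lp, rlp in zip(log_probs, ref_log_probs)]
    rewards[-1] += max(-clip_reward, min(clip_reward, reward_score))
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation, computed backward over the sequence."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def actor_loss(log_probs, old_log_probs, advantages, clip=0.2):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    losses = []
    for lp, olp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - olp)
        clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)


def critic_loss(values, old_values, returns, clip=0.2):
    """Clipped value loss: half the mean of the larger squared error."""
    losses = []
    for v, ov, ret in zip(values, old_values, returns):
        v_clipped = max(ov - clip, min(ov + clip, v))
        losses.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return 0.5 * sum(losses) / len(losses)


advantages, returns = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
```

With gamma=1.0 the only discounting comes from lambda, so a reward that arrives at the final token is propagated back to earlier tokens with weights 0.95, 0.9025, and so on.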
Step 6: Model Evaluation and Export
Evaluate the trained model's quality and save the final aligned model for deployment.
What happens:
- Run the aligned model on evaluation prompts to assess instruction-following quality
- Compare outputs against the baseline SFT model to measure alignment improvement
- Save the final actor model checkpoint
- Optionally export model for inference serving