Workflow:Microsoft DeepSpeedExamples RLHF Training Pipeline

Knowledge Sources	DeepSpeedExamples, DeepSpeed Docs, DeepSpeed-Chat Blog
Domains	LLMs, RLHF, Fine_Tuning, Distributed_Training
Last Updated	2026-02-07 13:00 GMT

Overview

End-to-end Reinforcement Learning from Human Feedback (RLHF) training pipeline for aligning Large Language Models, following the InstructGPT methodology with three sequential stages: Supervised Fine-Tuning, Reward Model Training, and PPO-based RLHF.

Description

This workflow implements the complete DeepSpeed-Chat RLHF pipeline for training instruction-following language models. It follows the three-step approach introduced by OpenAI's InstructGPT paper: supervised fine-tuning, reward model training, and PPO-based reinforcement learning.

Goal: A fully aligned language model that follows human instructions, trained through preference-based reinforcement learning.

Scope: Covers the entire pipeline from raw pretrained model to instruction-aligned model, including data preparation, supervised fine-tuning (SFT), reward model training, and Proximal Policy Optimization (PPO) based RLHF.

Strategy: Uses DeepSpeed ZeRO optimization (stages 0-3) to enable training models from 1.3B to 175B parameters. Supports Hybrid Engine for accelerating generation during RLHF, LoRA for parameter-efficient training, and hybrid ZeRO configurations where different models use different ZeRO stages.

Usage

Execute this workflow when you need to align a pretrained language model (such as OPT, LLaMA-2, or BLOOM) to follow human instructions and produce helpful, harmless responses. This is appropriate when you have access to instruction-following datasets and human preference data, and need to produce a chat-capable model from a base pretrained model.

Execution Steps

Step 1: Data Preparation

Prepare and partition datasets for all three training phases. The pipeline uses a unified data system that supports 15+ dataset sources (Dahoas, HH-RLHF, Stanford, OpenAI, etc.) and splits each dataset across the three phases using a configurable ratio (default 2:4:4, i.e. 20%/40%/40% for Phases 1, 2, and 3).
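The ratio arithmetic can be sketched as follows. This is a minimal illustration, not the pipeline's own splitting code: the function name `split_boundaries` and the comma-separated ratio string are assumptions made for the example.

```python
def split_boundaries(data_split: str, n_examples: int) -> list[tuple[int, int]]:
    """Turn a ratio string such as "2,4,4" into (start, end) index ranges.

    Hypothetical helper: each phase receives a contiguous slice whose size
    is proportional to its weight in the ratio.
    """
    weights = [float(w) for w in data_split.split(",")]
    total = sum(weights)
    bounds = []
    start = 0
    cumulative = 0.0
    for w in weights:
        cumulative += w
        # Use the cumulative share so the last slice always ends at n_examples.
        end = round(n_examples * cumulative / total)
        bounds.append((start, end))
        start = end
    return bounds


# With 1000 examples, 2:4:4 gives 200 / 400 / 400 examples per phase.
ranges = split_boundaries("2,4,4", 1000)
```

Working from the cumulative share rather than rounding each slice independently keeps the slices contiguous and non-overlapping even when the dataset size is not divisible by the ratio total.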

Key considerations:

Each phase requires a different data format: Phase 1 uses instruction-response pairs for causal LM training; Phase 2 uses chosen/rejected response pairs for reward modeling; Phase 3 uses prompts only, for generation
The tokenizer must be configured with proper end-of-conversation tokens
Data is cached using SHA256 hashes of configuration for efficient reuse across runs
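The considerations above can be sketched in a few lines. Everything here is illustrative: the function names, the record field names, the set of settings that feed the hash, and the `<|endoftext|>` end-of-conversation token are assumptions, not the pipeline's actual schema.

```python
import hashlib


def cache_key(dataset_names, data_split, phase, seed, tokenizer_name, max_seq_len):
    """Hash the settings that determine the tokenized output (hypothetical field set)."""
    parts = [
        "_".join(dataset_names),
        data_split,
        f"phase{phase}",
        f"seed{seed}",
        tokenizer_name,
        f"seqlen{max_seq_len}",
    ]
    return hashlib.sha256("_".join(parts).encode("utf-8")).hexdigest()


def format_example(prompt, chosen, rejected, phase, eos="<|endoftext|>"):
    """Build the per-phase record from one raw preference example."""
    if phase == 1:
        # SFT: prompt plus the preferred response, as one causal LM sequence.
        return {"text": prompt + chosen + eos}
    if phase == 2:
        # Reward modeling: both responses, each paired with the same prompt.
        return {"chosen": prompt + chosen + eos, "rejected": prompt + rejected + eos}
    if phase == 3:
        # RLHF: prompt only; the actor generates the response.
        return {"prompt": prompt}
    raise ValueError(f"unknown phase: {phase}")


key = cache_key(["Dahoas/rm-static"], "2,4,4", 1, 1234, "facebook/opt-1.3b", 512)
```

Because the key is derived from the configuration, changing any setting (for example the phase or the sequence length) yields a different cache entry, while repeating a run with identical settings reuses the existing one.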

Step 2: Supervised Fine-Tuning (SFT)

Fine-tune the base pretrained model on instruction-following data using standard causal language modeling loss. This produces an actor model that can follow instructions but is not yet optimized for quality or safety.

What happens:

Load pretrained model (e.g., OPT-1.3B, LLaMA-2-7B) with optional 4-bit quantization
Configure DeepSpeed ZeRO optimization for memory-efficient distributed training
Train on instruction-response pairs with cosine learning rate scheduling
Optionally apply LoRA for parameter-efficient fine-tuning
Evaluate using validation perplexity
Save the fine-tuned model checkpoint
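Two of the quantities above have simple closed forms, sketched here in plain Python. The function names and the optional linear warmup are assumptions for illustration; the training script may compute these differently.

```python
import math


def perplexity(token_losses):
    """Perplexity is the exponential of the mean per-token cross-entropy loss."""
    return math.exp(sum(token_losses) / len(token_losses))


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine decay from base_lr to 0, with an optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

A lower validation perplexity means the model assigns higher probability to the held-out responses; a perplexity of 1 would correspond to zero loss.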

Step 3: Reward Model Training

Train a reward model that scores response quality by learning from human preference pairs (chosen vs. rejected responses). This model provides the reward signal for the subsequent RLHF step.

What happens:

Initialize a reward model by adding a linear value head to a pretrained language model
Train on paired preference data using a Bradley-Terry ranking loss
The model learns to assign higher scores to preferred responses
Evaluate using pairwise accuracy (the fraction of pairs in which the chosen score exceeds the rejected score)
Save the reward model checkpoint for use in Step 4
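The ranking loss and accuracy metric can be sketched on scalar scores as follows. This is a simplified illustration that assumes one score per response; the function names are hypothetical, and the actual reward model may aggregate per-token scores before applying the loss.

```python
import math


def pairwise_ranking_loss(chosen_score, rejected_score):
    """Bradley-Terry loss: -log(sigmoid(chosen - rejected)).

    Written as softplus(-(chosen - rejected)) so large score gaps
    do not overflow.
    """
    diff = chosen_score - rejected_score
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs in which the chosen response scores higher."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)
```

The loss is log(2) when both responses receive the same score and approaches zero as the chosen response's margin over the rejected one grows.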

Step 4: RLHF Engine Initialization

Initialize the four-model RLHF engine that manages the actor, critic, reward, and reference models simultaneously. Each model can use a different ZeRO optimization stage for optimal memory management.

What happens:

Load the SFT model as the actor (trainable) and reference model (frozen copy for KL divergence)
Load the reward model from Step 3 (frozen, provides reward signals)
Initialize the critic model with a value head (trainable, estimates the per-token values used to compute advantages)
Configure hybrid ZeRO stages per model (e.g., ZeRO-3 for actor, ZeRO-0 for reward)
Optionally enable Hybrid Engine for accelerated generation during training
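A per-model configuration might look like the sketch below. The helper `make_ds_config` is hypothetical, and the stages chosen for the reference and critic models are assumptions; only the actor (ZeRO-3) and reward (ZeRO-0) stages come from the example above. The dictionary keys are standard DeepSpeed configuration fields.

```python
def make_ds_config(zero_stage, micro_batch_size=4, hybrid_engine=False):
    """Build a minimal DeepSpeed config dict for one of the four models."""
    config = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }
    if hybrid_engine:
        # Only the actor generates text, so only it needs the Hybrid Engine.
        config["hybrid_engine"] = {"enabled": True}
    return config


engine_configs = {
    "actor": make_ds_config(3, hybrid_engine=True),  # trainable, generates
    "reference": make_ds_config(3),                  # frozen copy (stage assumed)
    "critic": make_ds_config(3),                     # trainable (stage assumed)
    "reward": make_ds_config(0),                     # frozen, inference only
}
```

Giving each model its own configuration lets the frozen models skip the partitioning overhead that only pays off when optimizer states and gradients are present.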

Step 5: PPO Training Loop

Execute the Proximal Policy Optimization training loop that iteratively generates responses, computes rewards, and updates the actor and critic models.

What happens:

Actor generates completions for training prompts using the current policy
Reward model scores the generated responses
Compute advantages using Generalized Advantage Estimation (GAE) with gamma=1.0 and lambda=0.95
Combine reward signal with KL divergence penalty (weight 0.1) to prevent policy drift
Update actor using PPO clipped objective (clip range 0.2)
Update critic using value function MSE loss with clipped updates
Optionally mix in unsupervised language modeling loss for stability

Step 6: Model Evaluation and Export

Evaluate the trained model's quality and save the final aligned model for deployment.

What happens:

Run the aligned model on evaluation prompts to assess instruction-following quality
Compare outputs against the baseline SFT model to measure alignment improvement
Save the final actor model checkpoint
Optionally export model for inference serving
