Workflow:Microsoft DeepSpeedExamples RLHF Training Pipeline

From Leeroopedia


Knowledge Sources
Domains: LLMs, RLHF, Fine_Tuning, Distributed_Training
Last Updated: 2026-02-07 13:00 GMT

Overview

End-to-end Reinforcement Learning from Human Feedback (RLHF) training pipeline for aligning Large Language Models, following the InstructGPT methodology with three sequential stages: Supervised Fine-Tuning, Reward Model Training, and PPO-based RLHF.

Description

This workflow implements the complete DeepSpeed-Chat RLHF pipeline for training instruction-following language models. It follows the three-step approach introduced by OpenAI's InstructGPT paper: supervised fine-tuning, reward model training, and PPO-based RLHF.

Goal: A fully aligned language model that follows human instructions, trained through preference-based reinforcement learning.

Scope: Covers the entire pipeline from raw pretrained model to instruction-aligned model, including data preparation, supervised fine-tuning (SFT), reward model training, and Proximal Policy Optimization (PPO) based RLHF.

Strategy: Uses DeepSpeed ZeRO optimization (stages 0-3) to enable training models from 1.3B to 175B parameters. Supports Hybrid Engine for accelerating generation during RLHF, LoRA for parameter-efficient training, and hybrid ZeRO configurations where different models use different ZeRO stages.

Usage

Execute this workflow when you need to align a pretrained language model (such as OPT, LLaMA-2, or BLOOM) to follow human instructions and produce helpful, harmless responses. This is appropriate when you have access to instruction-following datasets and human preference data, and need to produce a chat-capable model from a base pretrained model.

Execution Steps

Step 1: Data Preparation

Prepare and partition datasets for all three training phases. The pipeline uses a unified data system that supports 15+ dataset sources (Dahoas, HH-RLHF, Stanford, OpenAI, etc.) and splits each dataset across the three phases using a configurable ratio (default 2:4:4, i.e. 20%/40%/40% for Phases 1, 2, and 3).
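A minimal sketch of how a split ratio could be turned into per-phase example counts. The function name `split_sizes` and the rounding strategy are assumptions made for illustration, not the pipeline's own code.

```python
def split_sizes(split_ratio, num_examples):
    """Return how many examples each phase receives for a ratio such as (2, 4, 4).

    Hypothetical helper: cumulative boundaries are rounded so that the
    per-phase counts always add up to num_examples.
    """
    total = sum(split_ratio)
    bounds = [0]
    running = 0
    for part in split_ratio:
        running += part
        bounds.append(round(num_examples * running / total))
    return [bounds[i + 1] - bounds[i] for i in range(len(split_ratio))]


# With the default 2:4:4 ratio, 1000 examples are divided among Phases 1, 2, 3.
sizes = split_sizes((2, 4, 4), 1000)
print(sizes)
```

Rounding the cumulative boundaries rather than each share separately keeps every example assigned to exactly one phase.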

Key considerations:

  • Each phase requires a different data format: Phase 1 uses instruction-response pairs for causal LM training, Phase 2 uses chosen/rejected response pairs for reward modeling, Phase 3 uses prompts only for generation
  • The tokenizer must be configured with proper end-of-conversation tokens
  • Data is cached using SHA256 hashes of configuration for efficient reuse across runs
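The considerations above can be sketched as follows. This is an illustration under stated assumptions: the record fields (`prompt`, `chosen`, `rejected`), the end-of-conversation token string, and the helper names `format_for_phase` and `cache_key` are hypothetical, and the exact set of fields hashed by the real pipeline may differ.

```python
import hashlib

# Assumed end-of-conversation marker; the real token depends on the tokenizer setup.
END_OF_CONVERSATION = "<|endoftext|>"


def format_for_phase(record, phase):
    """Shape one preference record for a given training phase (hypothetical helper)."""
    prompt, chosen, rejected = record["prompt"], record["chosen"], record["rejected"]
    if phase == 1:
        # SFT: a single instruction-response string for causal LM training.
        return prompt + chosen + END_OF_CONVERSATION
    if phase == 2:
        # Reward modeling: a (chosen, rejected) pair sharing the same prompt.
        return (prompt + chosen + END_OF_CONVERSATION,
                prompt + rejected + END_OF_CONVERSATION)
    if phase == 3:
        # RLHF: the prompt only; the actor generates the response.
        return prompt
    raise ValueError(f"unknown phase: {phase}")


def cache_key(dataset_names, split_ratio, phase, seed, tokenizer_name, max_seq_len):
    """Build a SHA256 cache key from the data configuration (hypothetical field set)."""
    parts = [
        "_".join(dataset_names),
        ",".join(str(r) for r in split_ratio),
        f"phase{phase}",
        f"seed{seed}",
        tokenizer_name.replace("/", "_"),
        f"seqlen{max_seq_len}",
    ]
    return hashlib.sha256("_".join(parts).encode("utf-8")).hexdigest()


record = {"prompt": "Human: Hi\n\nAssistant:", "chosen": " Hello!", "rejected": " Go away."}
key = cache_key(["Dahoas/rm-static"], (2, 4, 4), 1, 1234, "facebook/opt-1.3b", 512)
```

Because the key is derived only from the configuration, a rerun with identical settings can reuse the cached tensors, while changing any setting produces a different key.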

Step 2: Supervised Fine-Tuning (SFT)

Fine-tune the base pretrained model on instruction-following data using standard causal language modeling loss. This produces an actor model that can follow instructions but is not yet optimized for quality or safety.

What happens:

  • Load pretrained model (e.g., OPT-1.3B, LLaMA-2-7B) with optional 4-bit quantization
  • Configure DeepSpeed ZeRO optimization for memory-efficient distributed training
  • Train on instruction-response pairs with cosine learning rate scheduling
  • Optionally apply LoRA for parameter-efficient fine-tuning
  • Evaluate using validation perplexity
  • Save the fine-tuned model checkpoint
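Two of the quantities above, the cosine learning rate schedule and validation perplexity, can be illustrated in plain Python. This is a sketch, not the training script: the linear warmup, the `min_lr` floor, and the function names are assumptions added for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Cosine-decayed learning rate with optional linear warmup (assumed variant)."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def perplexity(token_losses):
    """Perplexity is the exponential of the mean per-token cross-entropy loss."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, 100, 1e-3) for s in (0, 50, 100)]
```

Lower validation perplexity means the model assigns higher probability to the held-out responses.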

Step 3: Reward Model Training

Train a reward model that scores response quality by learning from human preference pairs (chosen vs. rejected responses). This model provides the reward signal for the subsequent RLHF step.

What happens:

  • Initialize a reward model by adding a linear value head to a pretrained language model
  • Train on paired preference data using a Bradley-Terry ranking loss
  • The model learns to assign higher scores to preferred responses
  • Evaluate using pairwise accuracy (the fraction of pairs where the chosen score exceeds the rejected score)
  • Save the reward model checkpoint for use in Step 4
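The ranking loss and accuracy metric described above can be sketched with one scalar score per response. This is a simplification for illustration: the function names are hypothetical, and an actual reward model implementation may compare scores at the token level rather than using a single number per sequence.

```python
import math


def pairwise_ranking_loss(chosen_scores, rejected_scores):
    """Bradley-Terry loss: mean of -log(sigmoid(chosen - rejected)) over pairs."""
    losses = []
    for chosen, rejected in zip(chosen_scores, rejected_scores):
        margin = chosen - rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs in which the chosen response outscores the rejected one."""
    pairs = list(zip(chosen_scores, rejected_scores))
    return sum(1 for chosen, rejected in pairs if chosen > rejected) / len(pairs)


chosen = [2.0, 0.0, 1.0]
rejected = [1.0, 1.0, 0.0]
loss = pairwise_ranking_loss(chosen, rejected)
accuracy = pairwise_accuracy(chosen, rejected)
```

The loss shrinks as the margin between chosen and rejected scores grows, which is what pushes the model to rank preferred responses higher.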

Step 4: RLHF Engine Initialization

Initialize the four-model RLHF engine that manages the actor, critic, reward, and reference models simultaneously. Each model can use a different ZeRO optimization stage for optimal memory management.

What happens:

  • Load the SFT model as the actor (trainable) and reference model (frozen copy for KL divergence)
  • Load the reward model from Step 3 (frozen, provides reward signals)
  • Initialize the critic model with a value head (trainable, estimates the per-token values used to compute advantages)
  • Configure hybrid ZeRO stages per model (e.g., ZeRO-3 for actor, ZeRO-0 for reward)
  • Optionally enable Hybrid Engine for accelerated generation during training
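A hedged sketch of how per-model DeepSpeed configurations might be laid out for the four models. The helper `make_ds_config`, the batch size, and the stage choices for the critic and reference models are assumptions for illustration; only the ZeRO-3 actor and ZeRO-0 reward pairing comes from the example above. The dictionary keys follow the DeepSpeed JSON configuration format.

```python
def make_ds_config(zero_stage, micro_batch_size=4, hybrid_engine=False):
    """Build a small DeepSpeed-style config dict (hypothetical helper)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": {"stage": zero_stage},
        "fp16": {"enabled": True},
        "gradient_clipping": 1.0,
        # Hybrid Engine only matters for the actor, which generates during training.
        "hybrid_engine": {"enabled": hybrid_engine},
    }


# One config per model; trainable flags mirror the roles listed above.
rlhf_models = {
    "actor":     {"trainable": True,  "ds_config": make_ds_config(3, hybrid_engine=True)},
    "reference": {"trainable": False, "ds_config": make_ds_config(3)},  # assumed stage
    "critic":    {"trainable": True,  "ds_config": make_ds_config(3)},  # assumed stage
    "reward":    {"trainable": False, "ds_config": make_ds_config(0)},
}

for name, spec in rlhf_models.items():
    stage = spec["ds_config"]["zero_optimization"]["stage"]
    print(f"{name}: ZeRO-{stage}, trainable={spec['trainable']}")
```

Keeping one config per model is what makes the hybrid setup possible: the large trainable models can shard their states while a small frozen model avoids the sharding overhead.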

Step 5: PPO Training Loop

Execute the Proximal Policy Optimization training loop that iteratively generates responses, computes rewards, and updates the actor and critic models.

What happens:

  • Actor generates completions for training prompts using the current policy
  • Reward model scores the generated responses
  • Compute advantages using Generalized Advantage Estimation (GAE) with gamma=1.0 and lambda=0.95
  • Combine reward signal with KL divergence penalty (weight 0.1) to prevent policy drift
  • Update actor using PPO clipped objective (clip range 0.2)
  • Update critic using value function MSE loss with clipped updates
  • Optionally mix in unsupervised language modeling loss for stability
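The core arithmetic of the loop above can be sketched in plain Python for a single generated sequence. The constants come from the steps listed above; the function names, the choice to add the reward score on the final token, and the reuse of the 0.2 clip range for the value loss are assumptions made for illustration.

```python
import math

GAMMA, LAM = 1.0, 0.95   # GAE discount and lambda
KL_COEF = 0.1            # weight of the KL penalty
CLIP = 0.2               # clip range (assumed to apply to both actor and critic)


def kl_penalized_rewards(logprobs, ref_logprobs, reward_score):
    """Per-token KL penalty, with the sequence-level score added on the last token."""
    rewards = [-KL_COEF * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


def gae(rewards, values):
    """Generalized Advantage Estimation, computed backwards over the sequence."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + GAMMA * next_value - values[t]
        last = delta + GAMMA * LAM * last
        advantages[t] = last
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def actor_loss(logprobs, old_logprobs, advantages):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    total = 0.0
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        ratio = math.exp(lp - old_lp)
        clipped = min(max(ratio, 1.0 - CLIP), 1.0 + CLIP)
        total += max(-adv * ratio, -adv * clipped)
    return total / len(advantages)


def critic_loss(values, old_values, returns):
    """Squared-error value loss with the new values clipped around the old ones."""
    total = 0.0
    for v, old_v, ret in zip(values, old_values, returns):
        v_clipped = min(max(v, old_v - CLIP), old_v + CLIP)
        total += 0.5 * max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return total / len(returns)


rewards = kl_penalized_rewards([-1.0, -2.0], [-1.0, -1.5], reward_score=3.0)
advantages, returns = gae(rewards, values=[0.5, 1.0])
```

Taking the maximum of the clipped and unclipped terms makes both losses pessimistic, so a single update cannot move the policy or the value estimate far from the ones that produced the rollout.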

Step 6: Model Evaluation and Export

Evaluate the trained model's quality and save the final aligned model for deployment.

What happens:

  • Run the aligned model on evaluation prompts to assess instruction-following quality
  • Compare outputs against the baseline SFT model to measure alignment improvement
  • Save the final actor model checkpoint
  • Optionally export model for inference serving

Execution Diagram

GitHub URL

Workflow Repository