Heuristic:Microsoft DeepSpeedExamples RLHF Hyperparameter Guide
| Knowledge Sources | |
|---|---|
| Domains | RLHF, Optimization, LLMs |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Step-specific hyperparameter recommendations for the three-step DeepSpeed-Chat RLHF pipeline: SFT uses 16 epochs with dropout enabled and weight decay disabled; reward model training uses 1 epoch with weight decay enabled and dropout disabled; RLHF PPO fine-tuning uses 1 epoch with EMA checkpointing enabled.
Description
The DeepSpeed-Chat RLHF pipeline has counter-intuitive hyperparameter requirements that differ across the three training steps. Each step has distinct optimal settings for weight decay, dropout, epoch count, and learning rate. These settings were empirically validated by the DeepSpeed team and diverge from common transformer training defaults. Incorrectly applying uniform hyperparameters across steps leads to degraded model quality.
Usage
Apply these settings when configuring any of the three RLHF training steps. These are particularly important when adapting training scripts for new model families (e.g., moving from OPT to LLaMA). The per-step settings override general best practices for transformer training.
The Insight (Rule of Thumb)
Step 1 - Supervised Fine-Tuning (SFT):
- Weight Decay: 0.0 (disabled)
- Dropout: Enabled
- Epochs: 16 (despite PPL plateauing at 1-2 epochs, longer training improves generation quality)
- Datasets: Use Dahoas/rm-static, Dahoas/full-hh-rlhf, Dahoas/synthetic-instruct-gptj-pairwise, yitingxie/rlhf-reward-datasets
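The Step 1 settings above can be collected into a config sketch. This is a minimal illustration, not the exact DeepSpeed-Chat CLI flags; the key names are assumptions, while the values come from the list above.

```python
# Illustrative Step 1 (SFT) hyperparameters. Key names are assumptions,
# not the actual DeepSpeed-Chat argument names; values follow the guide.
sft_config = {
    "weight_decay": 0.0,       # disabled: preserve pretrained knowledge
    "disable_dropout": False,  # dropout stays on to curb memorization
    "num_train_epochs": 16,    # well past the 1-2 epoch PPL plateau
    "datasets": [
        "Dahoas/rm-static",
        "Dahoas/full-hh-rlhf",
        "Dahoas/synthetic-instruct-gptj-pairwise",
        "yitingxie/rlhf-reward-datasets",
    ],
}
```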
Step 2 - Reward Model Training:
- Weight Decay: 0.1 (enabled)
- Dropout: Disabled
- Epochs: 1 (following InstructGPT; overfitting does not help the reward model)
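The Step 2 settings can be sketched the same way (again, key names are illustrative assumptions; values follow the list above):

```python
def reward_model_config():
    """Illustrative Step 2 (reward model) hyperparameters; key names
    are assumptions, not the actual DeepSpeed-Chat arguments."""
    return {
        "weight_decay": 0.1,      # regularize to generalize across pairs
        "disable_dropout": True,  # reward scores should be stable
        "num_train_epochs": 1,    # InstructGPT: overfitting does not help
    }
```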
Step 3 - RLHF PPO Fine-Tuning:
- Weight Decay: 0.0 (disabled for both actor and critic)
- Dropout: Disabled for actor, enabled for critic
- Epochs: 1 (reward score plateaus quickly)
- EMA Checkpoint: Enable for better generation quality
- Actor Learning Rate: 9.65e-6
- Critic Learning Rate: 5e-6 (roughly 2:1 actor-to-critic ratio)
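A sketch of the Step 3 settings: the two learning rates come from `run_1.3b.sh`, while the key names are illustrative assumptions. The ratio check makes the "roughly 2:1" claim concrete.

```python
# Illustrative Step 3 (PPO) hyperparameters. Learning rates match
# run_1.3b.sh; key names are assumptions, not the actual CLI flags.
actor_lr = 9.65e-6
critic_lr = 5e-6
ppo_config = {
    "actor_learning_rate": actor_lr,
    "critic_learning_rate": critic_lr,
    "actor_weight_decay": 0.0,        # disabled for both models
    "critic_weight_decay": 0.0,
    "disable_actor_dropout": True,
    "disable_critic_dropout": False,  # critic keeps dropout
    "enable_ema": True,
    "num_train_epochs": 1,            # reward score plateaus quickly
}
ratio = actor_lr / critic_lr  # roughly 2:1 (1.93x)
```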
Trade-off: Step 1 requires far more epochs than typical SFT recipes, increasing training cost, but this directly improves downstream RLHF quality.
Reasoning
The counter-intuitive settings are grounded in the distinct objectives of each step:
Step 1 (SFT): PPL (perplexity) is a poor proxy for generation quality. The model continues to improve its ability to generate coherent responses well past the point where PPL plateaus. Disabling weight decay preserves the pretrained knowledge while dropout prevents memorization.
Step 2 (RM): The reward model must generalize across response pairs. Weight decay acts as a regularizer, while a single epoch prevents overfitting to the training distribution (InstructGPT finding). Dropout is disabled because the reward model needs stable outputs.
Step 3 (RLHF): PPO training is inherently unstable. Disabling weight decay on the actor preserves the SFT capabilities, while EMA provides a smoothed checkpoint that reduces variance. The roughly 1.9x higher actor learning rate is an empirically tuned value; the actor's policy-gradient updates are driven by advantages computed from the critic's value estimates, so the two learning rates must be balanced against each other rather than set equal.
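The EMA checkpoint mentioned above is a standard exponential moving average over model parameters. A minimal pure-Python sketch (the decay value is illustrative, not necessarily the repo default; the real implementation operates on model state dicts):

```python
def ema_update(ema_params, params, decay=0.992):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.
    decay=0.992 is an illustrative value, not the repo's default."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# The EMA copy drifts slowly toward the live weights, smoothing out
# the step-to-step noise of PPO updates.
ema = [0.0, 1.0]
live = [1.0, 1.0]
ema = ema_update(ema, live)  # close to [0.008, 1.0]
```

Generating from the EMA weights rather than the latest PPO iterate is what yields the better generation quality noted in the insight above.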
Code Evidence:
Learning rate configuration from `training/step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh:33-34`:

```shell
Actor_Lr=9.65e-6
Critic_Lr=5e-6
```
EMA and sequence settings from `run_1.3b.sh:46-47,59`:

```shell
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--enable_ema \
```
Performance calculation factor from `dschat/utils/perf.py:19`:

```python
checkpoint_activations_factor = 4 if args.gradient_checkpointing else 3
```
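The factor above reflects the cost of activation checkpointing: a training step normally costs about 3x a forward pass (one forward plus a roughly 2x backward), and checkpointing adds one extra forward recompute, making it 4x. A self-contained sketch of that logic (the surrounding FLOPs formula in `perf.py` is omitted):

```python
def checkpoint_activations_factor(gradient_checkpointing: bool) -> int:
    # forward (1x) + backward (~2x) = ~3x a forward pass; activation
    # checkpointing recomputes the forward during backward, giving ~4x.
    return 4 if gradient_checkpointing else 3

# Enabling gradient checkpointing raises estimated compute by 4/3,
# traded against the activation memory it saves.
overhead = checkpoint_activations_factor(True) / checkpoint_activations_factor(False)
```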
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Create_HF_Model
- Implementation:Microsoft_DeepSpeedExamples_Create_Critic_Model
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeedPPOTrainer
- Principle:Microsoft_DeepSpeedExamples_Supervised_Fine_Tuning
- Principle:Microsoft_DeepSpeedExamples_Reward_Model_Training
- Principle:Microsoft_DeepSpeedExamples_PPO_Training