Heuristic:Microsoft DeepSpeedExamples RLHF Hyperparameter Guide
| Knowledge Sources | |
|---|---|
| Domains | RLHF, Optimization, LLMs |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Step-specific hyperparameter recommendations for the three-step DeepSpeed-Chat RLHF pipeline: SFT uses 16 epochs with dropout enabled and weight decay disabled; reward model training uses 1 epoch with weight decay enabled and dropout disabled; RLHF PPO fine-tuning uses 1 epoch with EMA checkpointing enabled.
Description
The DeepSpeed-Chat RLHF pipeline has counter-intuitive hyperparameter requirements that differ across the three training steps. Each step has distinct optimal settings for weight decay, dropout, epoch count, and learning rate. These settings were empirically validated by the DeepSpeed team and diverge from common transformer training defaults. Incorrectly applying uniform hyperparameters across steps leads to degraded model quality.
Usage
Apply these settings when configuring any of the three RLHF training steps. These are particularly important when adapting training scripts for new model families (e.g., moving from OPT to LLaMA). The per-step settings override general best practices for transformer training.
The Insight (Rule of Thumb)
Step 1 - Supervised Fine-Tuning (SFT):
- Weight Decay: 0.0 (disabled)
- Dropout: Enabled
- Epochs: 16 (despite PPL plateauing at 1-2 epochs, longer training improves generation quality)
- Datasets: Use Dahoas/rm-static, Dahoas/full-hh-rlhf, Dahoas/synthetic-instruct-gptj-pairwise, yitingxie/rlhf-reward-datasets
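The Step 1 settings above can be collected into a config sketch. This is a minimal illustration, not the exact DeepSpeed-Chat CLI flags; the key names are assumptions, while the values come from the list above.

```python
# Illustrative Step 1 (SFT) hyperparameters. Key names are assumptions,
# not the actual DeepSpeed-Chat argument names; values follow the guide.
sft_config = {
    "weight_decay": 0.0,       # disabled: preserve pretrained knowledge
    "disable_dropout": False,  # dropout stays on to curb memorization
    "num_train_epochs": 16,    # well past the 1-2 epoch PPL plateau
    "datasets": [
        "Dahoas/rm-static",
        "Dahoas/full-hh-rlhf",
        "Dahoas/synthetic-instruct-gptj-pairwise",
        "yitingxie/rlhf-reward-datasets",
    ],
}
```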
Step 2 - Reward Model Training:
- Weight Decay: 0.1 (enabled)
- Dropout: Disabled
- Epochs: 1 (following InstructGPT; overfitting does not help the reward model)
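The Step 2 settings can be sketched the same way (again, key names are illustrative assumptions; values follow the list above):

```python
def reward_model_config():
    """Illustrative Step 2 (reward model) hyperparameters; key names
    are assumptions, not the actual DeepSpeed-Chat arguments."""
    return {
        "weight_decay": 0.1,      # regularize to generalize across pairs
        "disable_dropout": True,  # reward scores should be stable
        "num_train_epochs": 1,    # InstructGPT: overfitting does not help
    }
```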
Step 3 - RLHF PPO Fine-Tuning:
- Weight Decay: 0.0 (disabled for both actor and critic)
- Dropout: Disabled for actor, enabled for critic
- Epochs: 1 (reward score plateaus quickly)
- EMA Checkpoint: Enable for better generation quality
- Actor Learning Rate: 9.65e-6
- Critic Learning Rate: 5e-6 (roughly 2:1 actor-to-critic ratio)
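A sketch of the Step 3 settings: the two learning rates come from `run_1.3b.sh`, while the key names are illustrative assumptions. The ratio check makes the "roughly 2:1" claim concrete.

```python
# Illustrative Step 3 (PPO) hyperparameters. Learning rates match
# run_1.3b.sh; key names are assumptions, not the actual CLI flags.
actor_lr = 9.65e-6
critic_lr = 5e-6
ppo_config = {
    "actor_learning_rate": actor_lr,
    "critic_learning_rate": critic_lr,
    "actor_weight_decay": 0.0,        # disabled for both models
    "critic_weight_decay": 0.0,
    "disable_actor_dropout": True,
    "disable_critic_dropout": False,  # critic keeps dropout
    "enable_ema": True,
    "num_train_epochs": 1,            # reward score plateaus quickly
}
ratio = actor_lr / critic_lr  # roughly 2:1 (1.93x)
```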
Trade-off: Step 1 requires far more epochs than typical SFT recipes, increasing training cost, but this directly improves downstream RLHF quality.
Reasoning
The counter-intuitive settings are grounded in the distinct objectives of each step:
Step 1 (SFT): PPL (perplexity) is a poor proxy for generation quality. The model continues to improve its ability to generate coherent responses well past the point where PPL plateaus. Disabling weight decay preserves the pretrained knowledge while dropout prevents memorization.
Step 2 (RM): The reward model must generalize across response pairs. Weight decay acts as a regularizer, while a single epoch prevents overfitting to the training distribution (InstructGPT finding). Dropout is disabled because the reward model needs stable outputs.
Step 3 (RLHF): PPO training is inherently unstable. Disabling weight decay on the actor preserves the SFT capabilities, while EMA provides a smoothed checkpoint that reduces variance. The roughly 1.9x higher actor learning rate is an empirically tuned value; the actor's policy-gradient updates are driven by advantages computed from the critic's value estimates, so the two learning rates must be balanced against each other rather than set equal.
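The EMA checkpoint mentioned above is a standard exponential moving average over model parameters. A minimal pure-Python sketch (the decay value is illustrative, not necessarily the repo default; the real implementation operates on model state dicts):

```python
def ema_update(ema_params, params, decay=0.992):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.
    decay=0.992 is an illustrative value, not the repo's default."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# The EMA copy drifts slowly toward the live weights, smoothing out
# the step-to-step noise of PPO updates.
ema = [0.0, 1.0]
live = [1.0, 1.0]
ema = ema_update(ema, live)  # close to [0.008, 1.0]
```

Generating from the EMA weights rather than the latest PPO iterate is what yields the better generation quality noted in the insight above.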
Code Evidence:
Learning rate configuration from `training/step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh:33-34`:

```shell
Actor_Lr=9.65e-6
Critic_Lr=5e-6
```
EMA and sequence settings from `run_1.3b.sh:46-47,59`:

```shell
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--enable_ema \
```
Performance calculation factor from `dschat/utils/perf.py:19`:

```python
checkpoint_activations_factor = 4 if args.gradient_checkpointing else 3
```
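The factor above reflects the cost of activation checkpointing: a training step normally costs about 3x a forward pass (one forward plus a roughly 2x backward), and checkpointing adds one extra forward recompute, making it 4x. A self-contained sketch of that logic (the surrounding FLOPs formula in `perf.py` is omitted):

```python
def checkpoint_activations_factor(gradient_checkpointing: bool) -> int:
    # forward (1x) + backward (~2x) = ~3x a forward pass; activation
    # checkpointing recomputes the forward during backward, giving ~4x.
    return 4 if gradient_checkpointing else 3

# Enabling gradient checkpointing raises estimated compute by 4/3,
# traded against the activation memory it saves.
overhead = checkpoint_activations_factor(True) / checkpoint_activations_factor(False)
```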
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Create_HF_Model
- Implementation:Microsoft_DeepSpeedExamples_Create_Critic_Model
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeedPPOTrainer
- Principle:Microsoft_DeepSpeedExamples_Supervised_Fine_Tuning
- Principle:Microsoft_DeepSpeedExamples_Reward_Model_Training
- Principle:Microsoft_DeepSpeedExamples_PPO_Training