Principle: Microsoft DeepSpeedExamples RLHF Engine Initialization
Sources
- Paper: InstructGPT — arXiv:2203.02155
- Paper: Proximal Policy Optimization Algorithms — arXiv:1707.06347
Domains
- NLP
- RLHF
- Distributed_Training
Overview
An initialization pattern that sets up the four-model architecture (actor, reference, critic, reward) required for Proximal Policy Optimization in language model alignment.
Description
Reinforcement Learning from Human Feedback (RLHF) requires four models to be loaded and managed simultaneously during training:
- Actor (pi_theta) — The policy model that generates text. This is the model being actively fine-tuned. It is initialized from a supervised fine-tuning (SFT) checkpoint and equipped with an optimizer and learning rate scheduler.
- Reference (pi_ref) — A frozen copy of the actor at the start of training. It is used to compute the KL divergence penalty that prevents the actor from diverging too far from the original policy.
- Critic (V_phi) — A value network that predicts the expected return at each token position. It is initialized from a reward model checkpoint and trained alongside the actor using its own optimizer and learning rate scheduler.
- Reward (R) — A frozen reward model that provides scalar reward signals for generated sequences. It shares the same architecture and checkpoint origin as the critic but receives no gradient updates.
Each of these models can be assigned a different ZeRO optimization stage to balance memory usage and communication overhead. For example, a common configuration uses ZeRO Stage 3 for the actor (the largest model, requiring full parameter sharding) and ZeRO Stage 0 for the reward model (frozen, requiring no optimizer states). If the actor uses ZeRO-3, the reference model also uses ZeRO-3 to ensure consistent parameter gathering; otherwise, the reference defaults to ZeRO-0.
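As a minimal sketch, the per-model ZeRO assignment described above can be expressed as a set of DeepSpeed config dictionaries. The `zero_optimization` and `offload_param` keys are standard DeepSpeed config fields; the helper name, batch size, and the critic's ZeRO stage are illustrative assumptions:

```python
def make_ds_config(zero_stage, offload=False):
    """Build a minimal DeepSpeed config for one model (illustrative helper)."""
    cfg = {
        "train_micro_batch_size_per_gpu": 4,  # illustrative value
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }
    if offload:
        # CPU offloading of parameters, useful for the frozen models.
        cfg["zero_optimization"]["offload_param"] = {"device": "cpu"}
    return cfg

actor_zero_stage = 3
# The reference mirrors the actor's ZeRO-3 setting so parameter gathering
# behaves consistently; otherwise it falls back to ZeRO-0.
ref_zero_stage = 3 if actor_zero_stage == 3 else 0

configs = {
    "actor": make_ds_config(actor_zero_stage),
    "reference": make_ds_config(ref_zero_stage, offload=True),
    "critic": make_ds_config(2),                 # assumed stage for the critic
    "reward": make_ds_config(0, offload=True),   # frozen: no optimizer states
}
```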
The engine pattern also supports:
- LoRA (Low-Rank Adaptation) for the actor and critic, reducing the number of trainable parameters.
- Exponential Moving Average (EMA) of actor weights for stabilized evaluation.
- CPU offloading for optimizer states and reference/reward model parameters.
- Hybrid engine with tensor parallelism for inference during generation.
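Of these features, the EMA of actor weights is simple to sketch in isolation. A minimal version, assuming flat lists of parameter values (the decay constant is an illustrative choice, not the repository's default):

```python
def ema_update(ema_params, actor_params, decay=0.5):
    """Exponential moving average of actor weights (sketch).

    Keeps a shadow copy that tracks the actor slowly, smoothing out
    step-to-step noise for stabilized evaluation.
    """
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, actor_params)]

# Toy example: shadow weight starts at 0.0, actor weight is fixed at 1.0.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0], decay=0.5)
# ema -> [0.875], i.e. the shadow converges toward the actor geometrically.
```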
Theoretical Basis
PPO applied to language model alignment requires the following components operating simultaneously:
| Component | Symbol | Role | Trainable |
|---|---|---|---|
| Actor | pi_theta | Generates text sequences given prompts | Yes |
| Reference | pi_ref | Computes KL divergence penalty KL(pi_theta \|\| pi_ref) | No (frozen) |
| Critic | V_phi | Estimates advantage function A_t = R_t - V_phi(s_t) | Yes |
| Reward | R | Provides scalar reward signal for complete sequences | No (frozen) |
The reward for each generated token is augmented with a KL penalty:
r_t = -beta * log(pi_theta(a_t | s_t) / pi_ref(a_t | s_t)) + 1[t = T] * R(x, y)
where beta is a coefficient controlling the strength of the KL penalty, and the sequence-level reward R(x, y) is added only at the final token T. This formulation ensures the actor does not deviate excessively from the reference policy, which could otherwise lead to reward hacking or degenerate outputs.
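As a sketch, DeepSpeed-Chat-style implementations compute this per-token reward from the actor and reference log-probabilities, adding the scalar sequence reward at the last token. Function and variable names here are illustrative:

```python
def kl_penalized_rewards(logprobs, ref_logprobs, seq_reward, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) at every
    token, plus the scalar sequence reward at the final token.

    logprobs / ref_logprobs: per-token log-probs of the sampled tokens
    under the actor and the frozen reference, respectively.
    """
    rewards = [-beta * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
    rewards[-1] += seq_reward
    return rewards

# Example: actor matches the reference exactly, so the KL term vanishes
# and the only nonzero reward is the sequence score at the last token.
r = kl_penalized_rewards([-1.0, -2.0], [-1.0, -2.0], seq_reward=0.5)
# r == [0.0, 0.5]
```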
The initialization step must configure each model with the appropriate:
- DeepSpeed configuration (ZeRO stage, offloading, mixed precision)
- Optimizer (only for trainable models: actor and critic)
- Learning rate scheduler (only for trainable models)
- Gradient checkpointing (optional, for memory savings)
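The checklist above can be summarized as a small per-model initialization plan. This is a sketch under the example configuration from the Description (actor and reference on ZeRO-3, reward on ZeRO-0); the critic's ZeRO stage is an illustrative assumption:

```python
# Sketch: which initialization pieces each model receives.
INIT_PLAN = {
    "actor":     {"trainable": True,  "optimizer": True,  "scheduler": True,  "zero_stage": 3},
    "reference": {"trainable": False, "optimizer": False, "scheduler": False, "zero_stage": 3},
    "critic":    {"trainable": True,  "optimizer": True,  "scheduler": True,  "zero_stage": 2},
    "reward":    {"trainable": False, "optimizer": False, "scheduler": False, "zero_stage": 0},
}

# Invariant: only the trainable models (actor, critic) get an optimizer
# and a learning rate scheduler; frozen models get neither.
assert all(p["optimizer"] == p["trainable"] == p["scheduler"]
           for p in INIT_PLAN.values())
```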