Principle: Microsoft DeepSpeedExamples RLHF Engine Initialization
Sources
- Paper: InstructGPT — arXiv:2203.02155
- Paper: Proximal Policy Optimization Algorithms — arXiv:1707.06347
Domains
- NLP
- RLHF
- Distributed_Training
Overview
An initialization pattern that sets up the four-model architecture (actor, reference, critic, reward) required for Proximal Policy Optimization in language model alignment.
Description
Reinforcement Learning from Human Feedback (RLHF) requires four models to be loaded and managed simultaneously during training:
- Actor (pi_theta) — The policy model that generates text. This is the model being actively fine-tuned. It is initialized from a supervised fine-tuning (SFT) checkpoint and equipped with an optimizer and learning rate scheduler.
- Reference (pi_ref) — A frozen copy of the actor at the start of training. It is used to compute the KL divergence penalty that prevents the actor from diverging too far from the original policy.
- Critic (V_phi) — A value network that predicts the expected return at each token position. It is initialized from a reward model checkpoint and trained alongside the actor using its own optimizer and learning rate scheduler.
- Reward (R) — A frozen reward model that provides scalar reward signals for generated sequences. It shares the same architecture and checkpoint origin as the critic but receives no gradient updates.
Each of these models can be assigned a different ZeRO optimization stage to balance memory usage and communication overhead. For example, a common configuration uses ZeRO Stage 3 for the actor (the largest model, requiring full parameter sharding) and ZeRO Stage 0 for the reward model (frozen, requiring no optimizer states). If the actor uses ZeRO-3, the reference model also uses ZeRO-3 to ensure consistent parameter gathering; otherwise, the reference defaults to ZeRO-0.
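As a minimal sketch, the per-model ZeRO assignment described above can be expressed as a set of DeepSpeed config dictionaries. The `zero_optimization` and `offload_param` keys are standard DeepSpeed config fields; the helper name, batch size, and the critic's ZeRO stage are illustrative assumptions:

```python
def make_ds_config(zero_stage, offload=False):
    """Build a minimal DeepSpeed config for one model (illustrative helper)."""
    cfg = {
        "train_micro_batch_size_per_gpu": 4,  # illustrative value
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }
    if offload:
        # CPU offloading of parameters, useful for the frozen models.
        cfg["zero_optimization"]["offload_param"] = {"device": "cpu"}
    return cfg

actor_zero_stage = 3
# The reference mirrors the actor's ZeRO-3 setting so parameter gathering
# behaves consistently; otherwise it falls back to ZeRO-0.
ref_zero_stage = 3 if actor_zero_stage == 3 else 0

configs = {
    "actor": make_ds_config(actor_zero_stage),
    "reference": make_ds_config(ref_zero_stage, offload=True),
    "critic": make_ds_config(2),                 # assumed stage for the critic
    "reward": make_ds_config(0, offload=True),   # frozen: no optimizer states
}
```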
The engine pattern also supports:
- LoRA (Low-Rank Adaptation) for the actor and critic, reducing the number of trainable parameters.
- Exponential Moving Average (EMA) of actor weights for stabilized evaluation.
- CPU offloading for optimizer states and reference/reward model parameters.
- Hybrid engine with tensor parallelism for inference during generation.
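Of these features, the EMA of actor weights is simple to sketch in isolation. A minimal version, assuming flat lists of parameter values (the decay constant is an illustrative choice, not the repository's default):

```python
def ema_update(ema_params, actor_params, decay=0.5):
    """Exponential moving average of actor weights (sketch).

    Keeps a shadow copy that tracks the actor slowly, smoothing out
    step-to-step noise for stabilized evaluation.
    """
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, actor_params)]

# Toy example: shadow weight starts at 0.0, actor weight is fixed at 1.0.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0], decay=0.5)
# ema -> [0.875], i.e. the shadow converges toward the actor geometrically.
```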
Theoretical Basis
PPO applied to language model alignment requires the following components operating simultaneously:
| Component | Symbol | Role | Trainable |
|---|---|---|---|
| Actor | pi_theta | Generates text sequences given prompts | Yes |
| Reference | pi_ref | Computes KL divergence penalty KL(pi_theta \|\| pi_ref) | No (frozen) |
| Critic | V_phi | Estimates advantage function A_t = R_t - V_phi(s_t) | Yes |
| Reward | R | Provides scalar reward signal for complete sequences | No (frozen) |
The reward for each generated token is augmented with a KL penalty:
r_t = -beta * log(pi_theta(a_t | s_t) / pi_ref(a_t | s_t)) + 1[t = T] * R(x, y)
where beta is a coefficient controlling the strength of the KL penalty, and the sequence-level reward R(x, y) is added only at the final token T. This formulation ensures the actor does not deviate excessively from the reference policy, which could otherwise lead to reward hacking or degenerate outputs.
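As a sketch, DeepSpeed-Chat-style implementations compute this per-token reward from the actor and reference log-probabilities, adding the scalar sequence reward at the last token. Function and variable names here are illustrative:

```python
def kl_penalized_rewards(logprobs, ref_logprobs, seq_reward, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) at every
    token, plus the scalar sequence reward at the final token.

    logprobs / ref_logprobs: per-token log-probs of the sampled tokens
    under the actor and the frozen reference, respectively.
    """
    rewards = [-beta * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
    rewards[-1] += seq_reward
    return rewards

# Example: actor matches the reference exactly, so the KL term vanishes
# and the only nonzero reward is the sequence score at the last token.
r = kl_penalized_rewards([-1.0, -2.0], [-1.0, -2.0], seq_reward=0.5)
# r == [0.0, 0.5]
```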
The initialization step must configure each model with the appropriate:
- DeepSpeed configuration (ZeRO stage, offloading, mixed precision)
- Optimizer (only for trainable models: actor and critic)
- Learning rate scheduler (only for trainable models)
- Gradient checkpointing (optional, for memory savings)
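The checklist above can be summarized as a small per-model initialization plan. This is a sketch under the example configuration from the Description (actor and reference on ZeRO-3, reward on ZeRO-0); the critic's ZeRO stage is an illustrative assumption:

```python
# Sketch: which initialization pieces each model receives.
INIT_PLAN = {
    "actor":     {"trainable": True,  "optimizer": True,  "scheduler": True,  "zero_stage": 3},
    "reference": {"trainable": False, "optimizer": False, "scheduler": False, "zero_stage": 3},
    "critic":    {"trainable": True,  "optimizer": True,  "scheduler": True,  "zero_stage": 2},
    "reward":    {"trainable": False, "optimizer": False, "scheduler": False, "zero_stage": 0},
}

# Invariant: only the trainable models (actor, critic) get an optimizer
# and a learning rate scheduler; frozen models get neither.
assert all(p["optimizer"] == p["trainable"] == p["scheduler"]
           for p in INIT_PLAN.values())
```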