Principle: Microsoft DeepSpeedExamples PPO Training
Sources
- Paper: Proximal Policy Optimization Algorithms — arXiv:1707.06347
- Paper: InstructGPT — Training language models to follow instructions with human feedback — arXiv:2203.02155
Domains
- NLP
- RLHF
- Reinforcement_Learning
Overview
A reinforcement learning algorithm that fine-tunes language models using clipped surrogate objectives with KL divergence penalties from a reference policy.
Description
Proximal Policy Optimization (PPO) adapted for RLHF operates in a two-phase loop that alternates between experience generation and policy optimization:
Phase 1: Experience Generation
- The actor model generates text sequences given a batch of prompts.
- All four models (actor, reference, critic, and reward model) process the generated sequences:
- The actor computes log-probabilities of the generated tokens.
- The reference computes log-probabilities for the same tokens (used for KL penalty).
- The critic predicts per-token value estimates.
- The reward model assigns a scalar reward score to each complete sequence.
- The per-token reward is computed by combining the reward signal with a KL divergence penalty between the actor and reference log-probabilities.
- Generalized Advantage Estimation (GAE) computes advantages and returns from the values and rewards.
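The first scoring step above (the actor computing per-token log-probabilities of its own generations) can be sketched in PyTorch. The shapes and tensor names here are illustrative assumptions, not the repository's actual code:

```python
import torch

# hypothetical shapes: batch B=2, sequence length T=4, vocab size V=8
B, T, V = 2, 4, 8
logits = torch.randn(B, T, V)          # actor forward pass over generated sequences
tokens = torch.randint(0, V, (B, T))   # the generated token ids

# log-probability of each generated token under the actor policy
logprobs = (
    torch.log_softmax(logits, dim=-1)
    .gather(-1, tokens.unsqueeze(-1))
    .squeeze(-1)
)  # shape [B, T]
```

The reference model runs the same computation with frozen weights, yielding the `log pi_ref` terms used later for the KL penalty.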
Phase 2: Policy Optimization
- The experience is split into mini-batches.
- For each mini-batch:
- The actor recomputes log-probabilities on the generated sequences and computes the clipped surrogate loss.
- The critic recomputes value predictions and computes the clipped value loss.
- Both losses are backpropagated and optimizer steps are taken.
The training loop runs for a configurable number of PPO epochs per batch of experience, with optional overflow alignment between actor and critic gradients.
Theoretical Basis
Actor Loss (Clipped Surrogate Objective)
The actor is trained to maximize the clipped surrogate objective:
L_actor = -min(r_t * A_t, clip(r_t, 1 - epsilon, 1 + epsilon) * A_t)
where:
- r_t = exp(log pi_theta(a_t|s_t) - log pi_old(a_t|s_t)) is the probability ratio between the current and old policy
- A_t is the advantage estimated via GAE
- epsilon is the clipping parameter (default 0.2)
The clipping prevents destructively large policy updates by bounding the ratio to the interval [1 - epsilon, 1 + epsilon].
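A minimal sketch of this loss in PyTorch (function and tensor names are assumptions; the mask zeroing out prompt and padding tokens mirrors the usual RLHF convention):

```python
import torch

def actor_loss(logprobs, old_logprobs, advantages, mask, cliprange=0.2):
    """Clipped surrogate objective, averaged over unmasked (response) tokens."""
    # probability ratio r_t between current and old policy
    ratio = torch.exp(logprobs - old_logprobs)
    # negated objectives: we minimize, so take the elementwise max
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    return torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / mask.sum()
```

When the current and old log-probabilities coincide, the ratio is 1 and the loss reduces to the negative mean advantage, as expected.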
Critic Loss (Clipped Value Loss)
The critic is trained to minimize the clipped value loss:
L_critic = 0.5 * max((V - R_returns)^2, (clip(V, V_old - epsilon_v, V_old + epsilon_v) - R_returns)^2)
where:
- V is the current value prediction
- V_old is the value prediction from the experience generation phase
- R_returns are the computed returns (advantages + values)
- epsilon_v is the value clipping parameter (default 0.2)
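A corresponding sketch of the clipped value loss (again, names and the mask convention are assumptions):

```python
import torch

def critic_loss(values, old_values, returns, mask, cliprange_value=0.2):
    """Clipped value loss, averaged over unmasked tokens."""
    # bound the new value prediction to a band around the old one
    values_clipped = torch.clamp(
        values, old_values - cliprange_value, old_values + cliprange_value
    )
    vf_loss1 = (values - returns) ** 2
    vf_loss2 = (values_clipped - returns) ** 2
    return 0.5 * torch.sum(torch.max(vf_loss1, vf_loss2) * mask) / mask.sum()
```

Taking the max of the clipped and unclipped squared errors means the critic is never rewarded for moving far from its old prediction in a single update.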
Reward with KL Penalty
The per-token reward incorporates a KL divergence penalty to prevent the actor from diverging too far from the reference policy:
r_t = -beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))
At the final token of each sequence, the clipped reward model score is added:
r_T = r_T + clip(R(x, y), -c, c)
where:
- beta is the KL penalty coefficient (default 0.1)
- c is the reward clipping value (default 5)
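The two equations above combine into one per-token reward tensor; a sketch under the same assumptions (per-sequence reward score, `[B, T]` log-probability tensors):

```python
import torch

def compute_rewards(logprobs, ref_logprobs, reward_score,
                    kl_ctl=0.1, clip_reward=5.0):
    """Per-token KL penalty, plus the clipped reward-model score at the last token."""
    # r_t = -beta * (log pi_theta - log pi_ref) at every token
    rewards = -kl_ctl * (logprobs - ref_logprobs)
    # add clip(R(x, y), -c, c) only at the final token of each sequence
    rewards[..., -1] += reward_score.clamp(-clip_reward, clip_reward)
    return rewards
```

Note that only the final token sees the reward model's score; intermediate tokens are shaped purely by the KL penalty.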
Generalized Advantage Estimation (GAE)
Advantages are computed using GAE with discount factor gamma (default 1.0) and GAE parameter lambda (default 0.95):
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
A_t = sum_{l=0}^{T-t} (gamma * lambda)^l * delta_{t+l}
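The backward recursion implied by these two equations can be sketched for a single sequence (1-D tensors; the function name is an assumption):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one sequence."""
    T = rewards.shape[-1]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        # bootstrap value is 0 past the end of the sequence
        nextvalue = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]  # TD residual delta_t
        lastgaelam = delta + gamma * lam * lastgaelam       # running (gamma*lam)-discounted sum
        advantages[t] = lastgaelam
    returns = advantages + values  # regression targets for the critic
    return advantages, returns
```

With gamma = 1.0 (the default here), advantages reduce to a lambda-weighted sum of TD residuals over the remainder of the sequence.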
Key Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| kl_ctl (beta) | 0.1 | KL penalty coefficient controlling divergence from reference |
| cliprange (epsilon) | 0.2 | Clipping range for the actor surrogate objective |
| cliprange_value (epsilon_v) | 0.2 | Clipping range for the critic value loss |
| clip_reward_value (c) | 5.0 | Maximum absolute reward value from the reward model |
| gamma | 1.0 | Discount factor for reward accumulation |
| lam (lambda) | 0.95 | GAE lambda parameter for advantage estimation |