Principle: Microsoft DeepSpeedExamples PPO Training
Sources
- Paper: Proximal Policy Optimization Algorithms — arXiv:1707.06347
- Paper: InstructGPT — Training language models to follow instructions with human feedback — arXiv:2203.02155
Domains
- NLP
- RLHF
- Reinforcement_Learning
Overview
A reinforcement learning algorithm that fine-tunes language models using clipped surrogate objectives with KL divergence penalties from a reference policy.
Description
Proximal Policy Optimization (PPO) adapted for RLHF operates in a two-phase loop that alternates between experience generation and policy optimization:
Phase 1: Experience Generation
- The actor model generates text sequences given a batch of prompts.
- All four models (actor, reference, critic, and reward model) process the generated sequences:
- The actor computes log-probabilities of the generated tokens.
- The reference computes log-probabilities for the same tokens (used for KL penalty).
- The critic predicts per-token value estimates.
- The reward model assigns a scalar reward score to each complete sequence.
- The per-token reward is computed by combining the reward signal with a KL divergence penalty between the actor and reference log-probabilities.
- Generalized Advantage Estimation (GAE) computes advantages and returns from the values and rewards.
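The first scoring step above (the actor computing per-token log-probabilities of its own generations) can be sketched in PyTorch. The shapes and tensor names here are illustrative assumptions, not the repository's actual code:

```python
import torch

# hypothetical shapes: batch B=2, sequence length T=4, vocab size V=8
B, T, V = 2, 4, 8
logits = torch.randn(B, T, V)          # actor forward pass over generated sequences
tokens = torch.randint(0, V, (B, T))   # the generated token ids

# log-probability of each generated token under the actor policy
logprobs = (
    torch.log_softmax(logits, dim=-1)
    .gather(-1, tokens.unsqueeze(-1))
    .squeeze(-1)
)  # shape [B, T]
```

The reference model runs the same computation with frozen weights, yielding the `log pi_ref` terms used later for the KL penalty.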
Phase 2: Policy Optimization
- The experience is split into mini-batches.
- For each mini-batch:
- The actor recomputes log-probabilities on the generated sequences and computes the clipped surrogate loss.
- The critic recomputes value predictions and computes the clipped value loss.
- Both losses are backpropagated and optimizer steps are taken.
The training loop runs for a configurable number of PPO epochs per batch of experience, with optional overflow alignment between actor and critic gradients.
Theoretical Basis
Actor Loss (Clipped Surrogate Objective)
The actor is trained to maximize the clipped surrogate objective:
L_actor = -min(r_t * A_t, clip(r_t, 1 - epsilon, 1 + epsilon) * A_t)
where:
- r_t = exp(log pi_theta(a_t|s_t) - log pi_old(a_t|s_t)) is the probability ratio between the current and old policy
- A_t is the advantage estimated via GAE
- epsilon is the clipping parameter (default 0.2)
The clipping prevents destructively large policy updates by bounding the ratio to the interval [1 - epsilon, 1 + epsilon].
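A minimal sketch of this loss in PyTorch (function and tensor names are assumptions; the mask zeroing out prompt and padding tokens mirrors the usual RLHF convention):

```python
import torch

def actor_loss(logprobs, old_logprobs, advantages, mask, cliprange=0.2):
    """Clipped surrogate objective, averaged over unmasked (response) tokens."""
    # probability ratio r_t between current and old policy
    ratio = torch.exp(logprobs - old_logprobs)
    # negated objectives: we minimize, so take the elementwise max
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    return torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / mask.sum()
```

When the current and old log-probabilities coincide, the ratio is 1 and the loss reduces to the negative mean advantage, as expected.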
Critic Loss (Clipped Value Loss)
The critic is trained to minimize the clipped value loss:
L_critic = 0.5 * max((V - R_returns)^2, (clip(V, V_old - epsilon_v, V_old + epsilon_v) - R_returns)^2)
where:
- V is the current value prediction
- V_old is the value prediction from the experience generation phase
- R_returns are the computed returns (advantages + values)
- epsilon_v is the value clipping parameter (default 0.2)
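A corresponding sketch of the clipped value loss (again, names and the mask convention are assumptions):

```python
import torch

def critic_loss(values, old_values, returns, mask, cliprange_value=0.2):
    """Clipped value loss, averaged over unmasked tokens."""
    # bound the new value prediction to a band around the old one
    values_clipped = torch.clamp(
        values, old_values - cliprange_value, old_values + cliprange_value
    )
    vf_loss1 = (values - returns) ** 2
    vf_loss2 = (values_clipped - returns) ** 2
    return 0.5 * torch.sum(torch.max(vf_loss1, vf_loss2) * mask) / mask.sum()
```

Taking the max of the clipped and unclipped squared errors means the critic is never rewarded for moving far from its old prediction in a single update.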
Reward with KL Penalty
The per-token reward incorporates a KL divergence penalty to prevent the actor from diverging too far from the reference policy:
r_t = -beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))
At the final token of each sequence, the clipped reward model score is added:
r_T = r_T + clip(R(x, y), -c, c)
where:
- beta is the KL penalty coefficient (default 0.1)
- c is the reward clipping value (default 5)
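The two equations above combine into one per-token reward tensor; a sketch under the same assumptions (per-sequence reward score, `[B, T]` log-probability tensors):

```python
import torch

def compute_rewards(logprobs, ref_logprobs, reward_score,
                    kl_ctl=0.1, clip_reward=5.0):
    """Per-token KL penalty, plus the clipped reward-model score at the last token."""
    # r_t = -beta * (log pi_theta - log pi_ref) at every token
    rewards = -kl_ctl * (logprobs - ref_logprobs)
    # add clip(R(x, y), -c, c) only at the final token of each sequence
    rewards[..., -1] += reward_score.clamp(-clip_reward, clip_reward)
    return rewards
```

Note that only the final token sees the reward model's score; intermediate tokens are shaped purely by the KL penalty.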
Generalized Advantage Estimation (GAE)
Advantages are computed using GAE with discount factor gamma (default 1.0) and GAE parameter lambda (default 0.95):
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
A_t = sum_{l=0}^{T-t} (gamma * lambda)^l * delta_{t+l}
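The backward recursion implied by these two equations can be sketched for a single sequence (1-D tensors; the function name is an assumption):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one sequence."""
    T = rewards.shape[-1]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        # bootstrap value is 0 past the end of the sequence
        nextvalue = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]  # TD residual delta_t
        lastgaelam = delta + gamma * lam * lastgaelam       # running (gamma*lam)-discounted sum
        advantages[t] = lastgaelam
    returns = advantages + values  # regression targets for the critic
    return advantages, returns
```

With gamma = 1.0 (the default here), advantages reduce to a lambda-weighted sum of TD residuals over the remainder of the sequence.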
Key Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| kl_ctl (beta) | 0.1 | KL penalty coefficient controlling divergence from reference |
| cliprange (epsilon) | 0.2 | Clipping range for the actor surrogate objective |
| cliprange_value (epsilon_v) | 0.2 | Clipping range for the critic value loss |
| clip_reward_value (c) | 5.0 | Maximum absolute reward value from the reward model |
| gamma | 1.0 | Discount factor for reward accumulation |
| lam (lambda) | 0.95 | GAE lambda parameter for advantage estimation |