Principle:Microsoft DeepSpeedExamples PPO Training

From Leeroopedia


Sources

  • Paper: Proximal Policy Optimization Algorithms (arXiv:1707.06347)
  • Paper: InstructGPT — Training language models to follow instructions with human feedback (arXiv:2203.02155)

Domains

  • NLP
  • RLHF
  • Reinforcement_Learning

Overview

A reinforcement learning algorithm that fine-tunes a language model by maximizing a clipped surrogate objective, while a KL divergence penalty keeps the policy close to a frozen reference model.

Description

Proximal Policy Optimization (PPO) adapted for RLHF coordinates four models (an actor, a frozen reference copy of the actor, a critic, and a reward model) in a two-phase loop that alternates between experience generation and policy optimization:

Phase 1: Experience Generation

  1. The actor model generates text sequences given a batch of prompts.
  2. All four models process the generated sequences:
    • The actor computes log-probabilities of the generated tokens.
    • The reference computes log-probabilities for the same tokens (used for KL penalty).
    • The critic predicts per-token value estimates.
    • The reward model assigns a scalar reward score to each complete sequence.
  3. The per-token reward is computed by combining the reward signal with a KL divergence penalty between the actor and reference log-probabilities.
  4. Generalized Advantage Estimation (GAE) computes advantages and returns from the values and rewards.
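The collection step above can be sketched as a single function. The four models are passed in as plain callables here; the names (`generate`, `actor_logprobs`, and so on) are illustrative stand-ins, not DeepSpeed's actual API:

```python
import torch

def generate_experience(prompts, generate, actor_logprobs, ref_logprobs,
                        critic_values, reward_score):
    # Phase-1 sketch: sample sequences, then run all four models on them.
    # Callable names and signatures are hypothetical placeholders.
    seq = generate(prompts)                  # 1. actor samples sequences
    return {
        "seq": seq,
        "logprobs": actor_logprobs(seq),     # 2a. actor log-probs
        "ref_logprobs": ref_logprobs(seq),   # 2b. reference log-probs
        "values": critic_values(seq),        # 2c. per-token value estimates
        "score": reward_score(seq),          # 2d. scalar sequence reward
        # Steps 3-4 (KL-shaped per-token rewards and GAE) consume these
        # tensors next.
    }
```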

Phase 2: Policy Optimization

  1. The experience is split into mini-batches.
  2. For each mini-batch:
    • The actor recomputes log-probabilities on the generated sequences and computes the clipped surrogate loss.
    • The critic recomputes value predictions and computes the clipped value loss.
    • Both losses are backpropagated and optimizer steps are taken.

The training loop runs for a configurable number of PPO epochs per batch of experience, with optional overflow alignment between actor and critic gradients.
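The optimization phase can be sketched as follows. Here `actor_step` and `critic_step` are hypothetical callables that compute the clipped losses, backpropagate, and step their optimizers; the mini-batch shuffling shown is a common pattern, not necessarily the repository's exact scheduling:

```python
import torch

def ppo_update(experience, actor_step, critic_step,
               ppo_epochs=1, minibatch_size=4):
    # Phase-2 sketch: for each PPO epoch, shuffle the experience batch,
    # split it into mini-batches, and update actor and critic on each.
    n = experience["seq"].size(0)
    for _ in range(ppo_epochs):
        perm = torch.randperm(n)                      # reshuffle per epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            batch = {k: v[idx] for k, v in experience.items()}
            actor_step(batch)    # clipped surrogate loss + optimizer step
            critic_step(batch)   # clipped value loss + optimizer step
```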

Theoretical Basis

Actor Loss (Clipped Surrogate Objective)

The actor is trained to maximize the clipped surrogate objective:

L_actor = -min(r_t * A_t, clip(r_t, 1 - epsilon, 1 + epsilon) * A_t)

where:

  • r_t = exp(log pi_theta(a_t|s_t) - log pi_old(a_t|s_t)) is the probability ratio between the current and old policy
  • A_t is the advantage estimated via GAE
  • epsilon is the clipping parameter (default 0.2)

The clipping prevents destructively large policy updates by bounding the ratio to the interval [1 - epsilon, 1 + epsilon].
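A minimal PyTorch sketch of this objective, masked to the generated (non-padding) tokens. Tensor shapes are assumed to be (batch, seq_len); this is an illustration of the formula, not DeepSpeed's exact implementation:

```python
import torch

def actor_loss(logprobs, old_logprobs, advantages, mask, cliprange=0.2):
    # r_t = exp(log pi_theta - log pi_old)
    ratio = torch.exp(logprobs - old_logprobs)
    loss1 = -advantages * ratio                                  # unclipped
    loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange,
                                      1.0 + cliprange)           # clipped
    # Negated min of the two surrogates, averaged over generated tokens.
    return torch.sum(torch.max(loss1, loss2) * mask) / mask.sum()
```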

Critic Loss (Clipped Value Loss)

The critic is trained to minimize the clipped value loss:

L_critic = 0.5 * max((V - R_returns)^2, (clip(V, V_old - epsilon_v, V_old + epsilon_v) - R_returns)^2)

where:

  • V is the current value prediction
  • V_old is the value prediction from the experience generation phase
  • R_returns are the computed returns (advantages + values)
  • epsilon_v is the value clipping parameter (default 0.2)

Reward with KL Penalty

The per-token reward incorporates a KL divergence penalty to prevent the actor from diverging too far from the reference policy:

r_t = -beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))

At the final token of each sequence, the clipped reward model score is added:

r_T = r_T + clip(R(x, y), -c, c)

where:

  • beta is the KL penalty coefficient (default 0.1)
  • c is the reward clipping value (default 5)
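Putting the two formulas together, the per-token reward shaping can be sketched as below, assuming (batch, seq_len) log-probability tensors and a 0/1 mask over generated tokens:

```python
import torch

def compute_rewards(actor_logprobs, ref_logprobs, reward_score, mask,
                    kl_ctl=0.1, clip_reward_value=5.0):
    # Per-token reward: -beta * (log pi_theta - log pi_ref) on generated tokens.
    rewards = -kl_ctl * (actor_logprobs - ref_logprobs) * mask
    # Clipped reward-model score, added at the last generated token only.
    score = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
    last = mask.sum(dim=1).long() - 1          # index of final token per row
    rewards[torch.arange(rewards.size(0)), last] += score
    return rewards
```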

Generalized Advantage Estimation (GAE)

Advantages are computed using GAE with discount factor gamma (default 1.0) and GAE parameter lambda (default 0.95):

delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)

A_t = sum_{l=0}^{T-t} (gamma * lambda)^l * delta_{t+l}
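The sum above is usually computed with a backward recursion over the sequence. A sketch of that recursion (a standard way to implement GAE, not necessarily the repository's exact code):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # Backward pass over a (batch, T) reward/value sequence; the value
    # after the final token is taken to be zero.
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(rewards.size(0))
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t < T - 1 else torch.zeros(rewards.size(0))
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # delta_t
        last_gae = delta + gamma * lam * last_gae                   # A_t recursion
        advantages[:, t] = last_gae
    returns = advantages + values     # targets the critic is trained to fit
    return advantages, returns
```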

Key Hyperparameters

Parameter                    Default  Description
kl_ctl (beta)                0.1      KL penalty coefficient controlling divergence from the reference policy
cliprange (epsilon)          0.2      Clipping range for the actor surrogate objective
cliprange_value (epsilon_v)  0.2      Clipping range for the critic value loss
clip_reward_value (c)        5.0      Maximum absolute reward value from the reward model
gamma                        1.0      Discount factor for reward accumulation
lam (lambda)                 0.95     GAE lambda parameter for advantage estimation
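For reference, these defaults can be collected into a small config object. The field names follow the table's parameter column; they are not necessarily the repository's actual argument names:

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Defaults mirror the hyperparameter table above.
    kl_ctl: float = 0.1             # beta: KL penalty coefficient
    cliprange: float = 0.2          # epsilon: actor surrogate clipping
    cliprange_value: float = 0.2    # epsilon_v: critic value clipping
    clip_reward_value: float = 5.0  # c: reward-model score clipping
    gamma: float = 1.0              # discount factor
    lam: float = 0.95               # GAE lambda
```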
