Principle:Hiyouga LLaMA Factory Proximal Policy Optimization

Knowledge Sources	Hiyouga_LLaMA_Factory; Proximal Policy Optimization Algorithms; Training language models to follow instructions with human feedback
Domains	Reinforcement Learning, Natural Language Processing, Language Model Alignment
Last Updated	2026-02-06 19:00 GMT

Overview

A reinforcement learning from human feedback (RLHF) training paradigm that uses a learned reward model and the proximal policy optimization algorithm to align language model outputs with human preferences.

Description

Proximal Policy Optimization (PPO) applied to language model alignment is the canonical RLHF approach popularized by OpenAI's InstructGPT (Ouyang et al., 2022). The method involves three models working together: a policy model that generates responses, a reward model that scores those responses, and a reference model, typically a frozen copy of the initial policy, that provides a KL-divergence anchor to prevent the policy from drifting too far from its starting distribution.

The PPO-based RLHF pipeline operates in an online fashion:

The policy model generates responses to prompts from the training dataset.
The reward model scores each generated response.
The PPO algorithm updates the policy to maximize the reward while staying close to the reference model.

This approach is more complex than offline methods such as Direct Preference Optimization (DPO) but offers several advantages:

Online learning: The model learns from its own generations rather than a fixed dataset, enabling exploration.
Flexible reward signal: The reward model can encode complex, non-decomposable preferences.
Proven effectiveness: PPO-based RLHF has been demonstrated at scale in production systems (ChatGPT, Claude).
Value head: A learned value function estimates expected future rewards, reducing variance in policy gradient estimates.

The main challenges include training instability, computational cost (multiple models in memory), and sensitivity to hyperparameters.

Usage

Use PPO-based RLHF when you want to:

Align a language model using a separately trained reward model.
Perform online exploration where the model learns from its own generated responses.
Apply sophisticated reward shaping that goes beyond pairwise preferences.
Follow the classic InstructGPT/ChatGPT alignment pipeline.

PPO is most appropriate when you have a reliable reward model, sufficient compute for multi-model training, and the need for online policy improvement.

Theoretical Basis

RLHF Objective

The RLHF objective maximizes expected reward while constraining the KL divergence from the reference policy:

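$\max_{\pi_{\theta}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)} \left[ R(x, y) - \beta \, \mathrm{KL}\left( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]$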

where $R (x, y)$ is the reward model score, $π_{θ}$ is the policy, $π_{ref}$ is the reference model, and $β$ is the KL penalty coefficient.
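A minimal sketch of how the KL term is often folded into the reward signal in practice. It assumes a convention common in PPO-based RLHF implementations, not something stated on this page: the KL divergence is approximated per token by the log-probability difference between the policy and the reference model, and the scalar reward model score is added on the final token. The function name and arguments are hypothetical.

```python
def kl_penalized_rewards(score, policy_logprobs, ref_logprobs, beta=0.1):
    """Return per-token rewards for one generated response.

    Assumption: the KL term is estimated per token as
    log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t), scaled by beta,
    and the reward model score R(x, y) is credited to the last token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")

    # Per-token KL penalty: negative when the policy assigns more
    # probability to the sampled token than the reference model does.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

    # The sequence-level score arrives only once the response is complete.
    rewards[-1] += score
    return rewards
```

With this shaping, a larger $β$ makes every token that deviates from the reference model more costly, which is how the constraint in the objective shows up during optimization.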

PPO Clipped Objective

PPO optimizes a clipped surrogate objective to ensure stable updates:

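$\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_{t}(\theta) \hat{A}_{t},\ \operatorname{clip}\left( r_{t}(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{t} \right) \right]$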

where $r_{t} (θ) = \frac{π_{θ} (a_{t} ∣ s_{t})}{π_{θ_{old}} (a_{t} ∣ s_{t})}$ is the probability ratio, ${\hat{A}}_{t}$ is the estimated advantage, and $ϵ$ is the clipping parameter. The clipping prevents excessively large policy updates that could destabilize training.
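The clipped objective can be illustrated with a short pure-Python sketch. It computes the probability ratio from log-probabilities (the form usually available from a language model) and averages the clipped term over tokens. The function name and the default clipping value of 0.2 are illustrative assumptions, not values taken from this page.

```python
import math


def clipped_surrogate(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Average PPO clipped surrogate over a batch of tokens (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

Note the asymmetry: with a positive advantage the gain is capped once the ratio exceeds $1 + ϵ$, but with a negative advantage a ratio above $1 + ϵ$ is not clipped, so the objective still penalizes moves that make a bad action more likely.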

Generalized Advantage Estimation

The advantage function is estimated using Generalized Advantage Estimation (GAE):

${\hat{A}}_{t} = \sum_{l = 0}^{T - t} (γ λ)^{l} δ_{t + l}$

where $δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})$ is the TD residual, $r_{t}$ is the reward received at step $t$ (not to be confused with the probability ratio $r_{t} (θ)$ in the clipped objective), $γ$ is the discount factor, and $λ$ is the GAE parameter. In the language model setting, the value function $V (s_{t})$ is provided by a value head: a linear layer appended to the language model that estimates the expected cumulative reward from each token position.
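The GAE sum is usually evaluated with a backward recursion, $\hat{A}_{t} = δ_{t} + γ λ \hat{A}_{t + 1}$, rather than by summing the series directly. The sketch below assumes one response with per-token rewards and value-head outputs, and treats the value after the final token as zero. The function name and default hyperparameters are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages for one response.

    rewards[t] is the reward at token t and values[t] is V(s_t).
    Assumption: the episode ends after the last token, so V is 0 there.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Backward recursion equivalent to the discounted sum of residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD residual (low variance, more bias), while `lam=1` with `gamma=1` gives the full return minus the value estimate (low bias, more variance).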

Reward Computation

The reward for a generated response is obtained from:

Full reward model: A separate model with a value head that outputs a scalar reward at the last token position.
LoRA reward model: The same base model with a separate LoRA adapter switched in to provide reward scores.
API reward model: An external reward service that scores generated text.

Rewards can optionally be whitened (normalized to zero mean and unit variance) or score-normalized to stabilize training.
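Whitening as described above can be sketched in a few lines. This is a generic illustration rather than the framework's own routine: the function name, the `shift_mean` switch, and the small `eps` added for numerical stability are all assumptions.

```python
import math


def whiten_rewards(rewards, shift_mean=True, eps=1e-8):
    """Rescale a batch of rewards to unit variance, optionally zero mean."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    # eps guards against division by zero when all rewards are identical.
    scale = 1.0 / math.sqrt(variance + eps)
    whitened = [(r - mean) * scale for r in rewards]
    if not shift_mean:
        # Keep the original mean and only normalize the spread.
        whitened = [w + mean for w in whitened]
    return whitened
```

Normalizing the reward scale in this way keeps the magnitude of the advantages, and therefore of the policy updates, roughly constant even when the raw reward model scores drift during training.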

Training Loop

Each PPO step involves:

Generation phase: The policy generates responses for a buffer of prompts.
Reward phase: The reward model scores each generated response.
Optimization phase: Multiple PPO epochs update the policy on the collected buffer, using the clipped objective and value function loss.

The buffer size, mini-batch size, and number of PPO epochs per buffer are key hyperparameters that control the trade-off between sample efficiency and computational cost.
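To make the interaction of these three hyperparameters concrete, the sketch below shows one plausible way the optimization phase could walk over a collected buffer: each PPO epoch reshuffles the buffer and splits it into mini-batches. The generator name and the reshuffle-per-epoch choice are assumptions for illustration, not a description of any particular trainer.

```python
import random


def ppo_minibatches(buffer, mini_batch_size, ppo_epochs, seed=0):
    """Yield (epoch, mini_batch) pairs covering the buffer once per epoch."""
    rng = random.Random(seed)
    for epoch in range(ppo_epochs):
        indices = list(range(len(buffer)))
        # Reshuffle each epoch so mini-batch composition varies.
        rng.shuffle(indices)
        for start in range(0, len(indices), mini_batch_size):
            batch_indices = indices[start:start + mini_batch_size]
            yield epoch, [buffer[i] for i in batch_indices]
```

For a buffer of 8 samples, a mini-batch size of 4, and 2 PPO epochs, this yields 4 gradient updates from a single round of generation and scoring. Raising the number of epochs reuses the samples more often, at the risk of the policy moving far from the one that generated them.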
