Principle:LaurentMazare Tch rs Proximal Policy Optimization

Knowledge Sources	LaurentMazare_Tch_rs Proximal Policy Optimization Algorithms
Domains	Reinforcement Learning, Deep Learning
Last Updated	2026-02-08 00:00 GMT

Overview

Proximal Policy Optimization constrains policy updates by clipping the importance sampling ratio, enabling multiple optimization epochs per batch of experience while preventing destructively large policy changes.

Description

PPO is a policy gradient algorithm designed to achieve the stability and data efficiency of trust-region methods (like TRPO) while being much simpler to implement. Its main components are:

Clipped surrogate objective: PPO uses an importance sampling ratio $r_{t}(θ) = \frac{π_{θ}(a_{t} | s_{t})}{π_{θ_{\text{old}}}(a_{t} | s_{t})}$ to reuse data collected under the old policy. The key innovation is clipping this ratio to the interval $[1 - ϵ, 1 + ϵ]$, which removes the incentive for the policy to move too far from the data-collection policy. When the advantage is positive, the ratio is clipped from above (preventing excessive increase in action probability); when the advantage is negative, the ratio is clipped from below (preventing excessive decrease).

Multiple epochs per batch: Unlike vanilla policy gradient methods that use each batch of experience exactly once, PPO performs multiple optimization epochs over the same batch. The clipping mechanism ensures that these repeated updates do not cause the policy to deviate too far from the behavior policy, maintaining the validity of the importance sampling approximation.

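Combined objective: The overall objective combines the clipped policy surrogate, a value function loss (typically mean squared error between predicted and target values), and an entropy bonus that encourages exploration. Because the objective is maximized, the value loss enters with a negative sign and the entropy bonus with a positive sign.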

Generalized Advantage Estimation (GAE): PPO typically uses GAE to compute advantage estimates. A parameter $λ \in [0, 1]$ interpolates smoothly between a high-bias, low-variance estimator ($λ = 0$) and a low-bias, high-variance estimator ($λ = 1$).

Usage

PPO is widely used as a default algorithm for continuous and discrete control tasks, game playing, robotics, and fine-tuning language models with reinforcement learning from human feedback (RLHF). Its simplicity, stability, and strong empirical performance make it one of the most popular RL algorithms in practice.

Theoretical Basis

Importance Sampling Ratio:

$r_{t}(θ) = \frac{π_{θ}(a_{t} | s_{t})}{π_{θ_{\text{old}}}(a_{t} | s_{t})}$

Clipped Surrogate Objective:

$ℒ^{\text{CLIP}}(θ) = 𝔼_{t} [\min (r_{t}(θ) \hat{A}_{t}, \operatorname{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})]$

where $ϵ$ is the clipping parameter (typically 0.1 or 0.2) and ${\hat{A}}_{t}$ is the estimated advantage.

The clipping behavior is:

$\operatorname{clip}(r, 1 - ϵ, 1 + ϵ) = \begin{cases} 1 - ϵ & \text{if } r < 1 - ϵ \\ r & \text{if } 1 - ϵ \leq r \leq 1 + ϵ \\ 1 + ϵ & \text{if } r > 1 + ϵ \end{cases}$
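
The clipping rule and the per-sample surrogate term can be checked numerically. The following is a minimal sketch in plain Python, not code from tch-rs; the function names and the default `eps=0.2` are illustrative assumptions.

```python
def clip(r, lo, hi):
    """Clamp r to the closed interval [lo, hi]."""
    return max(lo, min(r, hi))


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


# Positive advantage: the term stops growing once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 2.0))   # uses 1.2 * 2.0
# Negative advantage: the term stops improving once the ratio drops below 1 - eps.
print(clipped_surrogate(0.5, -1.0))  # uses 0.8 * -1.0
# A ratio that moved in the harmful direction is not clipped,
# because min() keeps the more pessimistic of the two values.
print(clipped_surrogate(1.5, -1.0))  # uses 1.5 * -1.0
```

The third case shows why the objective takes a minimum rather than simply clipping the ratio: the bound is one-sided, so the penalty for a bad move is never capped.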

Generalized Advantage Estimation (GAE):

$\hat{A}_{t}^{\text{GAE}(γ, λ)} = \sum_{l = 0}^{\infty} (γ λ)^{l} δ_{t + l}$

where $δ_{t}$ is the TD residual, in which $r_{t}$ denotes the reward at step $t$ (not the probability ratio $r_{t}(θ)$):

$δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})$

Setting $λ = 1$ recovers the Monte Carlo advantage; $λ = 0$ gives the one-step TD advantage.
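
In practice the infinite sum is truncated at the end of the collected segment and evaluated with the backward recursion $\hat{A}_{t} = δ_{t} + γ λ \hat{A}_{t + 1}$. The sketch below is plain Python under simplifying assumptions: a single trajectory segment, no episode termination inside it, and a hypothetical `last_value` argument that supplies $V(s_{T})$ for bootstrapping.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory segment of length T.

    rewards[t] and values[t] cover steps 0..T-1; last_value is V(s_T),
    used to bootstrap the final TD residual. Episode boundaries are not
    handled in this sketch.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# lam = 0 keeps only the one-step TD residuals.
print(gae(rewards, values, last_value=0.0, gamma=0.9, lam=0.0))
# lam = 1 gives the discounted return minus the value estimate.
print(gae(rewards, values, last_value=0.0, gamma=0.9, lam=1.0))
```

With `lam=1.0` the first entry equals $1 + 0.9 + 0.81 - 0.5 = 2.21$, the discounted Monte Carlo return minus $V(s_{0})$, matching the limiting case stated above.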

Value Function Loss:

$ℒ^{\text{VF}}(ϕ) = 𝔼_{t} [(V_{ϕ}(s_{t}) - V_{t}^{\text{target}})^{2}]$

Entropy Bonus:

$S [π_{θ}] (s_{t}) = - \sum_{a} π_{θ} (a | s_{t}) \log π_{θ} (a | s_{t})$
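
For a discrete action space the entropy is a finite sum over action probabilities. The sketch below, in plain Python with illustrative helper names, computes it from a vector of logits and assumes a softmax (categorical) policy, which the formula above does not require.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = softmax([0.0, 0.0, 0.0, 0.0])
peaked = softmax([10.0, 0.0, 0.0, 0.0])
# A uniform policy has the maximum entropy log(4); a peaked one is near 0.
print(entropy(uniform), math.log(4))
print(entropy(peaked))
```

A policy that has collapsed onto one action has entropy close to zero, so adding the entropy term to the objective rewards keeping some probability mass on alternative actions.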

Combined PPO Objective:

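$ℒ(θ, ϕ) = 𝔼_{t} [ℒ_{t}^{\text{CLIP}}(θ) - c_{1} ℒ_{t}^{\text{VF}}(ϕ) + c_{2} S[π_{θ}](s_{t})]$

where $c_{1}$ and $c_{2}$ are non-negative coefficients that weight the value loss and the entropy bonus. The objective is maximized; an optimizer that minimizes should be given its negation.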
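
The three terms can be assembled over a mini-batch as follows. This is a plain-Python sketch that takes precomputed per-sample quantities as lists; the function name and the default coefficients `c1=0.5` and `c2=0.01` are illustrative assumptions, not values taken from the source.

```python
def ppo_objective(ratios, advantages, values, value_targets, entropies,
                  eps=0.2, c1=0.5, c2=0.01):
    """Mini-batch average of L_CLIP - c1 * L_VF + c2 * S (to be maximized)."""
    total = 0.0
    for r, adv, v, target, s in zip(ratios, advantages, values,
                                    value_targets, entropies):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        l_clip = min(r * adv, clipped_r * adv)   # clipped surrogate
        l_vf = (v - target) ** 2                 # squared value error
        total += l_clip - c1 * l_vf + c2 * s     # entropy enters as a bonus
    return total / len(ratios)


objective = ppo_objective(
    ratios=[1.0], advantages=[2.0],
    values=[1.0], value_targets=[0.0], entropies=[0.5],
)
loss = -objective  # what a minimizing optimizer would receive
print(objective, loss)
```

For the single sample shown, the objective is $2.0 - 0.5 \cdot 1.0 + 0.01 \cdot 0.5 = 1.505$.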

PPO Training Loop:

for each iteration:
    collect T steps from N parallel environments using pi(theta_old)
    compute advantages using GAE
    for epoch = 1 to K:
        for each mini-batch in collected data:
            compute clipped surrogate, value loss, entropy
            update theta via gradient ascent on combined objective
    theta_old := theta
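
The loop structure above can be exercised end to end on a toy problem. The sketch below is plain Python, not the tch-rs agent, and makes several simplifying assumptions that depart from the pseudocode: a hypothetical two-armed bandit with one-step episodes, a batch-mean baseline in place of a value network and GAE, no entropy bonus, and a hand-derived gradient of the clipped surrogate for a softmax policy. All hyperparameters are illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(iterations=50, batch_size=64, epochs=4, minibatch_size=16,
               eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # policy logits, one per arm
    arm_reward = [0.0, 1.0]     # arm 1 is the better action
    for _ in range(iterations):
        # Collect a batch with the frozen data-collection policy pi(theta_old).
        old_probs = softmax(theta)
        actions = [rng.choices([0, 1], weights=old_probs)[0]
                   for _ in range(batch_size)]
        rewards = [arm_reward[a] for a in actions]
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]
        indices = list(range(batch_size))
        for _ in range(epochs):                      # K epochs over one batch
            rng.shuffle(indices)
            for start in range(0, batch_size, minibatch_size):
                batch = indices[start:start + minibatch_size]
                probs = softmax(theta)
                grad = [0.0, 0.0]
                for i in batch:
                    a, adv = actions[i], advantages[i]
                    ratio = probs[a] / old_probs[a]
                    # When the clipped branch is the minimum, the gradient is zero.
                    if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                        continue
                    for k in range(2):
                        # d log pi(a) / d theta_k for a softmax policy
                        dlogp = (1.0 if k == a else 0.0) - probs[k]
                        grad[k] += adv * ratio * dlogp / len(batch)
                # Gradient ascent on the clipped surrogate.
                theta = [t + lr * g for t, g in zip(theta, grad)]
        # theta_old := theta happens implicitly when old_probs is recomputed.
    return softmax(theta)


final_probs = ppo_bandit()
print(final_probs)  # probability mass should shift toward arm 1
```

Samples whose ratio has already left the clipping interval in the favorable direction contribute no gradient, which is what bounds how far the repeated epochs can move the policy within one iteration.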

Related Pages

Implementation:LaurentMazare_Tch_rs_PPO_Agent
