Principle:Haosulab ManiSkill PPO Policy Optimization
| Field | Value |
|---|---|
| principle_name | Haosulab_ManiSkill_PPO_Policy_Optimization |
| overview | Proximal Policy Optimization with clipped surrogate objective, Generalized Advantage Estimation, and value function fitting |
| domains | Reinforcement_Learning |
| last_updated | 2026-02-15 |
| related_pages | Implementation:Haosulab_ManiSkill_PPO_Training_Loop |
Overview
Description
Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that updates the policy using data collected from the current policy. It addresses the fundamental challenge of policy gradient methods: how to take the largest possible improvement step without causing the policy to degrade. PPO achieves this through a clipped surrogate objective that prevents excessively large policy updates.
The optimization phase in the ManiSkill PPO pipeline consists of four main components:
1. Generalized Advantage Estimation (GAE)
After collecting a rollout of transitions, advantages are computed using GAE, which provides a bias-variance tradeoff between Monte Carlo returns (low bias, high variance) and TD(0) estimates (high bias, low variance). GAE computes a weighted sum of n-step TD errors:
The temporal difference error at step t is:
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
The GAE advantage is computed recursively:
A_t = delta_t + (gamma * lambda) * (1 - done_{t+1}) * A_{t+1}
where gamma is the discount factor and lambda is the GAE lambda parameter. The (1 - done) term cuts the advantage propagation at episode boundaries.
Returns are then: G_t = A_t + V(s_t)
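The formulas above can be sketched as a backward recursion over a single environment's rollout. This is a minimal NumPy illustration (the function name is illustrative; the actual ManiSkill baseline vectorizes this computation across parallel environments on the GPU):

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, gamma=0.8, gae_lambda=0.9):
    """Backward-recursive GAE for one environment's rollout of T steps.

    rewards, values, dones: arrays of shape (T,); dones[t] marks whether the
    episode terminated after step t. next_value is V(s_T), the bootstrap value
    for the state following the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        v_next = next_value if t == T - 1 else values[t + 1]
        # TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
        # with V(s_{t+1}) zeroed at episode boundaries
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        # A_t = delta_t + gamma * lambda * (1 - done) * A_{t+1}
        last_adv = delta + gamma * gae_lambda * nonterminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values  # G_t = A_t + V(s_t)
    return advantages, returns
```

With gamma = lambda = 1 and no terminations, this reduces to Monte Carlo returns minus the value baseline, matching the limiting case described below.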
2. Clipped Surrogate Objective
The core PPO loss uses an importance sampling ratio between the new and old policies (written rho_t here to avoid clashing with the reward r_t):
rho_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) = exp(log_prob_new - log_prob_old)
The clipped surrogate loss is:
L^CLIP = E[ min(rho_t * A_t, clip(rho_t, 1 - epsilon, 1 + epsilon) * A_t) ]
This formulation ensures that when the advantage is positive (good action), the ratio is clipped from above at 1 + epsilon, preventing overly optimistic updates. When the advantage is negative (bad action), the ratio is clipped from below at 1 - epsilon, capping the benefit of pushing a bad action's probability down. The clipping coefficient epsilon (typically 0.2) controls the maximum effective policy change per update.
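A minimal sketch of the ratio and clipped loss, assuming log-probabilities are already gathered from the rollout (the function name and NumPy usage are illustrative; the real implementation operates on PyTorch tensors):

```python
import numpy as np

def clipped_surrogate_loss(new_logprob, old_logprob, advantages, clip_coef=0.2):
    """PPO clipped policy loss, negated so it can be minimized.

    Assumes advantages are already normalized; a sketch, not the exact
    ManiSkill implementation.
    """
    ratio = np.exp(new_logprob - old_logprob)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # Maximize min(unclipped, clipped) == minimize its negation.
    return -np.minimum(unclipped, clipped).mean()
```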
3. Combined Loss Function
The total loss combines three terms:
L = -L^CLIP - c_ent * H(pi) + c_vf * L^VF
where:
- L^CLIP: Clipped surrogate objective from above (negated, because the optimizer minimizes the total loss)
- H(pi): Entropy bonus (encourages exploration; subtracted so higher entropy lowers the loss)
- L^VF: Value function loss (MSE between critic predictions and computed returns)
- c_ent: Entropy coefficient (default: 0.0 in ManiSkill PPO)
- c_vf: Value function coefficient (default: 0.5)
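The combination of the three terms might be sketched as follows, assuming the policy loss is passed in already negated and using the common 0.5 * MSE convention for the value loss (an assumption here; variants with value-loss clipping are omitted):

```python
import numpy as np

def ppo_total_loss(pg_loss, entropy, value_pred, returns, ent_coef=0.0, vf_coef=0.5):
    """Combine the three PPO loss terms into the scalar the optimizer minimizes.

    pg_loss: already-negated clipped surrogate loss.
    entropy: mean policy entropy over the minibatch.
    """
    v_loss = 0.5 * np.mean((value_pred - returns) ** 2)
    return pg_loss - ent_coef * entropy + vf_coef * v_loss
```

With the ManiSkill default ent_coef = 0.0, the entropy term drops out and the loss is just the policy loss plus half the weighted value error.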
4. Minibatch Optimization
The collected batch of data is split into minibatches and optimized over multiple epochs. Each epoch shuffles the data and processes it in minibatches to improve sample efficiency while maintaining update stability.
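The shuffle-and-split scheme can be sketched as an index generator (an illustrative helper, not part of the ManiSkill codebase):

```python
import numpy as np

def minibatch_indices(batch_size, num_minibatches, rng):
    """Yield shuffled index arrays for one optimization epoch.

    Assumes batch_size is divisible by num_minibatches, as with the ManiSkill
    defaults where batch_size = num_envs * num_steps.
    """
    inds = rng.permutation(batch_size)
    mb_size = batch_size // num_minibatches
    for start in range(0, batch_size, mb_size):
        yield inds[start:start + mb_size]
```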
Usage
Use PPO policy optimization as the update phase of on-policy RL training. It is applied after rollout collection and advantage computation:
- Collect rollout data with the current policy (see GPU Parallelized Rollout)
- Compute GAE advantages and returns
- For update_epochs epochs:
  - Shuffle the batch
  - For each minibatch:
    - Re-evaluate actions under the current policy (get new log-probs, entropy, values)
    - Compute the importance sampling ratio and clipped surrogate loss
    - Compute the value loss and entropy bonus
    - Backpropagate the combined loss
    - Clip gradients and update parameters
  - Optionally check KL divergence for early stopping
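The overall control flow of these steps might look like the following sketch, with the per-minibatch work abstracted behind a hypothetical update_minibatch callback that performs the re-evaluation, loss computation, and parameter update, and returns the approximate KL divergence:

```python
import numpy as np

def ppo_update(batch, update_minibatch, update_epochs=4, num_minibatches=32,
               target_kl=0.1, seed=0):
    """Structural sketch of the PPO update phase (names are illustrative).

    batch: dict of flattened rollout arrays, e.g. batch["advantages"].
    update_minibatch(batch, inds): runs one minibatch step, returns approx KL.
    """
    rng = np.random.default_rng(seed)
    batch_size = len(batch["advantages"])
    mb_size = batch_size // num_minibatches
    for _ in range(update_epochs):
        inds = rng.permutation(batch_size)  # fresh shuffle each epoch
        approx_kl = 0.0
        for start in range(0, batch_size, mb_size):
            approx_kl = update_minibatch(batch, inds[start:start + mb_size])
        if target_kl is not None and approx_kl > target_kl:
            break  # early stop: the policy moved too far from the rollout policy
```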
Theoretical Basis
PPO Clipped Surrogate Objective (Schulman et al., 2017): The key insight of PPO is that the clipped objective provides a first-order approximation to the trust region constraint of TRPO. By clipping the probability ratio, PPO avoids the computational expense of second-order optimization while achieving similar performance stability. With epsilon = 0.2, once the probability ratio moves more than ~20% away from the old policy, the objective flattens and the update receives no further gradient signal in that direction.
Generalized Advantage Estimation (Schulman et al., 2015): GAE provides a family of advantage estimators parameterized by lambda in [0, 1]:
- lambda = 0: TD(0) advantage (low variance, high bias) -- A_t = delta_t
- lambda = 1: Monte Carlo advantage (high variance, low bias) -- A_t = sum of discounted delta terms
- 0 < lambda < 1: interpolates between the two extremes
The ManiSkill PPO baseline uses lambda = 0.9, which leans toward the Monte Carlo end but with significant variance reduction.
Advantage Normalization: Per-minibatch advantage normalization ((A - mean(A)) / (std(A) + eps)) stabilizes training by ensuring gradients have consistent scale regardless of the reward magnitude.
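As a one-function sketch (the name is illustrative):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-minibatch advantage normalization: zero mean, unit variance.

    eps guards against division by zero when all advantages are equal.
    """
    return (adv - adv.mean()) / (adv.std() + eps)
```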
KL Divergence Early Stopping: An optional safety mechanism that halts optimization epochs when the KL divergence between old and new policies exceeds a threshold (target_kl = 0.1). This prevents catastrophic policy updates even when the clipped objective alone is insufficient.
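The KL divergence is typically estimated from the same log-probability ratios used for the surrogate loss. A common choice (an assumption about the exact variant used here) is the (rho - 1) - log(rho) estimator, which is non-negative and low-variance:

```python
import numpy as np

def approx_kl(new_logprob, old_logprob):
    """Approximate KL(pi_old || pi_new) from per-sample log-prob differences."""
    logratio = new_logprob - old_logprob
    ratio = np.exp(logratio)
    # (rho - 1) - log(rho) >= 0, equals 0 when the policies agree
    return np.mean((ratio - 1.0) - logratio)
```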
Gradient Clipping: The maximum gradient norm (max_grad_norm = 0.5) prevents exploding gradients during backpropagation, which can occur when the policy is far from optimal or the reward signal is noisy.
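Global-norm gradient clipping can be sketched in NumPy, mirroring (under the stated assumptions) the behavior of PyTorch's torch.nn.utils.clip_grad_norm_:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Scale a list of gradient arrays so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the pre-clipping norm.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)  # small eps for numerical safety
        grads = [g * scale for g in grads]
    return grads, total_norm
```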
| Parameter | Default Value | Description |
|---|---|---|
| learning_rate | 3e-4 | Adam optimizer learning rate |
| gamma | 0.8 | Discount factor (shorter horizon than typical RL due to dense rewards) |
| gae_lambda | 0.9 | GAE lambda parameter for advantage estimation |
| clip_coef | 0.2 | PPO clipping coefficient (epsilon) |
| update_epochs | 4 | Number of optimization epochs per rollout batch |
| num_minibatches | 32 | Number of minibatches per epoch |
| vf_coef | 0.5 | Value function loss coefficient |
| ent_coef | 0.0 | Entropy bonus coefficient (no entropy bonus by default) |
| max_grad_norm | 0.5 | Maximum gradient norm for clipping |
| target_kl | 0.1 | KL divergence threshold for early stopping |
| reward_scale | 1.0 | Multiplier applied to rewards before optimization |
| num_steps | 50 | Rollout length per environment per iteration |
| norm_adv | True | Whether to normalize advantages per minibatch |
Note on gamma=0.8: The ManiSkill PPO baseline uses a relatively low discount factor compared to typical RL settings (gamma=0.99). This is because ManiSkill tasks often use dense reward functions with relatively short horizons (50-200 steps), where a lower gamma helps the agent focus on immediate rewards and converge faster.
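A quick back-of-the-envelope check of this choice: a reward k steps ahead is weighted by gamma^k, so the number of steps before a future reward's weight halves is log(0.5) / log(gamma):

```python
import math

def reward_half_life(gamma):
    """Steps until a future reward's discount weight falls to 0.5."""
    return math.log(0.5) / math.log(gamma)

# gamma = 0.8 gives a half-life of ~3 steps; gamma = 0.99 gives ~69 steps,
# consistent with ManiSkill's short, densely rewarded episodes.
```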
Related Pages
- Implementation:Haosulab_ManiSkill_PPO_Training_Loop -- Concrete implementation of the PPO optimization loop
- Principle:Haosulab_ManiSkill_GPU_Parallelized_Rollout -- How rollout data is collected before optimization
- Principle:Haosulab_ManiSkill_PPO_Agent_Architecture -- The neural network architecture being optimized
- Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing -- How optimization progress is measured