Principle:Haosulab ManiSkill PPO Policy Optimization
| Field | Value |
|---|---|
| principle_name | Haosulab_ManiSkill_PPO_Policy_Optimization |
| overview | Proximal Policy Optimization with clipped surrogate objective, Generalized Advantage Estimation, and value function fitting |
| domains | Reinforcement_Learning |
| last_updated | 2026-02-15 |
| related_pages | Implementation:Haosulab_ManiSkill_PPO_Training_Loop |
Overview
Description
Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that updates the policy using data collected from the current policy. It addresses the fundamental challenge of policy gradient methods: how to take the largest possible improvement step without causing the policy to degrade. PPO achieves this through a clipped surrogate objective that prevents excessively large policy updates.
The optimization phase in the ManiSkill PPO pipeline consists of four main components:
1. Generalized Advantage Estimation (GAE)
After collecting a rollout of transitions, advantages are computed using GAE, which provides a bias-variance tradeoff between Monte Carlo returns (low bias, high variance) and TD(0) estimates (high bias, low variance). GAE computes a weighted sum of n-step TD errors:
The temporal difference error at step t is:
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
The GAE advantage is computed recursively:
A_t = delta_t + (gamma * lambda) * (1 - done_{t+1}) * A_{t+1}
where gamma is the discount factor and lambda is the GAE lambda parameter. The (1 - done) term cuts the advantage propagation at episode boundaries.
Returns are then: G_t = A_t + V(s_t)
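The formulas above can be sketched as a backward recursion over a single environment's rollout. This is a minimal NumPy illustration (the function name is illustrative; the actual ManiSkill baseline vectorizes this computation across parallel environments on the GPU):

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, gamma=0.8, gae_lambda=0.9):
    """Backward-recursive GAE for one environment's rollout of T steps.

    rewards, values, dones: arrays of shape (T,); dones[t] marks whether the
    episode terminated after step t. next_value is V(s_T), the bootstrap value
    for the state following the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        v_next = next_value if t == T - 1 else values[t + 1]
        # TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
        # with V(s_{t+1}) zeroed at episode boundaries
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        # A_t = delta_t + gamma * lambda * (1 - done) * A_{t+1}
        last_adv = delta + gamma * gae_lambda * nonterminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values  # G_t = A_t + V(s_t)
    return advantages, returns
```

With gamma = lambda = 1 and no terminations, this reduces to Monte Carlo returns minus the value baseline, matching the limiting case described below.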
2. Clipped Surrogate Objective
The core PPO loss uses an importance sampling ratio between the new and old policies (written rho_t here to avoid clashing with the reward r_t):
rho_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) = exp(log_prob_new - log_prob_old)
The clipped surrogate loss is:
L^CLIP = E[ min(rho_t * A_t, clip(rho_t, 1 - epsilon, 1 + epsilon) * A_t) ]
This formulation ensures that when the advantage is positive (good action), the ratio is clipped from above at 1 + epsilon, preventing overly optimistic updates. When the advantage is negative (bad action), the ratio is clipped from below at 1 - epsilon, capping the benefit of pushing a bad action's probability down. The clipping coefficient epsilon (typically 0.2) controls the maximum effective policy change per update.
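A minimal sketch of the ratio and clipped loss, assuming log-probabilities are already gathered from the rollout (the function name and NumPy usage are illustrative; the real implementation operates on PyTorch tensors):

```python
import numpy as np

def clipped_surrogate_loss(new_logprob, old_logprob, advantages, clip_coef=0.2):
    """PPO clipped policy loss, negated so it can be minimized.

    Assumes advantages are already normalized; a sketch, not the exact
    ManiSkill implementation.
    """
    ratio = np.exp(new_logprob - old_logprob)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # Maximize min(unclipped, clipped) == minimize its negation.
    return -np.minimum(unclipped, clipped).mean()
```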
3. Combined Loss Function
The total loss combines three terms:
L = -L^CLIP - c_ent * H(pi) + c_vf * L^VF
where:
- L^CLIP: Clipped surrogate objective from above (negated, because the optimizer minimizes the total loss)
- H(pi): Entropy bonus (encourages exploration; subtracted so higher entropy lowers the loss)
- L^VF: Value function loss (MSE between critic predictions and computed returns)
- c_ent: Entropy coefficient (default: 0.0 in ManiSkill PPO)
- c_vf: Value function coefficient (default: 0.5)
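The combination of the three terms might be sketched as follows, assuming the policy loss is passed in already negated and using the common 0.5 * MSE convention for the value loss (an assumption here; variants with value-loss clipping are omitted):

```python
import numpy as np

def ppo_total_loss(pg_loss, entropy, value_pred, returns, ent_coef=0.0, vf_coef=0.5):
    """Combine the three PPO loss terms into the scalar the optimizer minimizes.

    pg_loss: already-negated clipped surrogate loss.
    entropy: mean policy entropy over the minibatch.
    """
    v_loss = 0.5 * np.mean((value_pred - returns) ** 2)
    return pg_loss - ent_coef * entropy + vf_coef * v_loss
```

With the ManiSkill default ent_coef = 0.0, the entropy term drops out and the loss is just the policy loss plus half the weighted value error.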
4. Minibatch Optimization
The collected batch of data is split into minibatches and optimized over multiple epochs. Each epoch shuffles the data and processes it in minibatches to improve sample efficiency while maintaining update stability.
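The shuffle-and-split scheme can be sketched as an index generator (an illustrative helper, not part of the ManiSkill codebase):

```python
import numpy as np

def minibatch_indices(batch_size, num_minibatches, rng):
    """Yield shuffled index arrays for one optimization epoch.

    Assumes batch_size is divisible by num_minibatches, as with the ManiSkill
    defaults where batch_size = num_envs * num_steps.
    """
    inds = rng.permutation(batch_size)
    mb_size = batch_size // num_minibatches
    for start in range(0, batch_size, mb_size):
        yield inds[start:start + mb_size]
```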
Usage
Use PPO policy optimization as the update phase of on-policy RL training. It is applied after rollout collection and advantage computation:
- Collect rollout data with the current policy (see GPU Parallelized Rollout)
- Compute GAE advantages and returns
- For update_epochs epochs:
  - Shuffle the batch
  - For each minibatch:
    - Re-evaluate actions under the current policy (get new log-probs, entropy, values)
    - Compute the importance sampling ratio and clipped surrogate loss
    - Compute the value loss and entropy bonus
    - Backpropagate the combined loss
    - Clip gradients and update parameters
  - Optionally check KL divergence for early stopping
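The overall control flow of these steps might look like the following sketch, with the per-minibatch work abstracted behind a hypothetical update_minibatch callback that performs the re-evaluation, loss computation, and parameter update, and returns the approximate KL divergence:

```python
import numpy as np

def ppo_update(batch, update_minibatch, update_epochs=4, num_minibatches=32,
               target_kl=0.1, seed=0):
    """Structural sketch of the PPO update phase (names are illustrative).

    batch: dict of flattened rollout arrays, e.g. batch["advantages"].
    update_minibatch(batch, inds): runs one minibatch step, returns approx KL.
    """
    rng = np.random.default_rng(seed)
    batch_size = len(batch["advantages"])
    mb_size = batch_size // num_minibatches
    for _ in range(update_epochs):
        inds = rng.permutation(batch_size)  # fresh shuffle each epoch
        approx_kl = 0.0
        for start in range(0, batch_size, mb_size):
            approx_kl = update_minibatch(batch, inds[start:start + mb_size])
        if target_kl is not None and approx_kl > target_kl:
            break  # early stop: the policy moved too far from the rollout policy
```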
Theoretical Basis
PPO Clipped Surrogate Objective (Schulman et al., 2017): The key insight of PPO is that the clipped objective provides a first-order approximation to the trust region constraint of TRPO. By clipping the probability ratio, PPO avoids the computational expense of second-order optimization while achieving similar performance stability. With epsilon = 0.2, once the probability ratio moves more than ~20% away from the old policy, the objective flattens and the update receives no further gradient signal in that direction.
Generalized Advantage Estimation (Schulman et al., 2015): GAE provides a family of advantage estimators parameterized by lambda in [0, 1]:
- lambda = 0: TD(0) advantage (low variance, high bias) -- A_t = delta_t
- lambda = 1: Monte Carlo advantage (high variance, low bias) -- A_t = sum of discounted delta terms
- 0 < lambda < 1: interpolates between the two extremes
The ManiSkill PPO baseline uses lambda = 0.9, which leans toward the Monte Carlo end but with significant variance reduction.
Advantage Normalization: Per-minibatch advantage normalization ((A - mean(A)) / (std(A) + eps)) stabilizes training by ensuring gradients have consistent scale regardless of the reward magnitude.
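As a one-function sketch (the name is illustrative):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-minibatch advantage normalization: zero mean, unit variance.

    eps guards against division by zero when all advantages are equal.
    """
    return (adv - adv.mean()) / (adv.std() + eps)
```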
KL Divergence Early Stopping: An optional safety mechanism that halts optimization epochs when the KL divergence between old and new policies exceeds a threshold (target_kl = 0.1). This prevents catastrophic policy updates even when the clipped objective alone is insufficient.
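The KL divergence is typically estimated from the same log-probability ratios used for the surrogate loss. A common choice (an assumption about the exact variant used here) is the (rho - 1) - log(rho) estimator, which is non-negative and low-variance:

```python
import numpy as np

def approx_kl(new_logprob, old_logprob):
    """Approximate KL(pi_old || pi_new) from per-sample log-prob differences."""
    logratio = new_logprob - old_logprob
    ratio = np.exp(logratio)
    # (rho - 1) - log(rho) >= 0, equals 0 when the policies agree
    return np.mean((ratio - 1.0) - logratio)
```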
Gradient Clipping: The maximum gradient norm (max_grad_norm = 0.5) prevents exploding gradients during backpropagation, which can occur when the policy is far from optimal or the reward signal is noisy.
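Global-norm gradient clipping can be sketched in NumPy, mirroring (under the stated assumptions) the behavior of PyTorch's torch.nn.utils.clip_grad_norm_:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Scale a list of gradient arrays so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the pre-clipping norm.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)  # small eps for numerical safety
        grads = [g * scale for g in grads]
    return grads, total_norm
```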
| Parameter | Default Value | Description |
|---|---|---|
| learning_rate | 3e-4 | Adam optimizer learning rate |
| gamma | 0.8 | Discount factor (shorter horizon than typical RL due to dense rewards) |
| gae_lambda | 0.9 | GAE lambda parameter for advantage estimation |
| clip_coef | 0.2 | PPO clipping coefficient (epsilon) |
| update_epochs | 4 | Number of optimization epochs per rollout batch |
| num_minibatches | 32 | Number of minibatches per epoch |
| vf_coef | 0.5 | Value function loss coefficient |
| ent_coef | 0.0 | Entropy bonus coefficient (no entropy bonus by default) |
| max_grad_norm | 0.5 | Maximum gradient norm for clipping |
| target_kl | 0.1 | KL divergence threshold for early stopping |
| reward_scale | 1.0 | Multiplier applied to rewards before optimization |
| num_steps | 50 | Rollout length per environment per iteration |
| norm_adv | True | Whether to normalize advantages per minibatch |
Note on gamma=0.8: The ManiSkill PPO baseline uses a relatively low discount factor compared to typical RL settings (gamma=0.99). This is because ManiSkill tasks often use dense reward functions with relatively short horizons (50-200 steps), where a lower gamma helps the agent focus on immediate rewards and converge faster.
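A quick back-of-the-envelope check of this choice: a reward k steps ahead is weighted by gamma^k, so the number of steps before a future reward's weight halves is log(0.5) / log(gamma):

```python
import math

def reward_half_life(gamma):
    """Steps until a future reward's discount weight falls to 0.5."""
    return math.log(0.5) / math.log(gamma)

# gamma = 0.8 gives a half-life of ~3 steps; gamma = 0.99 gives ~69 steps,
# consistent with ManiSkill's short, densely rewarded episodes.
```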
Related Pages
- Implementation:Haosulab_ManiSkill_PPO_Training_Loop -- Concrete implementation of the PPO optimization loop
- Principle:Haosulab_ManiSkill_GPU_Parallelized_Rollout -- How rollout data is collected before optimization
- Principle:Haosulab_ManiSkill_PPO_Agent_Architecture -- The neural network architecture being optimized
- Principle:Haosulab_ManiSkill_RL_Evaluation_Checkpointing -- How optimization progress is measured