Principle:Volcengine Verl SAPO Algorithm
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization, RLHF |
| Last Updated | 2026-02-07 18:00 GMT |
Overview
A policy optimization algorithm that replaces PPO's hard clipping with smooth exponential advantage weighting controlled by asymmetric temperature parameters.
Description
Smooth Advantage Policy Optimization (SAPO) is a reinforcement learning algorithm that addresses a limitation of PPO's clipped surrogate objective. Instead of hard-clipping the probability ratio to [1-epsilon, 1+epsilon], SAPO applies a smooth exponential function to the advantage, using two separate temperature parameters:
- tau_pos — Controls the weight of positive advantages (good actions)
- tau_neg — Controls the weight of negative advantages (bad actions)
When tau_neg > tau_pos (the recommended setting from the paper), the algorithm applies stronger correction to poor actions than reward to good ones, creating an asymmetric optimization pressure. This avoids the discontinuity of PPO clipping while maintaining stable policy updates.
SAPO is implemented in verl by setting loss_mode=sapo in the actor configuration, which switches the policy loss computation from clipped surrogate to smooth exponential weighting.
Usage
Use this principle when training language models with RL and you want smoother optimization dynamics than PPO clipping provides. It is particularly suited for large MoE models (e.g., Qwen3-30B-A3B) where the discontinuity of PPO clipping can cause unstable training.
Theoretical Basis
The SAPO loss replaces PPO's clipped objective with:
Pseudo-code Logic:
# Abstract algorithm (NOT real implementation)
ratio = new_policy_prob / old_policy_prob
advantage = compute_advantage(rewards, values)
# Instead of PPO clipping:
# clipped_ratio = clip(ratio, 1-eps, 1+eps)
# loss = -min(ratio * advantage, clipped_ratio * advantage)
# SAPO uses smooth exponential weighting:
if advantage >= 0:
weight = exp(advantage / tau_pos)
else:
weight = exp(advantage / tau_neg)
loss = -weight * ratio * advantage
The key insight is that with tau_neg > tau_pos, negative advantages receive stronger gradients, making the policy more conservative about generating low-quality outputs.