Principle:Volcengine Verl SAPO Algorithm

Knowledge Sources	SAPO Volcengine_Verl
Domains	Reinforcement_Learning, Policy_Optimization, RLHF
Last Updated	2026-02-07 18:00 GMT

Overview

A policy optimization algorithm that replaces PPO's hard clipping with a smooth, temperature-controlled weighting of each update, using separate temperature parameters for positive and negative advantages.

Description

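Soft Adaptive Policy Optimization (SAPO) is a reinforcement learning algorithm that addresses a limitation of PPO's clipped surrogate objective: once the probability ratio crosses the clipping boundary of [1-epsilon, 1+epsilon] in the direction favored by the advantage, that token's gradient drops abruptly to zero. Instead of hard-clipping the ratio, SAPO passes it through a smooth sigmoid gate centered at ratio = 1, so the gradient contribution decays gradually as the new policy drifts from the old one. The temperature of the gate is selected by the sign of the advantage:

tau_pos: Temperature of the gate for tokens with positive advantages (good actions)
tau_neg: Temperature of the gate for tokens with negative advantages (bad actions)

A larger temperature makes the gate's gradient fall off faster as the ratio moves away from 1. When tau_neg > tau_pos (the recommended setting from the paper), updates driven by negative advantages are damped more quickly than updates driven by positive ones, creating an asymmetric optimization pressure. This avoids the discontinuity of PPO clipping while maintaining stable policy updates.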
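
To make the effect of the two temperatures concrete, the sketch below tabulates the gradient weight of a sigmoid gate at a few ratios. It assumes the gate form f(r) = (4 / tau) * sigmoid(tau * (r - 1)) from the SAPO paper; the function names and the temperature values are illustrative choices, not taken from verl.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate(ratio: float, tau: float) -> float:
    """Soft gate f(r) = (4 / tau) * sigmoid(tau * (r - 1)) (assumed form)."""
    return (4.0 / tau) * sigmoid(tau * (ratio - 1.0))


def gate_grad(ratio: float, tau: float) -> float:
    """Derivative of the gate w.r.t. the ratio: 4 * p * (1 - p)."""
    p = sigmoid(tau * (ratio - 1.0))
    return 4.0 * p * (1.0 - p)


# Illustrative temperatures with tau_neg > tau_pos (assumed values).
TAU_POS, TAU_NEG = 1.0, 1.05

for r in (0.5, 1.0, 1.5, 3.0):
    w_pos = gate_grad(r, TAU_POS)
    w_neg = gate_grad(r, TAU_NEG)
    print(f"ratio={r:.1f}  weight(tau_pos)={w_pos:.4f}  weight(tau_neg)={w_neg:.4f}")
```

At ratio = 1 the weight is exactly 1 for any temperature, so on-policy tokens are treated like the unclipped objective. Away from 1, the larger temperature always yields the smaller weight, which is how tau_neg > tau_pos damps negative-advantage updates more strongly.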

SAPO is implemented in verl by setting loss_mode=sapo in the actor configuration, which switches the policy loss computation from the clipped surrogate to SAPO's smooth, temperature-controlled weighting.

Usage

Use this principle when you are training language models with RL and want smoother optimization dynamics than PPO clipping provides. It is particularly suited for large MoE models (e.g., Qwen3-30B-A3B), where the discontinuity of PPO clipping can cause unstable training.

Theoretical Basis

The SAPO loss replaces PPO's clipped objective with:

Pseudo-code Logic:

# Abstract algorithm (NOT real implementation)
ratio = new_policy_prob / old_policy_prob
advantage = compute_advantage(rewards, values)

# Instead of PPO clipping:
#   clipped_ratio = clip(ratio, 1-eps, 1+eps)
#   loss = -min(ratio * advantage, clipped_ratio * advantage)

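# SAPO uses a smooth sigmoid gate on the ratio, with the
# temperature chosen by the sign of the advantage:
if advantage > 0:
    tau = tau_pos
else:
    tau = tau_neg

gate = (4 / tau) * sigmoid(tau * (ratio - 1))

loss = -gate * advantage

The key insight is that the gate's gradient with respect to the ratio is 4 * p * (1 - p), where p = sigmoid(tau * (ratio - 1)). It equals 1 at ratio = 1, so on-policy tokens receive the same gradient as the unclipped objective, and it decays smoothly toward 0 as the ratio moves away from 1. With tau_neg > tau_pos, this decay is faster for negative advantages, making the policy more conservative about off-policy updates that push probabilities down.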
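
The pseudo-code can be turned into a small runnable comparison. The following is a minimal sketch in plain Python, assuming the sigmoid-gate form of the SAPO objective from the paper and a token-mean aggregation; the helper names (sapo_token_loss, ppo_token_loss, masked_mean), the default temperatures, and the toy batch are all hypothetical and are not verl's implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sapo_token_loss(log_prob, old_log_prob, advantage, tau_pos=1.0, tau_neg=1.05):
    """Per-token SAPO-style loss: -gate(ratio) * advantage (assumed form)."""
    ratio = math.exp(log_prob - old_log_prob)
    # Temperature is selected by the sign of the advantage.
    tau = tau_pos if advantage > 0 else tau_neg
    gate = (4.0 / tau) * sigmoid(tau * (ratio - 1.0))
    return -gate * advantage


def ppo_token_loss(log_prob, old_log_prob, advantage, eps=0.2):
    """Per-token PPO clipped surrogate loss, for comparison."""
    ratio = math.exp(log_prob - old_log_prob)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)


def masked_mean(values, mask):
    """Average over tokens whose mask entry is non-zero."""
    total = sum(v for v, m in zip(values, mask) if m)
    count = sum(1 for m in mask if m)
    return total / max(count, 1)


# Toy batch of four tokens; the last one is padding (mask = 0).
log_probs = [-1.0, -0.5, -2.0, -0.7]
old_log_probs = [-1.2, -0.5, -1.0, -0.7]
advantages = [0.8, 0.8, -0.5, -0.5]
mask = [1, 1, 1, 0]

batch = list(zip(log_probs, old_log_probs, advantages))
sapo = masked_mean([sapo_token_loss(lp, olp, a) for lp, olp, a in batch], mask)
ppo = masked_mean([ppo_token_loss(lp, olp, a) for lp, olp, a in batch], mask)
print(f"SAPO-style loss: {sapo:.4f}")
print(f"PPO clipped loss: {ppo:.4f}")
```

The two loss values are not directly comparable: the gate evaluates to 2 / tau at ratio = 1 rather than 1, so only the gradients match PPO on-policy. The useful comparison is how each loss responds as log_prob moves away from old_log_prob.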

Related Pages

Implementation:Volcengine_Verl_SAPO_Training_Script
