Implementation:Volcengine Verl SAPO Training Script

Knowledge Sources	Volcengine_Verl SAPO
Domains	Reinforcement_Learning, Training_Scripts, RLHF
Last Updated	2026-02-07 18:00 GMT

Overview

Concrete tool for running SAPO (Soft Adaptive Policy Optimization) training on Qwen3-30B-A3B-Base using the verl framework with Slurm and Ray orchestration.

Description

This shell script provides a complete training configuration for the SAPO algorithm on the Qwen3-30B-A3B-Base model. SAPO replaces the hard clipping used in standard PPO with a smooth, temperature-controlled gate on the importance ratio. The gate uses separate temperatures for tokens with positive and negative advantages (tau_pos, tau_neg). The script handles:

Slurm job submission with multi-node Ray cluster initialization
Dataset preparation (DAPO-Math-17k for training, AIME-2024 for testing)
FSDP or Megatron-LM backend selection for distributed training
vLLM-based async rollout generation with configurable tensor/data/expert parallelism
DAPO reward manager with overlong buffer penalty configuration
WandB experiment tracking
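
The overlong buffer penalty in the list above can be illustrated with a small sketch. It assumes the linear soft length punishment described for DAPO: no penalty up to (max length minus buffer), then a penalty that grows linearly to -factor at the maximum length. The function name overlong_penalty and all numeric values are hypothetical and are not taken from the script.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: a linear overlong penalty in the style of DAPO's
# soft length punishment. The helper name and example lengths are assumptions.
overlong_penalty() {
    local response_len=$1 max_len=$2 buffer_len=$3 factor=${4:-1.0}
    awk -v len="$response_len" -v max="$max_len" -v buf="$buffer_len" -v f="$factor" 'BEGIN {
        expected = max - buf            # length at which the penalty starts
        exceed = len - expected         # tokens past the penalty-free zone
        if (exceed <= 0) {
            p = 0                       # short responses are not penalized
        } else {
            p = -(exceed / buf) * f     # grows linearly inside the buffer
        }
        printf "%.2f\n", p
    }'
}

overlong_penalty 1000 8192 4096   # well inside the limit
overlong_penalty 6144 8192 4096   # halfway through the buffer
```

With these example values, a response halfway through the buffer receives half of the maximum penalty.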

Usage

Use this script when training a large MoE language model (Qwen3-30B-A3B) with the SAPO algorithm on a Slurm-managed GPU cluster. It serves as the reference example for SAPO configuration in verl, showing how to select the SAPO loss (loss_mode=sapo) and set its two temperature parameters (tau_pos, tau_neg).

Code Reference

Source Location

Repository: Volcengine_Verl
File: examples/sapo_trainer/run_qwen30b_sapo.sh
Lines: 1-373

Signature

#!/bin/bash
#SBATCH --job-name=sapo-30B
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gres=gpu:8
#SBATCH --gpus-per-node=8

# Key algorithm parameters:
adv_estimator=grpo
loss_mode=sapo           # SAPO uses smoothing, not clipping
tau_pos=1.0              # Positive advantage temperature
tau_neg=1.05             # Negative advantage temperature

# Training launch:
python -m verl.trainer.main_ppo \
    --config-path=./config \
    --config-name=$CONFIG_NAME \
    algorithm.adv_estimator=$adv_estimator \
    actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
    actor_rollout_ref.actor.tau_pos=$tau_pos \
    actor_rollout_ref.actor.tau_neg=$tau_neg \
    ...
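
Before submitting a long job, it can help to print the SAPO overrides and check them by eye. The following is a minimal sketch of that idea. It uses only the override keys shown in the excerpt above. The helper name build_sapo_overrides is hypothetical and does not exist in the script.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the SAPO-related overrides from the
# excerpt above, one per line, instead of launching training.
adv_estimator=grpo
loss_mode=sapo
tau_pos=1.0
tau_neg=1.05

build_sapo_overrides() {
    printf '%s\n' \
        "algorithm.adv_estimator=${adv_estimator}" \
        "actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode}" \
        "actor_rollout_ref.actor.tau_pos=${tau_pos}" \
        "actor_rollout_ref.actor.tau_neg=${tau_neg}"
}

# Dry run: inspect the values that would be passed to the trainer.
build_sapo_overrides
```

To reuse the same list in a launch command, the output can be read into a bash array with mapfile -t overrides < <(build_sapo_overrides) and expanded as "${overrides[@]}".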

Import

# Shell script — invoked via sbatch or bash:
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh
# or
bash examples/sapo_trainer/run_qwen30b_sapo.sh

I/O Contract

Inputs

Name	Type	Required	Description
WANDB_API_KEY	env var	Yes	WandB API key for experiment logging
DATA_ROOT	env var	No	Root directory for datasets and checkpoints (defaults to PWD)
actor_model_path	config	Yes	HuggingFace model path (default: Qwen/Qwen3-30B-A3B-Base)
loss_mode	config	Yes	Must be "sapo" to select the SAPO policy loss (default: sapo)
tau_pos	float	Yes	Positive advantage smoothing temperature (default: 1.0)
tau_neg	float	Yes	Negative advantage smoothing temperature (default: 1.05)
train_files	path	Yes	Path to training parquet (DAPO-Math-17k)
test_files	path	Yes	Path to test parquet (AIME-2024)
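
The environment inputs in the table above can be checked before submission. This is a minimal pre-flight sketch, assuming only what the table states: WANDB_API_KEY is required and DATA_ROOT falls back to the current directory. The function check_sapo_inputs is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the documented environment inputs.
check_sapo_inputs() {
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "error: WANDB_API_KEY is not set" >&2
        return 1
    fi
    # Mirror the documented default: fall back to the current directory.
    DATA_ROOT="${DATA_ROOT:-$PWD}"
    if [ ! -d "$DATA_ROOT" ]; then
        echo "error: DATA_ROOT does not exist: $DATA_ROOT" >&2
        return 1
    fi
    echo "DATA_ROOT=$DATA_ROOT"
}
```

A typical use is check_sapo_inputs && sbatch examples/sapo_trainer/run_qwen30b_sapo.sh, so that a missing key fails fast instead of after the job has been queued.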

Outputs

Name	Type	Description
checkpoint/	directory	Model checkpoints saved at configured frequency
WandB logs	metrics	Training metrics logged to WandB project
stdout/stderr	logs	Slurm job output files in logs/sapo/30B/

Usage Examples

Basic Slurm Submission

# Submit SAPO training job on Slurm cluster
export WANDB_API_KEY=your_key_here
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh

Local Single-Node Run

# Run locally without Slurm (ensure Ray and 8 GPUs available)
export WANDB_API_KEY=your_key_here
export DATA_ROOT=/data/experiments
bash examples/sapo_trainer/run_qwen30b_sapo.sh

Key SAPO Parameters

# The SAPO-specific parameters in the script:
loss_mode=sapo       # Use the SAPO soft gate instead of PPO hard clipping
tau_pos=1.0          # Temperature for tokens with positive advantage
tau_neg=1.05         # Temperature for tokens with negative advantage (slightly > tau_pos)
# Per the paper (arXiv:2511.20347), tau_neg > tau_pos makes the gate fall off
# faster for negative-advantage tokens, so their off-policy updates are damped
# more strongly than those of positive-advantage tokens.
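
The effect of the two temperatures can be checked numerically. The sketch below assumes the soft gate has the form f(r) = (4 / tau) * sigmoid(tau * (r - 1)), where r is the importance ratio. The gradient weight is then 4 * s * (1 - s) with s = sigmoid(tau * (r - 1)). This form is an assumption based on the paper's description and is not read from the verl source. The helper sapo_gate_weight is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch under stated assumptions: gradient weight of a sigmoid soft gate
# f(r) = (4 / tau) * sigmoid(tau * (r - 1)). Its derivative with respect
# to r is 4 * s * (1 - s), where s = sigmoid(tau * (r - 1)).
sapo_gate_weight() {
    local tau=$1 ratio=$2
    awk -v tau="$tau" -v r="$ratio" 'BEGIN {
        s = 1 / (1 + exp(-tau * (r - 1)))
        printf "%.4f\n", 4 * s * (1 - s)
    }'
}

sapo_gate_weight 1.0 1.0     # on-policy token (ratio 1): prints 1.0000
sapo_gate_weight 1.0 1.5     # tau_pos at an off-policy ratio
sapo_gate_weight 1.05 1.5    # tau_neg at the same ratio gives a smaller weight
```

Under this assumed form, the weight is 1 at ratio 1 for any temperature and decays as the ratio moves away from 1. A larger temperature makes the decay faster.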
