Implementation:Volcengine Verl SAPO Training Script
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Training_Scripts, RLHF |
| Last Updated | 2026-02-07 18:00 GMT |
Overview
Concrete tool for running SAPO (Soft Adaptive Policy Optimization) training on Qwen3-30B-A3B-Base using the verl framework with Slurm multi-node orchestration.
Description
This shell script provides a complete, production-ready training configuration for the SAPO algorithm on the Qwen3-30B-A3B-Base model. SAPO replaces the hard clipping of standard PPO with a smooth, temperature-controlled gate on the policy ratio, using separate temperatures for positive and negative advantages (tau_pos, tau_neg). The script handles:
- Slurm job submission with multi-node Ray cluster initialization
- Dataset preparation (DAPO-Math-17k for training, AIME-2024 for testing)
- FSDP or Megatron-LM backend selection for distributed training
- vLLM-based async rollout generation with configurable tensor/data/expert parallelism
- DAPO reward manager with overlong buffer penalty configuration
- WandB experiment tracking
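The multi-node Ray initialization listed above typically follows a head-then-workers pattern under Slurm. The sketch below is not taken from run_qwen30b_sapo.sh; it only prints the kind of `srun` and `ray start` commands such a job might issue, so it can be read and run without a cluster. The function name `print_ray_commands`, the port 6379, and the node names are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper (not part of the verl script): print, rather than run,
# the commands that start a Ray head on the first node and workers on the rest.
# Usage: print_ray_commands PORT HEAD_NODE [WORKER_NODE...]
print_ray_commands() {
    local port="$1"; shift
    local head="$1"; shift
    # Head node: starts the Ray control process that workers connect to.
    echo "srun --nodes=1 --ntasks=1 -w ${head} ray start --head --port=${port} --block &"
    local worker
    for worker in "$@"; do
        # Each worker joins the cluster through the head node's address.
        echo "srun --nodes=1 --ntasks=1 -w ${worker} ray start --address=${head}:${port} --block &"
    done
}

# Inside a real job the node list would come from Slurm, for example:
#   nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
#   print_ray_commands 6379 $nodes
print_ray_commands 6379 node-a node-b
```

Printing the commands first makes it easy to check node ordering and the head address before anything is launched; whether hostnames or IP addresses are appropriate for `--address` depends on the cluster's name resolution.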
Usage
Use this script when training a large MoE language model (Qwen3-30B-A3B) with the SAPO algorithm on a Slurm-managed GPU cluster. It serves as the reference example for SAPO configuration in verl, demonstrating how to set the temperature parameters (tau_pos, tau_neg) together with loss_mode=sapo.
Code Reference
Source Location
- Repository: Volcengine_Verl
- File: examples/sapo_trainer/run_qwen30b_sapo.sh
- Lines: 1-373
Signature
#!/bin/bash
#SBATCH --job-name=sapo-30B
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gres=gpu:8
#SBATCH --gpus-per-node=8
# Key algorithm parameters:
adv_estimator=grpo
loss_mode=sapo # SAPO uses smoothing, not clipping
tau_pos=1.0 # Positive advantage temperature
tau_neg=1.05 # Negative advantage temperature
# Training launch:
python -m verl.trainer.main_ppo \
--config-path=./config \
--config-name=$CONFIG_NAME \
algorithm.adv_estimator=$adv_estimator \
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
actor_rollout_ref.actor.tau_pos=$tau_pos \
actor_rollout_ref.actor.tau_neg=$tau_neg \
...
Import
# Shell script — invoked via sbatch or bash:
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh
# or
bash examples/sapo_trainer/run_qwen30b_sapo.sh
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| WANDB_API_KEY | env var | Yes | WandB API key for experiment logging |
| DATA_ROOT | env var | No | Root directory for datasets and checkpoints (defaults to PWD) |
| actor_model_path | config | Yes | HuggingFace model path (default: Qwen/Qwen3-30B-A3B-Base) |
| loss_mode | config | Yes | Must be "sapo" to enable smooth advantage (default: sapo) |
| tau_pos | float | Yes | Positive advantage smoothing temperature (default: 1.0) |
| tau_neg | float | Yes | Negative advantage smoothing temperature (default: 1.05) |
| train_files | path | Yes | Path to training parquet (DAPO-Math-17k) |
| test_files | path | Yes | Path to test parquet (AIME-2024) |
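The required and optional inputs in the table can be checked before submitting a job. The following is a minimal sketch of that handling, assuming only what the table states (WANDB_API_KEY is required, DATA_ROOT defaults to the current directory, the model path defaults to Qwen/Qwen3-30B-A3B-Base). The function name `resolve_inputs` is hypothetical and the code is not copied from the script.

```shell
#!/bin/bash
# Hypothetical input check mirroring the table above; not from the verl script.
resolve_inputs() {
    # WANDB_API_KEY is required: fail early with a clear message.
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "error: WANDB_API_KEY is not set" >&2
        return 1
    fi
    # DATA_ROOT is optional and falls back to the current directory.
    DATA_ROOT="${DATA_ROOT:-$PWD}"
    # The model path has a default but can be overridden by the caller.
    actor_model_path="${actor_model_path:-Qwen/Qwen3-30B-A3B-Base}"
    echo "DATA_ROOT=${DATA_ROOT}"
    echo "actor_model_path=${actor_model_path}"
}

# Example: WANDB_API_KEY=your_key_here resolve_inputs
```

Failing before submission avoids spending queue time on a job that would stop at the first logging call.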
Outputs
| Name | Type | Description |
|---|---|---|
| checkpoint/ | directory | Model checkpoints saved at configured frequency |
| WandB logs | metrics | Training metrics logged to WandB project |
| stdout/stderr | logs | Slurm job output files in logs/sapo/30B/ |
Usage Examples
Basic Slurm Submission
# Submit SAPO training job on Slurm cluster
export WANDB_API_KEY=your_key_here
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh
Local Single-Node Run
# Run locally without Slurm (ensure Ray and 8 GPUs are available)
export WANDB_API_KEY=your_key_here
export DATA_ROOT=/data/experiments
bash examples/sapo_trainer/run_qwen30b_sapo.sh
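Because the local run depends on Ray and 8 GPUs being present, a short preflight check can catch a misconfigured node before the trainer starts. This is a sketch under stated assumptions: `preflight` and `list_gpus` are hypothetical helpers, not part of the verl script, and the GPU count relies on `nvidia-smi -L` printing one line per visible device.

```shell
#!/bin/bash
# Hypothetical preflight check; not part of run_qwen30b_sapo.sh.

# List visible GPUs, one per line (wrapper so it can be swapped out in tests).
list_gpus() {
    nvidia-smi -L 2>/dev/null
}

# Usage: preflight [REQUIRED_GPU_COUNT]   (defaults to 8)
preflight() {
    local required_gpus="${1:-8}"
    # The ray CLI must be on PATH for a local cluster to start.
    if ! command -v ray >/dev/null 2>&1; then
        echo "error: ray CLI not found" >&2
        return 1
    fi
    # Count lines that look like "GPU 0: ..."; grep -c prints 0 when none match.
    local found
    found="$(list_gpus | grep -c '^GPU ' || true)"
    if [ "$found" -lt "$required_gpus" ]; then
        echo "error: found ${found} GPUs, need ${required_gpus}" >&2
        return 1
    fi
    echo "ok: ${found} GPUs visible, ray available"
}

# Example: preflight 8 && bash examples/sapo_trainer/run_qwen30b_sapo.sh
```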
Key SAPO Parameters
# The SAPO-specific parameters in the script:
loss_mode=sapo # Use smooth advantage instead of PPO clipping
tau_pos=1.0 # Temperature for positive advantages
tau_neg=1.05 # Temperature for negative advantages (slightly > tau_pos)
# Per the paper (arXiv:2511.20347), tau_neg > tau_pos makes the smoothing
# asymmetric: updates on negative-advantage tokens are attenuated faster as
# the policy ratio moves away from 1, which is intended to improve stability.
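The effect of the two temperatures is easier to see from the gate itself. The block below is a sketch of the per-token gate as described in the cited paper; its exact form should be treated as an assumption and checked against the paper and the verl source. The symbols r (token-level policy ratio), \hat{A} (advantage), \sigma (logistic sigmoid), and f are notation introduced here.

```latex
% Assumed form of the SAPO soft gate; verify against arXiv:2511.20347.
f_{\tau}(r) = \frac{4}{\tau}\,\sigma\bigl(\tau\,(r - 1)\bigr),
\qquad
\tau =
\begin{cases}
\tau_{\text{pos}}, & \hat{A} > 0,\\
\tau_{\text{neg}}, & \text{otherwise.}
\end{cases}

% Differentiating with respect to r gives the weight applied to each token's gradient:
\frac{\partial f_{\tau}}{\partial r}
  = 4\,\sigma\bigl(\tau\,(r - 1)\bigr)\Bigl(1 - \sigma\bigl(\tau\,(r - 1)\bigr)\Bigr)
```

Under this form the weight equals 1 at r = 1 for either temperature and falls toward 0 as r moves away from 1. A larger temperature makes it fall faster, so with tau_neg=1.05 and tau_pos=1.0 the negative-advantage tokens are damped slightly more quickly than the positive ones.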