Implementation:Volcengine Verl SAPO Training Script

From Leeroopedia
Revision as of 17:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Volcengine_Verl_SAPO_Training_Script.md)


Knowledge Sources
Domains: Reinforcement_Learning, Training_Scripts, RLHF
Last Updated: 2026-02-07 18:00 GMT

Overview

A concrete tool for running SAPO (Soft Adaptive Policy Optimization) training on Qwen3-30B-A3B-Base using the verl framework with Slurm multi-node orchestration.

Description

This shell script provides a complete training configuration for the SAPO algorithm on the Qwen3-30B-A3B-Base model. SAPO replaces PPO's hard clipping with a smooth, temperature-controlled gate on the importance ratio, using separate temperatures for tokens with positive and negative advantages (tau_pos, tau_neg). The script handles:

  • Slurm job submission with multi-node Ray cluster initialization
  • Dataset preparation (DAPO-Math-17k for training, AIME-2024 for testing)
  • FSDP or Megatron-LM backend selection for distributed training
  • vLLM-based async rollout generation with configurable tensor/data/expert parallelism
  • DAPO reward manager with overlong buffer penalty configuration
  • WandB experiment tracking

Usage

Use this script when training a large MoE language model (Qwen3-30B-A3B) with the SAPO algorithm on a Slurm-managed GPU cluster. It serves as the reference example for SAPO configuration in verl, demonstrating how to set loss_mode=sapo and the two gate temperatures (tau_pos, tau_neg).

Code Reference

Source Location

File: examples/sapo_trainer/run_qwen30b_sapo.sh

Signature

#!/bin/bash
#SBATCH --job-name=sapo-30B
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gres=gpu:8
#SBATCH --gpus-per-node=8

# Key algorithm parameters:
adv_estimator=grpo
loss_mode=sapo           # SAPO uses smoothing, not clipping
tau_pos=1.0              # Positive advantage temperature
tau_neg=1.05             # Negative advantage temperature

# Training launch:
python -m verl.trainer.main_ppo \
    --config-path=./config \
    --config-name=$CONFIG_NAME \
    algorithm.adv_estimator=$adv_estimator \
    actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
    actor_rollout_ref.actor.tau_pos=$tau_pos \
    actor_rollout_ref.actor.tau_neg=$tau_neg \
    ...
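
The overrides in the signature can be collected in a Bash array, which keeps each Hydra override a single argument and makes it easy to print the set before launching. This is a minimal sketch rather than the script itself: `build_sapo_overrides` and `SAPO_OVERRIDES` are hypothetical names, and only the four override keys shown above are included (the elided arguments are left out).

```bash
#!/usr/bin/env bash
# Sketch: collect the SAPO overrides from the signature into an array.
# build_sapo_overrides and SAPO_OVERRIDES are illustrative names, not part of verl.
adv_estimator=grpo
loss_mode=sapo
tau_pos=1.0
tau_neg=1.05

build_sapo_overrides() {
    SAPO_OVERRIDES=(
        "algorithm.adv_estimator=${adv_estimator}"
        "actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode}"
        "actor_rollout_ref.actor.tau_pos=${tau_pos}"
        "actor_rollout_ref.actor.tau_neg=${tau_neg}"
    )
}

build_sapo_overrides
# Dry run: print one override per line instead of launching training.
printf '%s\n' "${SAPO_OVERRIDES[@]}"
```

For a real launch, the array would be expanded after the config arguments, for example `python -m verl.trainer.main_ppo --config-path=./config --config-name="$CONFIG_NAME" "${SAPO_OVERRIDES[@]}"`.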

Import

# Shell script — invoked via sbatch or bash:
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh
# or
bash examples/sapo_trainer/run_qwen30b_sapo.sh

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| WANDB_API_KEY | env var | Yes | WandB API key for experiment logging |
| DATA_ROOT | env var | No | Root directory for datasets and checkpoints (defaults to PWD) |
| actor_model_path | config | Yes | HuggingFace model path (default: Qwen/Qwen3-30B-A3B-Base) |
| loss_mode | config | Yes | Must be "sapo" to select the SAPO loss (default: sapo) |
| tau_pos | float | Yes | Gate temperature for tokens with positive advantage (default: 1.0) |
| tau_neg | float | Yes | Gate temperature for tokens with negative advantage (default: 1.05) |
| train_files | path | Yes | Path to training parquet (DAPO-Math-17k) |
| test_files | path | Yes | Path to test parquet (AIME-2024) |
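
The inputs above can be checked before a job is queued, so that a missing key or dataset fails immediately rather than after the allocation starts. The following is a hedged sketch: `check_sapo_inputs` and its two positional arguments are hypothetical, and the real script may resolve dataset paths differently.

```bash
#!/usr/bin/env bash
# Sketch: validate the documented inputs before submitting.
# check_sapo_inputs is an illustrative helper, not part of the training script.
check_sapo_inputs() {
    local train_file="$1" test_file="$2"
    local status=0

    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set" >&2
        status=1
    fi

    # DATA_ROOT is optional; fall back to the current directory as documented.
    DATA_ROOT="${DATA_ROOT:-$PWD}"

    local f
    for f in "$train_file" "$test_file"; do
        if [ ! -f "$f" ]; then
            echo "missing parquet file: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A call such as `check_sapo_inputs "$train_files" "$test_files" || exit 1` placed before `sbatch` would stop the submission when any check fails.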

Outputs

| Name | Type | Description |
|---|---|---|
| checkpoint/ | directory | Model checkpoints saved at configured frequency |
| WandB logs | metrics | Training metrics logged to WandB project |
| stdout/stderr | logs | Slurm job output files in logs/sapo/30B/ |
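
To follow a running job, it helps to find the most recently written file in the log directory. This sketch assumes only the directory named above; the file naming scheme is not documented here, so `latest_sapo_log` (a hypothetical helper) picks the newest regular file by modification time.

```bash
#!/usr/bin/env bash
# Sketch: print the newest regular file in the Slurm log directory.
# latest_sapo_log is an illustrative helper; the default path comes from the table above.
latest_sapo_log() {
    local log_dir="${1:-logs/sapo/30B}"
    local newest="" f
    for f in "$log_dir"/*; do
        [ -f "$f" ] || continue
        # -nt compares modification times.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    if [ -z "$newest" ]; then
        return 1   # no log files found
    fi
    printf '%s\n' "$newest"
}
```

For example, `tail -f "$(latest_sapo_log)"` would stream the newest log from the default directory.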

Usage Examples

Basic Slurm Submission

# Submit SAPO training job on Slurm cluster
export WANDB_API_KEY=your_key_here
sbatch examples/sapo_trainer/run_qwen30b_sapo.sh
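
When the submission is scripted, the job ID can be captured with `sbatch --parsable`, which prints the job ID followed by an optional `;cluster` suffix. The helpers below are a sketch under that assumption; `parse_job_id` and `submit_sapo_job` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: submit the job and capture its ID for later monitoring.
# parse_job_id and submit_sapo_job are illustrative helpers.

# Strip the optional ";cluster" suffix from `sbatch --parsable` output.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

submit_sapo_job() {
    local script="${1:-examples/sapo_trainer/run_qwen30b_sapo.sh}"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; is Slurm available on this host?" >&2
        return 1
    fi
    local out
    out="$(sbatch --parsable "$script")" || return 1
    parse_job_id "$out"
}
```

A typical follow-up would be `job_id=$(submit_sapo_job) && squeue -j "$job_id"` to confirm the job is queued.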

Local Single-Node Run

# Run locally without Slurm (ensure Ray and 8 GPUs available)
export WANDB_API_KEY=your_key_here
export DATA_ROOT=/data/experiments
bash examples/sapo_trainer/run_qwen30b_sapo.sh
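
Before a local run, the GPU requirement can be checked explicitly. This sketch counts the devices listed by `nvidia-smi -L`; `count_gpus` and `require_local_gpus` are hypothetical helpers, and the default of 8 mirrors the `--gres=gpu:8` request in the script header.

```bash
#!/usr/bin/env bash
# Sketch: verify that enough GPUs are visible before a local run.
# count_gpus and require_local_gpus are illustrative helpers.

# Print the number of GPUs reported by nvidia-smi, or 0 if it is unavailable.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # grep -c exits non-zero on zero matches, so ignore its status.
        nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

require_local_gpus() {
    local needed="${1:-8}" found
    found="$(count_gpus)"
    if [ "$found" -lt "$needed" ]; then
        echo "need $needed GPUs, found $found" >&2
        return 1
    fi
}
```

Calling `require_local_gpus 8 || exit 1` ahead of the `bash` invocation would abort early on an under-provisioned host.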

Key SAPO Parameters

# The SAPO-specific parameters in the script:
loss_mode=sapo       # Use the SAPO soft gate instead of PPO hard clipping
tau_pos=1.0          # Gate temperature for positive-advantage tokens
tau_neg=1.05         # Gate temperature for negative-advantage tokens (slightly > tau_pos)
# Per the paper (arXiv:2511.20347), tau_neg > tau_pos makes the gate asymmetric:
# updates on negative-advantage tokens are attenuated faster as the policy
# drifts off-policy, which the paper motivates as a stability measure.
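
When experimenting with other temperatures, a small guard can catch values that break the ordering used by the reference settings. This is a sketch under stated assumptions: the `TAU_POS` and `TAU_NEG` environment variables and the `check_tau_order` helper are hypothetical, and the `tau_neg >= tau_pos` check reflects the defaults above rather than a constraint enforced by verl.

```bash
#!/usr/bin/env bash
# Sketch: optional environment overrides for the two temperatures.
# TAU_POS, TAU_NEG and check_tau_order are illustrative, not read by the script itself.
tau_pos="${TAU_POS:-1.0}"
tau_neg="${TAU_NEG:-1.05}"

# Succeeds when tau_pos is positive and tau_neg is at least tau_pos.
check_tau_order() {
    awk -v p="$1" -v n="$2" 'BEGIN { exit !((p + 0) > 0 && (n + 0) >= (p + 0)) }'
}

if ! check_tau_order "$tau_pos" "$tau_neg"; then
    echo "expected 0 < tau_pos <= tau_neg, got tau_pos=$tau_pos tau_neg=$tau_neg" >&2
fi
```

The comparison is delegated to `awk` because Bash arithmetic does not handle floating-point values.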
