Workflow:Volcengine Verl GRPO Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Post_Training |
| Last Updated | 2026-02-07 18:00 GMT |
Overview
End-to-end process for training large language models using Group Relative Policy Optimization (GRPO) with verl, from data preparation through distributed RL training with vLLM rollout generation.
Description
This workflow covers the standard procedure for post-training LLMs using GRPO, a critic-less reinforcement learning algorithm. GRPO eliminates the need for a separate value network (critic) by sampling multiple completions per prompt and using group-relative advantages for policy optimization. The workflow leverages verl's hybrid-controller architecture with FSDP or Megatron-LM for training and vLLM or SGLang for efficient rollout generation. It supports full-parameter training or parameter-efficient fine-tuning via LoRA.
Usage
Execute this workflow when you have a task-specific dataset (e.g., math reasoning, coding, instruction following) in parquet format and want to improve an LLM's performance through reinforcement learning without training a separate reward model or critic network. This is the recommended starting point for most RL post-training scenarios with verl, especially when verifiable rewards (rule-based evaluation) are available.
Execution Steps
Step 1: Environment Setup and Installation
Install verl with the desired backend engines. This includes the core verl library, a training backend (FSDP or Megatron-LM), and an inference engine for rollout generation (vLLM or SGLang). Configure Ray for distributed execution across available GPUs.
Key considerations:
- Choose FSDP for simpler setup or Megatron-LM for large-scale models requiring tensor/pipeline parallelism
- Ensure vLLM >= 0.8.2 or latest SGLang is installed for rollout compatibility
- Ray must be initialized for distributed worker coordination
Step 2: Data Preparation
Convert raw training data into verl's standardized parquet format. Each record must contain a structured chat-format prompt, data source identifier, ability tag, and reward model configuration specifying either rule-based or model-based reward evaluation.
Key considerations:
- Prompts must be in chat message format (list of role/content dictionaries)
- For rule-based rewards, include ground truth in the reward_model field
- The parquet schema requires: data_source, prompt, ability, reward_model, extra_info columns
- Both train and test splits should be prepared for evaluation during training
Step 3: Model Selection and Configuration
Select a HuggingFace-compatible base model and configure the training parameters. This includes setting batch sizes, learning rates, KL divergence coefficients, and the number of response samples per prompt (the group size for GRPO).
Key considerations:
- Group size (n) is critical for GRPO — typically 5-16 samples per prompt
- KL loss coefficient (typically 0.001) prevents the policy from diverging too far from the reference
- For LoRA training, configure rank, alpha, and target modules to reduce memory requirements
- Configure FSDP param/optimizer offloading based on available GPU memory
Step 4: Rollout Generation
Generate multiple response completions for each prompt in the training batch using the current policy. The rollout engine (vLLM or SGLang) runs the model in inference mode with tensor parallelism for throughput, producing n samples per prompt that form the comparison group.
Key considerations:
- Tensor parallel size for rollout is configured independently from training
- Async rollout mode can overlap generation with training for better throughput
- Temperature and top-p sampling parameters affect exploration diversity
- GPU memory utilization for the rollout engine should be tuned to avoid OOM
Step 5: Reward Computation
Score each generated response using either a rule-based reward function (e.g., exact match against ground truth) or a learned reward model. For math tasks, this typically involves extracting the final answer and comparing it to the known solution.
Key considerations:
- Rule-based rewards provide deterministic, verifiable signals (preferred when available)
- Custom reward functions can be registered and loaded dynamically
- Reward scores are attached to each response in the DataProto batch
Step 6: Advantage Estimation and Policy Update
Compute group-relative advantages by normalizing rewards within each prompt's response group (subtracting group mean, dividing by group standard deviation). Use these advantages to update the actor policy via the clipped surrogate objective with KL regularization.
Key considerations:
- GRPO normalizes advantages within each group, making it robust to reward scale
- The actor is updated for multiple PPO epochs over mini-batches of the trajectory data
- KL loss is computed against a frozen reference model to prevent mode collapse
- Clip ratio (default 0.2) bounds the magnitude of policy updates
Step 7: Evaluation and Checkpointing
Periodically evaluate the updated policy on a held-out test set and save model checkpoints. Track training metrics (reward mean, KL divergence, policy loss) via experiment tracking tools like Weights & Biases, MLflow, or TensorBoard.
Key considerations:
- Test frequency and checkpoint interval are configurable
- Checkpoints support FSDP and Megatron formats with conversion to HuggingFace
- LoRA adapters can be merged back into base weights after training
- Early stopping can be configured based on evaluation metrics