Workflow:Volcengine Verl GRPO Training Pipeline

Knowledge Sources	verl verl Documentation HybridFlow GRPO Algorithm
Domains	LLMs, Reinforcement_Learning, Post_Training
Last Updated	2026-02-07 18:00 GMT

Overview

End-to-end process for training large language models using Group Relative Policy Optimization (GRPO) with verl, from data preparation through distributed RL training with vLLM rollout generation.

Description

This workflow covers the standard procedure for post-training LLMs using GRPO, a critic-less reinforcement learning algorithm. GRPO eliminates the need for a separate value network (critic) by sampling multiple completions per prompt and using group-relative advantages for policy optimization. The workflow leverages verl's hybrid-controller architecture with FSDP or Megatron-LM for training and vLLM or SGLang for efficient rollout generation. It supports full-parameter training or parameter-efficient fine-tuning via LoRA.

Usage

Execute this workflow when you have a task-specific dataset (e.g., math reasoning, coding, instruction following) in parquet format and want to improve an LLM's performance through reinforcement learning without training a separate reward model or critic network. This is the recommended starting point for most RL post-training scenarios with verl, especially when verifiable rewards (rule-based evaluation) are available.

Execution Steps

Step 1: Environment Setup and Installation

Install verl with the desired backend engines. This includes the core verl library, a training backend (FSDP or Megatron-LM), and an inference engine for rollout generation (vLLM or SGLang). Configure Ray for distributed execution across available GPUs.

Key considerations:

Choose FSDP for simpler setup or Megatron-LM for large-scale models requiring tensor/pipeline parallelism
Ensure vLLM >= 0.8.2 or latest SGLang is installed for rollout compatibility
Ray must be initialized for distributed worker coordination

Step 2: Data Preparation

Convert raw training data into verl's standardized parquet format. Each record must contain a structured chat-format prompt, data source identifier, ability tag, and reward model configuration specifying either rule-based or model-based reward evaluation.

Key considerations:

Prompts must be in chat message format (list of role/content dictionaries)
For rule-based rewards, include ground truth in the reward_model field
The parquet schema requires: data_source, prompt, ability, reward_model, extra_info columns
Both train and test splits should be prepared for evaluation during training

Step 3: Model Selection and Configuration

Select a HuggingFace-compatible base model and configure the training parameters. This includes setting batch sizes, learning rates, KL divergence coefficients, and the number of response samples per prompt (the group size for GRPO).

Key considerations:

Group size (n) is critical for GRPO — typically 5-16 samples per prompt
KL loss coefficient (typically 0.001) prevents the policy from diverging too far from the reference
For LoRA training, configure rank, alpha, and target modules to reduce memory requirements
Configure FSDP param/optimizer offloading based on available GPU memory

Step 4: Rollout Generation

Generate multiple response completions for each prompt in the training batch using the current policy. The rollout engine (vLLM or SGLang) runs the model in inference mode with tensor parallelism for throughput, producing n samples per prompt that form the comparison group.

Key considerations:

Tensor parallel size for rollout is configured independently from training
Async rollout mode can overlap generation with training for better throughput
Temperature and top-p sampling parameters affect exploration diversity
GPU memory utilization for the rollout engine should be tuned to avoid OOM

Step 5: Reward Computation

Score each generated response using either a rule-based reward function (e.g., exact match against ground truth) or a learned reward model. For math tasks, this typically involves extracting the final answer and comparing it to the known solution.

Key considerations:

Rule-based rewards provide deterministic, verifiable signals (preferred when available)
Custom reward functions can be registered and loaded dynamically
Reward scores are attached to each response in the DataProto batch

Step 6: Advantage Estimation and Policy Update

Compute group-relative advantages by normalizing rewards within each prompt's response group (subtracting group mean, dividing by group standard deviation). Use these advantages to update the actor policy via the clipped surrogate objective with KL regularization.

Key considerations:

GRPO normalizes advantages within each group, making it robust to reward scale
The actor is updated for multiple PPO epochs over mini-batches of the trajectory data
KL loss is computed against a frozen reference model to prevent mode collapse
Clip ratio (default 0.2) bounds the magnitude of policy updates

Step 7: Evaluation and Checkpointing

Periodically evaluate the updated policy on a held-out test set and save model checkpoints. Track training metrics (reward mean, KL divergence, policy loss) via experiment tracking tools like Weights & Biases, MLflow, or TensorBoard.

Key considerations:

Test frequency and checkpoint interval are configurable
Checkpoints support FSDP and Megatron formats with conversion to HuggingFace
LoRA adapters can be merged back into base weights after training
Early stopping can be configured based on evaluation metrics

Execution Diagram

GitHub URL

Workflow Repository