Principle: Volcengine Verl Rollout Generation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Inference, Distributed_Systems |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of generating text completions from the current policy model using an optimized inference engine, producing responses with associated log probabilities for policy gradient computation.
Description
Rollout Generation is the sampling step in the RL training loop where the current policy model generates responses to input prompts. In verl, this is performed by dedicated rollout workers running optimized inference engines (vLLM or SGLang) that are separate from the training workers.
The rollout phase serves multiple purposes:
- Generate multiple candidate responses per prompt (for GRPO group-based advantage estimation)
- Collect token-level log probabilities under the current policy (needed for the PPO ratio computation)
- Optionally collect reference model log probabilities (for KL penalty computation)
- Support multimodal inputs (images, video) for vision-language models
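The log probabilities collected here feed directly into the PPO ratio and the optional KL penalty. A minimal sketch of those two per-token terms (function names are illustrative, not verl's API):

```python
import math

def ppo_ratio(logp_new: float, logp_old: float) -> float:
    """Importance ratio pi_new(a|s) / pi_old(a|s), computed from log probs."""
    return math.exp(logp_new - logp_old)

def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Simple per-token KL estimate log(pi / pi_ref), used as a reward penalty."""
    return logp_policy - logp_ref

# Identical log probs give a ratio of exactly 1 and zero KL penalty
assert ppo_ratio(-2.0, -2.0) == 1.0
assert kl_penalty(-2.0, -2.0) == 0.0
```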
The key architectural insight is the decoupling of training (FSDP/Megatron) and inference (vLLM/SGLang) backends. Model weights are periodically synced from the training actor to the rollout engine, allowing each to use its optimal parallelism strategy.
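The decoupling above can be sketched as two objects holding separate weight copies, with a sync before each rollout phase. The class and attribute names below are illustrative stand-ins, not verl's actual classes:

```python
class RolloutEngine:
    """Stands in for a vLLM/SGLang worker holding its own copy of the weights."""
    def __init__(self):
        self.weights = {}

    def update_weights(self, state_dict):
        # Overwrite the inference copy with the latest actor weights
        self.weights = dict(state_dict)

class Actor:
    """Stands in for the FSDP/Megatron training worker."""
    def __init__(self):
        self.state_dict = {"layer.w": 0.0}

    def train_step(self):
        self.state_dict["layer.w"] += 1.0  # pretend gradient update

actor, engine = Actor(), RolloutEngine()
for step in range(3):
    engine.update_weights(actor.state_dict)  # sync before generating rollouts
    # ... engine would generate completions here with fresh weights ...
    actor.train_step()                       # then the actor updates its weights
```

The inference copy always lags the training copy by exactly one update, which is why the sync must happen at the start of every iteration.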
Usage
Rollout generation is executed at the beginning of each training iteration. The number of samples per prompt (group size) is a critical hyperparameter:
- For GRPO: typically 4-16 samples per prompt
- For PPO: typically 1 sample per prompt
- Larger group sizes improve advantage estimation stability but increase compute cost
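Group size matters because GRPO estimates advantages by normalizing each sample's reward against the other completions of the same prompt. A minimal sketch, assuming simple mean/std normalization within the group:

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: z-score of each completion's reward
    against the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 4 completions for one prompt: above-average rewards get positive advantage
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With only 1 sample per prompt the group statistics are degenerate, which is why GRPO requires a group size of at least 2 and benefits from larger groups.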
Theoretical Basis
The rollout process generates trajectories from the current policy:

y ~ π_θ(· | x), where y = (y_1, ..., y_T)

Where at each step t:
- y_t ~ π_θ(· | x, y_<t) — tokens are sampled autoregressively from the policy
- log π_θ(y_t | x, y_<t) — log probabilities are recorded for the policy gradient
Key sampling parameters:
- Temperature: Controls randomness of sampling (higher = more diverse)
- Top-p: Nucleus sampling threshold
- Group size (n): Number of completions per prompt
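The temperature and top-p parameters above can be made concrete with a self-contained sampler over a token-to-logit map. This is an illustrative sketch of the sampling rule, not the inference engine's implementation:

```python
import math
import random

def sample_token(logits: dict, temperature: float = 1.0,
                 top_p: float = 1.0, rng=None) -> str:
    """Temperature-scaled nucleus (top-p) sampling over a token -> logit map."""
    rng = rng or random.Random()
    # Temperature scaling, then a numerically stable softmax
    scaled = {t: l / temperature for t, l in logits.items()}
    z = max(scaled.values())
    exps = {t: math.exp(l - z) for t, l in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((t, e / total) for t, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Keep the smallest prefix of tokens whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the nucleus and draw one token
    r = rng.random() * mass
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]
```

With a dominant logit and a small top_p, the nucleus collapses to a single token, so sampling becomes deterministic; raising the temperature flattens the distribution and widens the nucleus.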
Pseudo-code:
```python
# Abstract rollout generation
# Sync weights from the training actor to the rollout engine once per iteration
# (the policy does not change during a rollout phase, so syncing per batch is wasteful)
rollout_engine.update_weights(actor.parameters())

for batch in prompts:
    # Generate n completions per prompt
    outputs = rollout_engine.generate(
        prompts=batch,
        n=group_size,
        temperature=temperature,
        max_tokens=max_new_tokens,
    )
    # Collect responses and token-level log probs for the PPO/GRPO update
    responses, log_probs = extract(outputs)
```