
Principle:Volcengine Verl Rollout Generation

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Inference, Distributed_Systems
Last Updated 2026-02-07 14:00 GMT

Overview

The process of generating text completions from the current policy model using an optimized inference engine, producing responses with associated log probabilities for policy gradient computation.

Description

Rollout Generation is the sampling step in the RL training loop where the current policy model generates responses to input prompts. In verl, this is performed by dedicated rollout workers running optimized inference engines (vLLM or SGLang) that are separate from the training workers.

The rollout phase serves multiple purposes:

  • Generate multiple candidate responses per prompt (for GRPO group-based advantage estimation)
  • Collect token-level log probabilities under the current policy (needed for the PPO ratio computation)
  • Optionally collect reference model log probabilities (for KL penalty computation)
  • Support multimodal inputs (images, video) for vision-language models

The key architectural insight is the decoupling of training (FSDP/Megatron) and inference (vLLM/SGLang) backends. Model weights are periodically synced from the training actor to the rollout engine, allowing each to use its optimal parallelism strategy.
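The decoupling can be pictured as two objects holding separate weight copies, with an explicit sync step between them; a toy sketch (class and method names are invented for illustration, not verl's actual interfaces):

```python
class Actor:
    """Training-side model (stands in for an FSDP/Megatron actor)."""
    def __init__(self):
        self.weights = {"layer0": [0.0, 0.0]}

    def train_step(self):
        # Pretend gradient update
        self.weights = {k: [w + 0.1 for w in v] for k, v in self.weights.items()}


class RolloutEngine:
    """Inference-side engine (stands in for vLLM/SGLang)."""
    def __init__(self):
        self.weights = {}

    def update_weights(self, new_weights):
        # Copy, so training updates do not silently leak into inference
        self.weights = {k: list(v) for k, v in new_weights.items()}


actor, engine = Actor(), RolloutEngine()
engine.update_weights(actor.weights)  # sync before rollout
actor.train_step()                    # engine copy is now stale until the next sync
```

The point of the explicit copy is that each side can keep its own parallelism layout; only the periodic `update_weights` call couples them.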

Usage

Rollout generation is executed at the beginning of each training iteration. The number of samples per prompt (group size) is a critical hyperparameter:

  • For GRPO: typically 4-16 samples per prompt
  • For PPO: typically 1 sample per prompt
  • Larger group sizes improve advantage estimation stability but increase compute cost
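For GRPO, the n samples per prompt form the group over which advantages are normalized, which is why larger groups stabilize the estimate; a minimal sketch of group-relative advantage estimation (helper name hypothetical):

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of n completions.

    Each completion's advantage is its reward standardized against the
    group mean and (population) standard deviation, so no learned
    critic is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 1.0, 0.0]` the group mean is 0.5 and the advantages are approximately `[1, -1, 1, -1]`: correct completions are pushed up relative to their own group, not an absolute baseline.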

Theoretical Basis

The rollout process generates trajectories from the current policy:

τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T)

Where at each step t:

  • a_t ~ π_θ(· | s_t) — tokens are sampled from the policy
  • log π_θ(a_t | s_t) — log probabilities are recorded for the policy gradient

Key sampling parameters:

  • Temperature: Controls randomness of sampling (higher = more diverse)
  • Top-p: Nucleus sampling threshold
  • Group size (n): Number of completions per prompt
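Temperature and top-p can be made concrete on a toy next-token distribution; the sketch below implements the standard definitions (temperature-scaled softmax, then nucleus truncation), not the actual vLLM/SGLang sampler:

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) truncation."""
    # Temperature-scaled softmax (numerically stabilized by subtracting the max)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]

    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample within the renormalized nucleus
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

At low temperature the distribution sharpens until the nucleus collapses to the single most likely token, which is why low-temperature rollouts lose the diversity that group-based advantage estimation depends on.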

Pseudo-code:

# Abstract rollout generation (one training iteration)
# Sync weights from actor to rollout engine once per iteration;
# the actor is fixed until the next policy update
rollout_engine.update_weights(actor.parameters())
for batch in prompts:
    # Generate n completions per prompt
    outputs = rollout_engine.generate(
        prompts=batch,
        n=group_size,
        temperature=temperature,
        max_tokens=max_new_tokens,
    )
    # Collect responses and token-level log probs
    responses, log_probs = extract(outputs)
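The log probabilities collected above become the denominator of the PPO importance ratio at update time; a minimal sketch of the clipped per-token surrogate (standard PPO, not verl-specific code):

```python
import math


def clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one token.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log probs;
    old_logp is the value recorded by the rollout engine during generation.
    Clipping keeps the update from pushing the ratio far outside
    [1 - clip_eps, 1 + clip_eps].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)
```

When the policy has not moved (`new_logp == old_logp`) the ratio is 1 and the objective is just the advantage; once the ratio leaves the clip range, the gradient through it is cut off.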

Related Pages

Implemented By

Heuristics Used
