
Principle:Volcengine Verl Rollout Generation

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Inference, Distributed_Systems
Last Updated 2026-02-07 14:00 GMT

Overview

The process of generating text completions from the current policy model using an optimized inference engine, producing responses with associated log probabilities for policy gradient computation.

Description

Rollout Generation is the sampling step in the RL training loop where the current policy model generates responses to input prompts. In verl, this is performed by dedicated rollout workers running optimized inference engines (vLLM or SGLang) that are separate from the training workers.

The rollout phase serves multiple purposes:

  • Generate multiple candidate responses per prompt (for GRPO group-based advantage estimation)
  • Collect token-level log probabilities under the current policy (needed for the PPO ratio computation)
  • Optionally collect reference model log probabilities (for KL penalty computation)
  • Support multimodal inputs (images, video) for vision-language models

The key architectural insight is the decoupling of training (FSDP/Megatron) and inference (vLLM/SGLang) backends. Model weights are periodically synced from the training actor to the rollout engine, allowing each to use its optimal parallelism strategy.
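The decoupling can be pictured as two objects holding separate weight copies, with an explicit sync step between them; a toy sketch (class and method names are invented for illustration, not verl's actual interfaces):

```python
class Actor:
    """Training-side model (stands in for an FSDP/Megatron actor)."""
    def __init__(self):
        self.weights = {"layer0": [0.0, 0.0]}

    def train_step(self):
        # Pretend gradient update
        self.weights = {k: [w + 0.1 for w in v] for k, v in self.weights.items()}


class RolloutEngine:
    """Inference-side engine (stands in for vLLM/SGLang)."""
    def __init__(self):
        self.weights = {}

    def update_weights(self, new_weights):
        # Copy, so training updates do not silently leak into inference
        self.weights = {k: list(v) for k, v in new_weights.items()}


actor, engine = Actor(), RolloutEngine()
engine.update_weights(actor.weights)  # sync before rollout
actor.train_step()                    # engine copy is now stale until the next sync
```

The point of the explicit copy is that each side can keep its own parallelism layout; only the periodic `update_weights` call couples them.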

Usage

Rollout generation is executed at the beginning of each training iteration. The number of samples per prompt (group size) is a critical hyperparameter:

  • For GRPO: typically 4-16 samples per prompt
  • For PPO: typically 1 sample per prompt
  • Larger group sizes improve advantage estimation stability but increase compute cost
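For GRPO, the n samples per prompt form the group over which advantages are normalized, which is why larger groups stabilize the estimate; a minimal sketch of group-relative advantage estimation (helper name hypothetical):

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of n completions.

    Each completion's advantage is its reward standardized against the
    group mean and (population) standard deviation, so no learned
    critic is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 1.0, 0.0]` the group mean is 0.5 and the advantages are approximately `[1, -1, 1, -1]`: correct completions are pushed up relative to their own group, not an absolute baseline.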

Theoretical Basis

The rollout process generates trajectories from the current policy:

τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T)

Where at each step t:

  • a_t ~ π_θ(· | s_t) — tokens are sampled from the policy
  • log π_θ(a_t | s_t) — log probabilities are recorded for the policy gradient

Key sampling parameters:

  • Temperature: Controls randomness of sampling (higher = more diverse)
  • Top-p: Nucleus sampling threshold
  • Group size (n): Number of completions per prompt
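Temperature and top-p can be made concrete on a toy next-token distribution; the sketch below implements the standard definitions (temperature-scaled softmax, then nucleus truncation), not the actual vLLM/SGLang sampler:

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) truncation."""
    # Temperature-scaled softmax (numerically stabilized by subtracting the max)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]

    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample within the renormalized nucleus
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

At low temperature the distribution sharpens until the nucleus collapses to the single most likely token, which is why low-temperature rollouts lose the diversity that group-based advantage estimation depends on.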

Pseudo-code:

# Abstract rollout generation (one training iteration)
# Sync weights from actor to rollout engine once per iteration;
# the actor is fixed until the next policy update
rollout_engine.update_weights(actor.parameters())
for batch in prompts:
    # Generate n completions per prompt
    outputs = rollout_engine.generate(
        prompts=batch,
        n=group_size,
        temperature=temperature,
        max_tokens=max_new_tokens,
    )
    # Collect responses and token-level log probs
    responses, log_probs = extract(outputs)
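The log probabilities collected above become the denominator of the PPO importance ratio at update time; a minimal sketch of the clipped per-token surrogate (standard PPO, not verl-specific code):

```python
import math


def clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one token.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log probs;
    old_logp is the value recorded by the rollout engine during generation.
    Clipping keeps the update from pushing the ratio far outside
    [1 - clip_eps, 1 + clip_eps].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)
```

When the policy has not moved (`new_logp == old_logp`) the ratio is 1 and the objective is just the advantage; once the ratio leaves the clip range, the gradient through it is cut off.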

Related Pages

Implemented By

Heuristics Used
