Implementation:Alibaba ROLL RLVRConfig
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete configuration dataclass for RLVR training pipelines provided by the Alibaba ROLL library.
Description
The RLVRConfig class is a Python dataclass that extends PPOConfig with RLVR-specific settings. It manages multi-domain reward configurations, advantage estimation selection, reward normalization, KL penalty parameters, and distributed worker allocation. The class includes comprehensive post-initialization validation that sets default worker classes, builds domain-to-tag mappings, and validates parameter consistency.
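The validation flow described above can be sketched as a small, self-contained dataclass. This is an illustrative pattern only, not the real class: the field names mirror the documented attributes, but `MiniRLVRConfig` and its checks are hypothetical stand-ins for the much larger `RLVRConfig` in `roll/pipeline/rlvr/rlvr_config.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RewardConfig:
    # Hypothetical stand-in: the real RewardConfig has many more fields.
    tag: str = "default"


@dataclass
class MiniRLVRConfig:
    adv_estimator: str = "grpo"
    init_kl_coef: float = 0.1
    rewards: Optional[Dict[str, RewardConfig]] = None
    domain_interleave_probs: Optional[Dict[str, float]] = None
    domain_to_tag: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Validate parameter consistency.
        if self.adv_estimator not in ("grpo", "reinforce", "gae"):
            raise ValueError(f"unknown adv_estimator: {self.adv_estimator}")
        # Build the domain-to-tag mapping from per-domain reward configs.
        if self.rewards:
            self.domain_to_tag = {d: r.tag for d, r in self.rewards.items()}
        # Domain sampling ratios, if given, should sum to 1.
        if self.domain_interleave_probs:
            total = sum(self.domain_interleave_probs.values())
            if abs(total - 1.0) > 1e-6:
                raise ValueError(f"domain_interleave_probs must sum to 1, got {total}")
```

Because `__post_init__` runs automatically after dataclass construction, invalid configurations fail fast at load time rather than mid-training.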
Usage
Import and instantiate this class when configuring an RLVR training pipeline. Typically loaded from YAML via Hydra and dacite rather than constructed directly.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/rlvr/rlvr_config.py
- Lines: L82-169
Signature
@dataclass
class RLVRConfig(PPOConfig):
    """
    Configuration for RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.

    Key Attributes:
        adv_estimator: str - advantage estimator ("grpo", "reinforce", "gae")
        norm_mean_type: str - reward normalization mean type
        norm_std_type: str - reward normalization std type
        reward_clip: float - reward clipping threshold
        advantage_clip: float - advantage clipping threshold
        init_kl_coef: float - initial KL penalty coefficient
        rewards: Optional[Dict[str, RewardConfig]] - per-domain reward configs
        num_return_sequences_in_group: int - samples per prompt for variance reduction
        domain_interleave_probs: Optional[Dict[str, float]] - domain sampling ratios
    """

    def __post_init__(self):
        """Validates configuration and sets default worker classes."""
Import
from roll.pipeline.rlvr.rlvr_config import RLVRConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| YAML config file | str (path) | Yes | Hydra-managed YAML configuration file |
| CLI overrides | str | No | Command-line parameter overrides (e.g., rollout_batch_size=128) |
Outputs
| Name | Type | Description |
|---|---|---|
| RLVRConfig instance | RLVRConfig | Fully validated configuration with all pipeline parameters |
| Worker configs | WorkerConfig | Nested configs for actor_train, actor_infer, reference, critic, reward workers |
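The nested worker configs in the table above are populated recursively by dacite from the YAML-derived dict. As a rough illustration of what that conversion amounts to (with hypothetical stand-in classes, done by hand rather than through dacite):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerConfig:
    # Hypothetical stand-in for the real nested worker config.
    world_size: int = 1


@dataclass
class PipelineConfig:
    # Hypothetical stand-in for the top-level pipeline config.
    adv_estimator: str = "grpo"
    actor_train: Optional[WorkerConfig] = None


# A plain dict, as produced by OmegaConf.to_container(cfg, resolve=True).
raw = {"adv_estimator": "grpo", "actor_train": {"world_size": 4}}

# dacite.from_dict performs this nesting automatically for every worker section.
cfg = PipelineConfig(
    adv_estimator=raw["adv_estimator"],
    actor_train=WorkerConfig(**raw["actor_train"]),
)
```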
Usage Examples
Loading from YAML
from hydra import compose, initialize
from omegaconf import OmegaConf
import dacite

from roll.pipeline.rlvr.rlvr_config import RLVRConfig

# 1. Load YAML configuration via Hydra
# (recent Hydra versions require initialize() to be used as a context
# manager with an explicit version_base)
with initialize(config_path="examples/qwen2.5-7B-rlvr_megatron", version_base=None):
    cfg = compose(config_name="rlvr_config")

# 2. Convert the OmegaConf tree to the RLVRConfig dataclass
config_dict = OmegaConf.to_container(cfg, resolve=True)
rlvr_config = dacite.from_dict(data_class=RLVRConfig, data=config_dict)

# 3. Access key parameters
print(rlvr_config.adv_estimator)                  # "grpo"
print(rlvr_config.init_kl_coef)                   # 0.1
print(rlvr_config.reward_clip)                    # 5.0
print(rlvr_config.num_return_sequences_in_group)  # 8
CLI Override
# Override parameters from command line
# python examples/start_rlvr_pipeline.py rollout_batch_size=128 max_steps=1000
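To make the override mechanism concrete, here is a minimal sketch of how Hydra-style `key=value` arguments could be parsed into a nested dict. This is not Hydra's implementation (Hydra also handles type resolution, interpolation, and config groups); it only illustrates how dotted keys map onto nested config sections.

```python
def parse_overrides(args):
    """Parse ["a.b=1", "c=x"] into a nested dict (illustrative sketch only)."""
    out = {}
    for arg in args:
        key, _, value = arg.partition("=")
        node = out
        *parents, leaf = key.split(".")
        # Dotted keys select nested config sections.
        for p in parents:
            node = node.setdefault(p, {})
        # Naive integer coercion, for illustration only.
        node[leaf] = int(value) if value.isdigit() else value
    return out


overrides = parse_overrides(["rollout_batch_size=128", "actor_train.world_size=4"])
# → {"rollout_batch_size": 128, "actor_train": {"world_size": 4}}
```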
Related Pages
Implements Principle
Requires Environment