Implementation:Alibaba ROLL RLVRConfig

From Leeroopedia


Knowledge Sources

  • Domains: Reinforcement_Learning, Configuration
  • Last Updated: 2026-02-07 20:00 GMT

Overview

RLVRConfig is the concrete configuration dataclass for RLVR (Reinforcement Learning with Verifiable Rewards) training pipelines in the Alibaba ROLL library.

Description

The RLVRConfig class is a Python dataclass that extends PPOConfig with RLVR-specific settings. It manages multi-domain reward configurations, advantage estimation selection, reward normalization, KL penalty parameters, and distributed worker allocation. The class includes comprehensive post-initialization validation that sets default worker classes, builds domain-to-tag mappings, and validates parameter consistency.
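The post-initialization pattern described above can be sketched with a minimal self-contained dataclass. This is an illustrative stand-in, not ROLL's actual code: the field and method names are simplified (for example, per-domain tag lists replace ROLL's RewardConfig objects), but it shows the same three steps of validating parameters, checking domain sampling ratios, and building a tag-to-domain mapping.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MiniRLVRConfig:
    """Illustrative stand-in for RLVRConfig's __post_init__ validation."""
    adv_estimator: str = "grpo"
    init_kl_coef: float = 0.1
    # Hypothetical simplification: per-domain tag lists instead of RewardConfig.
    rewards: Optional[Dict[str, List[str]]] = None
    domain_interleave_probs: Optional[Dict[str, float]] = None
    tag_to_domain: Dict[str, str] = field(default_factory=dict, init=False)

    def __post_init__(self):
        # Validate the advantage estimator against the supported set.
        if self.adv_estimator not in {"grpo", "reinforce", "gae"}:
            raise ValueError(f"unknown adv_estimator: {self.adv_estimator}")
        # Domain sampling ratios must sum to 1.
        if self.domain_interleave_probs is not None:
            total = sum(self.domain_interleave_probs.values())
            if not math.isclose(total, 1.0, rel_tol=1e-6):
                raise ValueError(
                    f"domain_interleave_probs must sum to 1, got {total}")
        # Build the tag -> domain mapping from the per-domain reward configs.
        if self.rewards is not None:
            for domain, tags in self.rewards.items():
                for tag in tags:
                    self.tag_to_domain[tag] = domain
```

Running `__post_init__` at construction time means an invalid configuration fails loudly before any workers are launched, which is the same motivation behind ROLL's validation.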

Usage

Import and instantiate this class when configuring an RLVR training pipeline. In practice it is typically loaded from a YAML file via Hydra and converted with dacite rather than constructed directly.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/rlvr/rlvr_config.py
  • Lines: L82-169

Signature

@dataclass
class RLVRConfig(PPOConfig):
    """
    Configuration for RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.

    Key Attributes:
        adv_estimator: str - advantage estimator ("grpo", "reinforce", "gae")
        norm_mean_type: str - reward normalization mean type
        norm_std_type: str - reward normalization std type
        reward_clip: float - reward clipping threshold
        advantage_clip: float - advantage clipping threshold
        init_kl_coef: float - initial KL penalty coefficient
        rewards: Optional[Dict[str, RewardConfig]] - per-domain reward configs
        num_return_sequences_in_group: int - samples per prompt for variance reduction
        domain_interleave_probs: Optional[Dict[str, float]] - domain sampling ratios
    """
    def __post_init__(self):
        """Validates configuration and sets default worker classes."""
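With adv_estimator set to "grpo", advantages are typically computed by normalizing each reward against the group of num_return_sequences_in_group samples drawn from the same prompt. The sketch below shows the standard GRPO group-normalization idea together with the reward_clip and advantage_clip thresholds named in the docstring; it is a generic illustration of the technique, not ROLL's implementation, and the clip values are the documented defaults used as examples.

```python
from statistics import mean, stdev
from typing import List


def group_normalized_advantages(rewards: List[float],
                                reward_clip: float = 5.0,
                                advantage_clip: float = 2.0,
                                eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages for one prompt's group of sampled responses."""
    # Clip raw rewards to a bounded range first.
    clipped = [max(-reward_clip, min(reward_clip, r)) for r in rewards]
    mu = mean(clipped)
    sigma = stdev(clipped) if len(clipped) > 1 else 0.0
    # Normalize within the group; each sample's advantage is its z-score.
    advs = [(r - mu) / (sigma + eps) for r in clipped]
    # Clip advantages for numerical stability.
    return [max(-advantage_clip, min(advantage_clip, a)) for a in advs]
```

Because the mean is subtracted within each group, the advantages of a prompt's samples sum to roughly zero, which is what makes sampling several responses per prompt act as variance reduction.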

Import

from roll.pipeline.rlvr.rlvr_config import RLVRConfig

I/O Contract

Inputs

  • YAML config file (str, path; required): Hydra-managed YAML configuration file
  • CLI overrides (str; optional): command-line parameter overrides (e.g., rollout_batch_size=128)

Outputs

  • RLVRConfig instance (RLVRConfig): fully validated configuration with all pipeline parameters
  • Worker configs (WorkerConfig): nested configs for the actor_train, actor_infer, reference, critic, and reward workers

Usage Examples

Loading from YAML

from hydra import compose, initialize
from omegaconf import OmegaConf
import dacite

from roll.pipeline.rlvr.rlvr_config import RLVRConfig

# 1. Load YAML configuration via Hydra (initialize is a context manager;
#    version_base=None opts out of version-specific defaults on Hydra >= 1.2)
with initialize(config_path="examples/qwen2.5-7B-rlvr_megatron", version_base=None):
    cfg = compose(config_name="rlvr_config")

# 2. Convert to a plain dict, then to the RLVRConfig dataclass
config_dict = OmegaConf.to_container(cfg, resolve=True)
rlvr_config = dacite.from_dict(data_class=RLVRConfig, data=config_dict)

# 3. Access key parameters
print(rlvr_config.adv_estimator)           # "grpo"
print(rlvr_config.init_kl_coef)            # 0.1
print(rlvr_config.reward_clip)             # 5.0
print(rlvr_config.num_return_sequences_in_group)  # 8

CLI Override

# Override parameters from command line
# python examples/start_rlvr_pipeline.py rollout_batch_size=128 max_steps=1000
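The key=value override strings above can be parsed with a small stdlib helper. This is a hypothetical sketch of the idea; Hydra performs this parsing internally, along with nested-key resolution and interpolation that are omitted here.

```python
from typing import Any, Dict, List


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Parse Hydra-style key=value CLI overrides into a flat dict."""
    out: Dict[str, Any] = {}
    for arg in args:
        key, _, raw = arg.partition("=")
        # Coerce ints, booleans, and floats; leave everything else as str.
        if raw.lstrip("-").isdigit():
            out[key] = int(raw)
        elif raw.lower() in ("true", "false"):
            out[key] = raw.lower() == "true"
        else:
            try:
                out[key] = float(raw)
            except ValueError:
                out[key] = raw
    return out
```

For example, `parse_overrides(["rollout_batch_size=128"])` yields `{"rollout_batch_size": 128}` with the value coerced to int, mirroring how the override reaches the dataclass as a typed field.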
