Implementation:Alibaba ROLL RLVRConfig
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete configuration dataclass for RLVR training pipelines provided by the Alibaba ROLL library.
Description
The RLVRConfig class is a Python dataclass that extends PPOConfig with RLVR-specific settings. It manages multi-domain reward configurations, advantage estimation selection, reward normalization, KL penalty parameters, and distributed worker allocation. The class includes comprehensive post-initialization validation that sets default worker classes, builds domain-to-tag mappings, and validates parameter consistency.
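The validation flow described above can be sketched as a small, self-contained dataclass. This is an illustrative pattern only, not the real class: the field names mirror the documented attributes, but `MiniRLVRConfig` and its checks are hypothetical stand-ins for the much larger `RLVRConfig` in `roll/pipeline/rlvr/rlvr_config.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RewardConfig:
    # Hypothetical stand-in: the real RewardConfig has many more fields.
    tag: str = "default"


@dataclass
class MiniRLVRConfig:
    adv_estimator: str = "grpo"
    init_kl_coef: float = 0.1
    rewards: Optional[Dict[str, RewardConfig]] = None
    domain_interleave_probs: Optional[Dict[str, float]] = None
    domain_to_tag: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Validate parameter consistency.
        if self.adv_estimator not in ("grpo", "reinforce", "gae"):
            raise ValueError(f"unknown adv_estimator: {self.adv_estimator}")
        # Build the domain-to-tag mapping from per-domain reward configs.
        if self.rewards:
            self.domain_to_tag = {d: r.tag for d, r in self.rewards.items()}
        # Domain sampling ratios, if given, should sum to 1.
        if self.domain_interleave_probs:
            total = sum(self.domain_interleave_probs.values())
            if abs(total - 1.0) > 1e-6:
                raise ValueError(f"domain_interleave_probs must sum to 1, got {total}")
```

Because `__post_init__` runs automatically after dataclass construction, invalid configurations fail fast at load time rather than mid-training.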
Usage
Import and instantiate this class when configuring an RLVR training pipeline. Typically loaded from YAML via Hydra and dacite rather than constructed directly.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/rlvr/rlvr_config.py
- Lines: L82-169
Signature
@dataclass
class RLVRConfig(PPOConfig):
    """
    Configuration for RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.

    Key Attributes:
        adv_estimator: str - advantage estimator ("grpo", "reinforce", "gae")
        norm_mean_type: str - reward normalization mean type
        norm_std_type: str - reward normalization std type
        reward_clip: float - reward clipping threshold
        advantage_clip: float - advantage clipping threshold
        init_kl_coef: float - initial KL penalty coefficient
        rewards: Optional[Dict[str, RewardConfig]] - per-domain reward configs
        num_return_sequences_in_group: int - samples per prompt for variance reduction
        domain_interleave_probs: Optional[Dict[str, float]] - domain sampling ratios
    """

    def __post_init__(self):
        """Validates configuration and sets default worker classes."""
Import
from roll.pipeline.rlvr.rlvr_config import RLVRConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| YAML config file | str (path) | Yes | Hydra-managed YAML configuration file |
| CLI overrides | str | No | Command-line parameter overrides (e.g., rollout_batch_size=128) |
Outputs
| Name | Type | Description |
|---|---|---|
| RLVRConfig instance | RLVRConfig | Fully validated configuration with all pipeline parameters |
| Worker configs | WorkerConfig | Nested configs for actor_train, actor_infer, reference, critic, reward workers |
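The nested worker configs in the table above are populated recursively by dacite from the YAML-derived dict. As a rough illustration of what that conversion amounts to (with hypothetical stand-in classes, done by hand rather than through dacite):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerConfig:
    # Hypothetical stand-in for the real nested worker config.
    world_size: int = 1


@dataclass
class PipelineConfig:
    # Hypothetical stand-in for the top-level pipeline config.
    adv_estimator: str = "grpo"
    actor_train: Optional[WorkerConfig] = None


# A plain dict, as produced by OmegaConf.to_container(cfg, resolve=True).
raw = {"adv_estimator": "grpo", "actor_train": {"world_size": 4}}

# dacite.from_dict performs this nesting automatically for every worker section.
cfg = PipelineConfig(
    adv_estimator=raw["adv_estimator"],
    actor_train=WorkerConfig(**raw["actor_train"]),
)
```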
Usage Examples
Loading from YAML
from hydra import compose, initialize
from omegaconf import OmegaConf
import dacite

from roll.pipeline.rlvr.rlvr_config import RLVRConfig

# 1. Load YAML configuration via Hydra
# (recent Hydra versions require initialize() to be used as a context
# manager with an explicit version_base)
with initialize(config_path="examples/qwen2.5-7B-rlvr_megatron", version_base=None):
    cfg = compose(config_name="rlvr_config")

# 2. Convert the OmegaConf tree to the RLVRConfig dataclass
config_dict = OmegaConf.to_container(cfg, resolve=True)
rlvr_config = dacite.from_dict(data_class=RLVRConfig, data=config_dict)

# 3. Access key parameters
print(rlvr_config.adv_estimator)                  # "grpo"
print(rlvr_config.init_kl_coef)                   # 0.1
print(rlvr_config.reward_clip)                    # 5.0
print(rlvr_config.num_return_sequences_in_group)  # 8
CLI Override
# Override parameters from command line
# python examples/start_rlvr_pipeline.py rollout_batch_size=128 max_steps=1000
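To make the override mechanism concrete, here is a minimal sketch of how Hydra-style `key=value` arguments could be parsed into a nested dict. This is not Hydra's implementation (Hydra also handles type resolution, interpolation, and config groups); it only illustrates how dotted keys map onto nested config sections.

```python
def parse_overrides(args):
    """Parse ["a.b=1", "c=x"] into a nested dict (illustrative sketch only)."""
    out = {}
    for arg in args:
        key, _, value = arg.partition("=")
        node = out
        *parents, leaf = key.split(".")
        # Dotted keys select nested config sections.
        for p in parents:
            node = node.setdefault(p, {})
        # Naive integer coercion, for illustration only.
        node[leaf] = int(value) if value.isdigit() else value
    return out


overrides = parse_overrides(["rollout_batch_size=128", "actor_train.world_size=4"])
# → {"rollout_batch_size": 128, "actor_train": {"world_size": 4}}
```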
Related Pages
Implements Principle
Requires Environment