Implementation: Alibaba ROLL MathRuleRewardWorker Compute Rewards
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Modeling |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A concrete, rule-based reward worker for mathematical answer verification, provided by the Alibaba ROLL library.
Description
The MathRuleRewardWorker class computes rewards for math-domain responses using symbolic verification. It parses generated responses for mathematical expressions, compares them against ground-truth answers using SymPy-based equivalence checking, and applies penalties for formatting issues such as repetition and non-compliance with the expected answer format. The worker operates as a Ray actor within a reward cluster and processes batches of generated responses in parallel.
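The core idea of SymPy-based equivalence checking can be illustrated with a minimal sketch. This is not ROLL's actual implementation; the function name and fallback behavior are assumptions for illustration only:

```python
# Hedged sketch: illustrates SymPy-based answer equivalence checking of the
# kind the worker performs. Not ROLL's actual code.
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(predicted: str, ground_truth: str) -> bool:
    """Return True if two math expressions are symbolically equal."""
    try:
        # Two expressions are equivalent if their difference simplifies to zero
        diff = sympy.simplify(parse_expr(predicted) - parse_expr(ground_truth))
        return diff == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison when parsing fails
        return predicted.strip() == ground_truth.strip()

answers_equivalent("2*x + x", "3*x")  # True: symbolically equal
```

Symbolic comparison is what makes rule-based math rewards robust to surface-form differences (e.g. `1/2` vs `0.5` vs `x/(2*x)` after simplification) that a plain string match would miss.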
ROLL also provides other reward workers for different domains:
- CodeSandboxRewardWorker - Sandboxed code execution with test cases
- LLMJudgeRewardWorker - LLM-based quality evaluation
- GeneralRuleRewardWorker - IFEval and instruction-following checks
Usage
This worker is automatically instantiated as part of the reward cluster when the RLVR pipeline is configured with math-domain data. It is called during the reward computation step of each training iteration.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/rlvr/rewards/math_rule_reward_worker.py
- Lines: L242-261
Signature
```python
class MathRuleRewardWorker(Worker):
    def __init__(self, worker_config: WorkerConfig) -> None:
        """Initialize math rule reward worker with tokenizer and reward functions."""

    @register(dispatch_mode=Dispatch.DP_MP_COMPUTE, clear_cache=False)
    def compute_rewards(self, data: DataProto) -> DataProto:
        """
        Compute rewards using mathematical rule-based evaluation.

        Args:
            data: DataProto containing generated responses with input_ids,
                attention_mask, and ground truth answers

        Returns:
            DataProto with response_level_rewards tensor (per-sample float scores)
        """
```
Import
```python
from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with generated responses (input_ids, attention_mask, decoded text) |
| ground_truth | str | Yes | Expected correct answer (embedded in DataProto metadata) |
Outputs
| Name | Type | Description |
|---|---|---|
| response_level_rewards | torch.Tensor | Per-sample reward scores (0.0 or 1.0 for binary verification) |
| scores | torch.Tensor | Raw verification scores before penalty adjustments |
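The relationship between the raw `scores` and the final `response_level_rewards` can be sketched in plain Python. The combination rule below (score minus penalty, clamped to [0, 1]) is an assumption for illustration, not ROLL's documented formula:

```python
# Hedged sketch (not ROLL's code): how a raw verification score and a
# formatting penalty might combine into per-sample rewards.
def combine_rewards(scores, penalties):
    """Subtract each penalty from its score and clamp the result to [0.0, 1.0]."""
    return [max(0.0, min(1.0, s - p)) for s, p in zip(scores, penalties)]

combine_rewards([1.0, 1.0, 0.0], [0.0, 0.3, 0.0])  # [1.0, 0.7, 0.0]
```

In the real worker these would be `torch.Tensor` batches rather than lists, but the per-sample structure is the same: verification produces a binary score, and penalties can only reduce it.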
Usage Examples
Reward Worker in Pipeline
```python
# Reward workers are typically created and called by the RLVR pipeline automatically.
# Manual usage for illustration:
from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker
from roll.distributed.scheduler.protocol import DataProto

# Worker is initialized in the reward cluster
worker = MathRuleRewardWorker(worker_config=reward_config)

# Compute rewards on a batch of generated responses
reward_data = worker.compute_rewards(data=batch_with_responses)

# Access reward scores
rewards = reward_data.batch["response_level_rewards"]  # shape: (batch_size,)
```