
Implementation:Alibaba ROLL MathRuleRewardWorker Compute Rewards

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Reward_Modeling
Last Updated 2026-02-07 20:00 GMT

Overview

A concrete rule-based reward worker for mathematical answer verification, provided by the Alibaba ROLL library.

Description

The MathRuleRewardWorker class computes rewards for math-domain responses using symbolic verification. It parses each generated response for a final mathematical expression, checks it against the ground-truth answer with SymPy-based equivalence checking, and subtracts penalties for repetition and format non-compliance. The worker runs as a Ray actor within a reward cluster and scores batches of generated responses in parallel.
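To illustrate the verification step, here is a minimal sketch of SymPy-based equivalence checking. This is not ROLL's actual implementation: the function name `math_equal` is illustrative, and extraction of the answer from the full response (e.g. from a `\boxed{}` span) is omitted; the sketch assumes both sides are plain expressions.

```python
# Hedged sketch of rule-based math verification; not ROLL's actual code.
import sympy
from sympy.parsing.sympy_parser import parse_expr


def math_equal(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the two expressions are symbolically equivalent, else 0.0."""
    try:
        # Symbolic difference simplifying to zero means equivalence,
        # so "1/2" and "0.5" or "2*x + x" and "3*x" both match.
        diff = sympy.simplify(parse_expr(prediction) - parse_expr(ground_truth))
        return 1.0 if diff == 0 else 0.0
    except Exception:
        # Unparseable responses fall back to exact string comparison.
        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0


print(math_equal("2*x + x", "3*x"))  # 1.0
print(math_equal("1/2", "0.5"))      # 1.0
print(math_equal("4", "5"))          # 0.0
```

Symbolic comparison is why rule-based math rewards tolerate surface-form differences (fractions vs. decimals, unsimplified terms) that exact string matching would reject.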

ROLL also provides other reward workers for different domains:

  • CodeSandboxRewardWorker - Sandboxed code execution with test cases
  • LLMJudgeRewardWorker - LLM-based quality evaluation
  • GeneralRuleRewardWorker - IFEval and instruction-following checks
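Conceptually, the pipeline routes each data domain to one of these workers. The sketch below pictures that routing as a simple lookup; the worker class names come from ROLL, but the registry dict and the domain keys are illustrative assumptions, not ROLL's configuration schema.

```python
# Illustrative registry: only the worker class names come from ROLL;
# the dict and domain keys are assumptions for illustration.
REWARD_WORKERS = {
    "math": "MathRuleRewardWorker",
    "code_sandbox": "CodeSandboxRewardWorker",
    "llm_judge": "LLMJudgeRewardWorker",
    "general": "GeneralRuleRewardWorker",
}


def select_worker(domain: str) -> str:
    """Pick the reward worker class registered for a data domain."""
    return REWARD_WORKERS.get(domain, "GeneralRuleRewardWorker")


print(select_worker("math"))  # MathRuleRewardWorker
```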

Usage

This worker is instantiated automatically as part of the reward cluster when the RLVR pipeline is configured with math-domain data, and it is invoked during the reward-computation step of each training iteration.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/rlvr/rewards/math_rule_reward_worker.py
  • Lines: L242-261

Signature

class MathRuleRewardWorker(Worker):
    def __init__(self, worker_config: WorkerConfig) -> None:
        """Initialize math rule reward worker with tokenizer and reward functions."""

    @register(dispatch_mode=Dispatch.DP_MP_COMPUTE, clear_cache=False)
    def compute_rewards(self, data: DataProto) -> DataProto:
        """
        Compute rewards using mathematical rule-based evaluation.

        Args:
            data: DataProto containing generated responses with input_ids,
                  attention_mask, and ground truth answers

        Returns:
            DataProto with response_level_rewards tensor (per-sample float scores)
        """

Import

from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker

I/O Contract

Inputs

Name          Type       Required  Description
data          DataProto  Yes       Batch with generated responses (input_ids, attention_mask, decoded text)
ground_truth  str        Yes       Expected correct answer (embedded in DataProto metadata)

Outputs

Name                    Type          Description
response_level_rewards  torch.Tensor  Per-sample reward scores (0.0 or 1.0 for binary verification)
scores                  torch.Tensor  Raw verification scores before penalty adjustments
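The two output tensors differ by the penalty adjustments. A minimal sketch of how raw verification scores might be combined with repetition and format penalties (plain Python lists for brevity; ROLL operates on torch tensors, and the function name, penalty names, and weights here are illustrative assumptions):

```python
def apply_penalties(scores, repetition_flags, format_ok_flags,
                    repetition_penalty=0.5, format_penalty=0.5):
    """Subtract illustrative penalties from raw 0/1 verification scores.

    Assumed inputs: per-sample raw scores, 1/0 flags marking repetitive
    responses, and 1/0 flags marking format-compliant responses.
    """
    return [
        s - repetition_penalty * r - format_penalty * (1 - f)
        for s, r, f in zip(scores, repetition_flags, format_ok_flags)
    ]


# Three samples: correct, correct but repetitive, wrong with bad formatting
print(apply_penalties([1.0, 1.0, 0.0], [0, 1, 0], [1, 1, 0]))  # [1.0, 0.5, -0.5]
```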

Usage Examples

Reward Worker in Pipeline

# Reward workers are typically created and called by the RLVR pipeline automatically.
# Manual usage for illustration:

from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker
from roll.distributed.scheduler.protocol import DataProto

# Worker is initialized in the reward cluster
worker = MathRuleRewardWorker(worker_config=reward_config)

# Compute rewards on a batch of generated responses
reward_data = worker.compute_rewards(data=batch_with_responses)

# Access reward scores
rewards = reward_data.batch["response_level_rewards"]  # shape: (batch_size,)
