
Implementation:Alibaba ROLL MathRuleRewardWorker Compute Rewards

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Reward_Modeling
Last Updated 2026-02-07 20:00 GMT

Overview

A concrete rule-based reward worker for mathematical answer verification, provided by the Alibaba ROLL library.

Description

The MathRuleRewardWorker class computes rewards for math-domain responses using symbolic verification. It parses each generated response for a final mathematical expression, checks it against the ground-truth answer with SymPy-based equivalence checking, and subtracts penalties for repetition and format non-compliance. The worker runs as a Ray actor within a reward cluster and scores batches of generated responses in parallel.
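To illustrate the verification step, here is a minimal sketch of SymPy-based equivalence checking. This is not ROLL's actual implementation: the function name `math_equal` is illustrative, and extraction of the answer from the full response (e.g. from a `\boxed{}` span) is omitted; the sketch assumes both sides are plain expressions.

```python
# Hedged sketch of rule-based math verification; not ROLL's actual code.
import sympy
from sympy.parsing.sympy_parser import parse_expr


def math_equal(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the two expressions are symbolically equivalent, else 0.0."""
    try:
        # Symbolic difference simplifying to zero means equivalence,
        # so "1/2" and "0.5" or "2*x + x" and "3*x" both match.
        diff = sympy.simplify(parse_expr(prediction) - parse_expr(ground_truth))
        return 1.0 if diff == 0 else 0.0
    except Exception:
        # Unparseable responses fall back to exact string comparison.
        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0


print(math_equal("2*x + x", "3*x"))  # 1.0
print(math_equal("1/2", "0.5"))      # 1.0
print(math_equal("4", "5"))          # 0.0
```

Symbolic comparison is why rule-based math rewards tolerate surface-form differences (fractions vs. decimals, unsimplified terms) that exact string matching would reject.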

ROLL also provides other reward workers for different domains:

  • CodeSandboxRewardWorker - Sandboxed code execution with test cases
  • LLMJudgeRewardWorker - LLM-based quality evaluation
  • GeneralRuleRewardWorker - IFEval and instruction-following checks
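Conceptually, the pipeline routes each data domain to one of these workers. The sketch below pictures that routing as a simple lookup; the worker class names come from ROLL, but the registry dict and the domain keys are illustrative assumptions, not ROLL's configuration schema.

```python
# Illustrative registry: only the worker class names come from ROLL;
# the dict and domain keys are assumptions for illustration.
REWARD_WORKERS = {
    "math": "MathRuleRewardWorker",
    "code_sandbox": "CodeSandboxRewardWorker",
    "llm_judge": "LLMJudgeRewardWorker",
    "general": "GeneralRuleRewardWorker",
}


def select_worker(domain: str) -> str:
    """Pick the reward worker class registered for a data domain."""
    return REWARD_WORKERS.get(domain, "GeneralRuleRewardWorker")


print(select_worker("math"))  # MathRuleRewardWorker
```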

Usage

This worker is instantiated automatically as part of the reward cluster when the RLVR pipeline is configured with math-domain data, and it is invoked during the reward-computation step of each training iteration.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/rlvr/rewards/math_rule_reward_worker.py
  • Lines: L242-261

Signature

class MathRuleRewardWorker(Worker):
    def __init__(self, worker_config: WorkerConfig) -> None:
        """Initialize math rule reward worker with tokenizer and reward functions."""

    @register(dispatch_mode=Dispatch.DP_MP_COMPUTE, clear_cache=False)
    def compute_rewards(self, data: DataProto) -> DataProto:
        """
        Compute rewards using mathematical rule-based evaluation.

        Args:
            data: DataProto containing generated responses with input_ids,
                  attention_mask, and ground truth answers

        Returns:
            DataProto with response_level_rewards tensor (per-sample float scores)
        """

Import

from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker

I/O Contract

Inputs

Name          Type       Required  Description
data          DataProto  Yes       Batch with generated responses (input_ids, attention_mask, decoded text)
ground_truth  str        Yes       Expected correct answer (embedded in DataProto metadata)

Outputs

Name                    Type          Description
response_level_rewards  torch.Tensor  Per-sample reward scores (0.0 or 1.0 for binary verification)
scores                  torch.Tensor  Raw verification scores before penalty adjustments
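The two output tensors differ by the penalty adjustments. A minimal sketch of how raw verification scores might be combined with repetition and format penalties (plain Python lists for brevity; ROLL operates on torch tensors, and the function name, penalty names, and weights here are illustrative assumptions):

```python
def apply_penalties(scores, repetition_flags, format_ok_flags,
                    repetition_penalty=0.5, format_penalty=0.5):
    """Subtract illustrative penalties from raw 0/1 verification scores.

    Assumed inputs: per-sample raw scores, 1/0 flags marking repetitive
    responses, and 1/0 flags marking format-compliant responses.
    """
    return [
        s - repetition_penalty * r - format_penalty * (1 - f)
        for s, r, f in zip(scores, repetition_flags, format_ok_flags)
    ]


# Three samples: correct, correct but repetitive, wrong with bad formatting
print(apply_penalties([1.0, 1.0, 0.0], [0, 1, 0], [1, 1, 0]))  # [1.0, 0.5, -0.5]
```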

Usage Examples

Reward Worker in Pipeline

# Reward workers are typically created and called by the RLVR pipeline automatically.
# Manual usage for illustration:

from roll.pipeline.rlvr.rewards.math_rule_reward_worker import MathRuleRewardWorker
from roll.distributed.scheduler.protocol import DataProto

# Worker is initialized in the reward cluster
worker = MathRuleRewardWorker(worker_config=reward_config)

# Compute rewards on a batch of generated responses
reward_data = worker.compute_rewards(data=batch_with_responses)

# Access reward scores
rewards = reward_data.batch["response_level_rewards"]  # shape: (batch_size,)
