
Implementation:Hpcaitech ColossalAI RLVRRewardModel

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, NLP
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by ColossalChat, for computing verifiable rewards by combining multiple reward functions.

Description

RLVRRewardModel wraps a list of callable reward functions, applying each to generated responses and aggregating their scores. The VerifiableReward class in the distributed RL pipeline provides a similar interface with support for gt_answer-based and test_cases-based reward functions.
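The wrap-and-aggregate pattern can be sketched as a minimal, torch-free re-implementation. Everything below (the class name, the string-based call signature, and the summing aggregation) is illustrative and not the library's actual internals, which operate on token IDs:

```python
from typing import Callable, List


class MultiRewardAggregator:
    """Illustrative sketch: apply each reward function to every
    (response, ground truth) pair and sum the per-function scores."""

    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        self.reward_fn_list = reward_fn_list
        self.kwargs = kwargs  # forwarded to every reward function

    def __call__(self, responses: List[str], gt_answers: List[str]) -> List[float]:
        totals = [0.0] * len(responses)
        for fn in self.reward_fn_list:
            for i, (resp, gt) in enumerate(zip(responses, gt_answers)):
                totals[i] += fn(resp, gt, **self.kwargs)
        return totals


# Toy reward: 1.0 when the response contains the ground-truth answer.
def exact_match_reward(response: str, gt: str) -> float:
    return 1.0 if gt in response else 0.0


agg = MultiRewardAggregator([exact_match_reward])
print(agg(["The answer is 42.", "No idea."], ["42", "42"]))  # [1.0, 0.0]
```

The real RLVRRewardModel follows the same shape but returns a torch.Tensor computed from input_ids sliced by response_start/response_end.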

Usage

Instantiate with a list of reward functions (e.g., math_reward, code_reward), then call it with the generated responses and their ground-truth answers.
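To make the two function roles concrete, here are hypothetical stand-ins for math_reward and code_reward. These stubs are assumptions for illustration only; the real implementations live in coati.distributed.reward.reward_fn and have different signatures:

```python
def math_reward_stub(response: str, gt: str) -> float:
    # Toy check standing in for math_reward: reward 1.0 when the
    # final token of the response matches the ground-truth answer.
    return 1.0 if response.strip().split()[-1] == gt else 0.0


def code_reward_stub(response: str, gt: str) -> float:
    # Toy check standing in for code_reward: reward 1.0 when the
    # response at least parses as valid Python source.
    try:
        compile(response, "<response>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


print(math_reward_stub("The final answer is 7", "7"))   # 1.0
print(code_reward_stub("def f(x):\n    return x", ""))  # 1.0
```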

Code Reference

Source Location

  • Repository: ColossalAI
  • File (RLVRRewardModel): applications/ColossalChat/coati/models/rlvr_reward_model.py
  • Lines: 10-50
  • File (VerifiableReward): applications/ColossalChat/coati/distributed/reward/verifiable_reward.py
  • Lines: 11-71

Signature

class RLVRRewardModel:
    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        """
        Args:
            reward_fn_list: List of reward functions
            **kwargs: Additional keyword args for reward functions
        """

    def __call__(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        response_start: List = None,
        response_end: List = None,
        gt_answer: List = None,
    ) -> torch.Tensor:
        """Compute rewards for each sample using all reward functions."""

class VerifiableReward:
    def __init__(self, reward_fns: List[callable], **kwargs):
        """Distributed version with support for gt_answer and test_cases."""

    def __call__(
        self,
        input_ids: torch.LongTensor,
        gt_answer: List[str] = None,
        test_cases: List[str] = None,
        response_idx: List[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Returns tensor of shape (batch_size, 3) with reward scores."""

Import

from coati.models.rlvr_reward_model import RLVRRewardModel
from coati.distributed.reward.verifiable_reward import VerifiableReward
from coati.distributed.reward.reward_fn import math_reward, code_reward

I/O Contract

Inputs

Name            Type              Required  Description
reward_fn_list  List[Callable]    Yes       List of reward functions (math_reward, code_reward, etc.)
input_ids       torch.LongTensor  Yes       Tokenized responses to evaluate
gt_answer       List[str]         No        Ground-truth answers for verification
test_cases      List[str]         No        Code test cases for execution-based verification

Outputs

Name     Type          Description
rewards  torch.Tensor  Reward scores per sample (shape: [batch_size] or [batch_size, num_fns])

Usage Examples

from coati.distributed.reward.verifiable_reward import VerifiableReward
from coati.distributed.reward.reward_fn import math_reward

# Create verifiable reward with math checking
reward_model = VerifiableReward(
    reward_fns=[math_reward],
)

# Score generated responses
rewards = reward_model(
    input_ids=generated_ids,
    gt_answer=ground_truth_answers,
    response_idx=response_indices,
)
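For the test_cases path, an execution-based reward can be sketched as the fraction of assert-style test cases the generated code passes. This is a simplified, unsandboxed assumption about how such a reward works; the real code_reward should be expected to isolate execution:

```python
def run_test_cases(code: str, test_cases: list) -> float:
    """Illustrative execution-based reward: run generated code, then
    execute each assert-style test case against its namespace and
    return the pass fraction. Sketch only: no sandboxing or timeouts."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0  # code that fails to run earns nothing
    passed = 0
    for case in test_cases:
        try:
            exec(case, namespace)
            passed += 1
        except Exception:
            pass  # a failing assert (or error) simply scores 0 for this case
    return passed / len(test_cases)


generated = "def add(a, b):\n    return a + b"
cases = ["assert add(1, 2) == 3", "assert add(0, 0) == 1"]
print(run_test_cases(generated, cases))  # 0.5
```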
