
Implementation:Hpcaitech ColossalAI RLVRRewardModel

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, NLP
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by ColossalChat, for computing verifiable rewards by combining multiple reward functions.

Description

RLVRRewardModel wraps a list of callable reward functions, applying each to generated responses and aggregating their scores. The VerifiableReward class in the distributed RL pipeline provides a similar interface with support for gt_answer-based and test_cases-based reward functions.
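The wrap-and-aggregate pattern can be sketched as a minimal, torch-free re-implementation. Everything below (the class name, the string-based call signature, and the summing aggregation) is illustrative and not the library's actual internals, which operate on token IDs:

```python
from typing import Callable, List


class MultiRewardAggregator:
    """Illustrative sketch: apply each reward function to every
    (response, ground truth) pair and sum the per-function scores."""

    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        self.reward_fn_list = reward_fn_list
        self.kwargs = kwargs  # forwarded to every reward function

    def __call__(self, responses: List[str], gt_answers: List[str]) -> List[float]:
        totals = [0.0] * len(responses)
        for fn in self.reward_fn_list:
            for i, (resp, gt) in enumerate(zip(responses, gt_answers)):
                totals[i] += fn(resp, gt, **self.kwargs)
        return totals


# Toy reward: 1.0 when the response contains the ground-truth answer.
def exact_match_reward(response: str, gt: str) -> float:
    return 1.0 if gt in response else 0.0


agg = MultiRewardAggregator([exact_match_reward])
print(agg(["The answer is 42.", "No idea."], ["42", "42"]))  # [1.0, 0.0]
```

The real RLVRRewardModel follows the same shape but returns a torch.Tensor computed from input_ids sliced by response_start/response_end.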

Usage

Instantiate with a list of reward functions (e.g., math_reward, code_reward), then call it with the generated responses and their ground-truth answers.
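To make the two function roles concrete, here are hypothetical stand-ins for math_reward and code_reward. These stubs are assumptions for illustration only; the real implementations live in coati.distributed.reward.reward_fn and have different signatures:

```python
def math_reward_stub(response: str, gt: str) -> float:
    # Toy check standing in for math_reward: reward 1.0 when the
    # final token of the response matches the ground-truth answer.
    return 1.0 if response.strip().split()[-1] == gt else 0.0


def code_reward_stub(response: str, gt: str) -> float:
    # Toy check standing in for code_reward: reward 1.0 when the
    # response at least parses as valid Python source.
    try:
        compile(response, "<response>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


print(math_reward_stub("The final answer is 7", "7"))   # 1.0
print(code_reward_stub("def f(x):\n    return x", ""))  # 1.0
```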

Code Reference

Source Location

  • Repository: ColossalAI
  • File (RLVRRewardModel): applications/ColossalChat/coati/models/rlvr_reward_model.py
  • Lines: 10-50
  • File (VerifiableReward): applications/ColossalChat/coati/distributed/reward/verifiable_reward.py
  • Lines: 11-71

Signature

class RLVRRewardModel:
    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        """
        Args:
            reward_fn_list: List of reward functions
            **kwargs: Additional keyword args for reward functions
        """

    def __call__(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        response_start: List = None,
        response_end: List = None,
        gt_answer: List = None,
    ) -> torch.Tensor:
        """Compute rewards for each sample using all reward functions."""

class VerifiableReward:
    def __init__(self, reward_fns: List[callable], **kwargs):
        """Distributed version with support for gt_answer and test_cases."""

    def __call__(
        self,
        input_ids: torch.LongTensor,
        gt_answer: List[str] = None,
        test_cases: List[str] = None,
        response_idx: List[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Returns tensor of shape (batch_size, 3) with reward scores."""

Import

from coati.models.rlvr_reward_model import RLVRRewardModel
from coati.distributed.reward.verifiable_reward import VerifiableReward
from coati.distributed.reward.reward_fn import math_reward, code_reward

I/O Contract

Inputs

Name            Type              Required  Description
reward_fn_list  List[Callable]    Yes       List of reward functions (math_reward, code_reward, etc.)
input_ids       torch.LongTensor  Yes       Tokenized responses to evaluate
gt_answer       List[str]         No        Ground-truth answers for verification
test_cases      List[str]         No        Code test cases for execution-based verification

Outputs

Name     Type          Description
rewards  torch.Tensor  Reward scores per sample (shape: [batch_size] or [batch_size, num_fns])

Usage Examples

from coati.distributed.reward.verifiable_reward import VerifiableReward
from coati.distributed.reward.reward_fn import math_reward

# Create verifiable reward with math checking
reward_model = VerifiableReward(
    reward_fns=[math_reward],
)

# Score generated responses
rewards = reward_model(
    input_ids=generated_ids,
    gt_answer=ground_truth_answers,
    response_idx=response_indices,
)
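For the test_cases path, an execution-based reward can be sketched as the fraction of assert-style test cases the generated code passes. This is a simplified, unsandboxed assumption about how such a reward works; the real code_reward should be expected to isolate execution:

```python
def run_test_cases(code: str, test_cases: list) -> float:
    """Illustrative execution-based reward: run generated code, then
    execute each assert-style test case against its namespace and
    return the pass fraction. Sketch only: no sandboxing or timeouts."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0  # code that fails to run earns nothing
    passed = 0
    for case in test_cases:
        try:
            exec(case, namespace)
            passed += 1
        except Exception:
            pass  # a failing assert (or error) simply scores 0 for this case
    return passed / len(test_cases)


generated = "def add(a, b):\n    return a + b"
cases = ["assert add(1, 2) == 3", "assert add(0, 0) == 1"]
print(run_test_cases(generated, cases))  # 0.5
```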
