Implementation:Hpcaitech ColossalAI RLVRRewardModel Class

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Reinforcement Learning, RLHF, Verifiable Reward
Last Updated	2026-02-09 00:00 GMT

Overview

A verifiable reward model class that computes rewards by applying a list of callable reward functions to model outputs.

Description

RLVRRewardModel (Reinforcement Learning with Verifiable Reward) is a lightweight reward model class that does not use a neural network. Instead, it aggregates the rewards returned by a configurable list of callable reward functions, each applied to every sample in the batch. Each reward function receives the input IDs, attention mask, response boundaries, and ground truth answer for a sample, which allows rule-based or programmatic reward computation. The class implements to() and eval() as no-ops, so it is compatible with the standard model interface expected by the ColossalChat training pipeline.
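The aggregation pattern described above can be sketched as follows. This is a minimal illustration, not the ColossalChat source: the name SketchRewardModel, the use of plain Python lists in place of tensors (so the snippet runs without PyTorch), summation as the aggregation rule, and forwarding the constructor's keyword arguments to every reward function are all assumptions.

```python
from typing import Callable, List


def _pick(values, i):
    """Return values[i], or None when the optional argument was not supplied."""
    return None if values is None else values[i]


class SketchRewardModel:
    """Hypothetical stand-in for a network-free reward model (not the real class)."""

    def __init__(self, reward_fn_list: List[Callable[..., float]], **kwargs) -> None:
        self.reward_fn_list = reward_fn_list
        # Assumption: extra keyword arguments are forwarded to each reward function.
        self.kwargs = kwargs

    def __call__(self, input_ids, attention_mask=None, response_start=None,
                 response_end=None, gt_answer=None) -> List[float]:
        rewards = [0.0] * len(input_ids)
        for i in range(len(input_ids)):
            for reward_fn in self.reward_fn_list:
                # Assumption: per-sample rewards are aggregated by summation.
                rewards[i] += reward_fn(
                    input_ids[i],
                    _pick(attention_mask, i),
                    response_start=_pick(response_start, i),
                    response_end=_pick(response_end, i),
                    gt_answer=_pick(gt_answer, i),
                    **self.kwargs,
                )
        return rewards

    def to(self, device):
        # No-op: there are no parameters to move to a device.
        return self

    def eval(self):
        # No-op: there is no train/eval mode to switch.
        return self


# Toy reward functions, for illustration only.
def always_one(input_ids, attention_mask, **kwargs):
    return 1.0


def last_token_matches(input_ids, attention_mask, gt_answer=None, **kwargs):
    return 1.0 if input_ids[-1] == gt_answer else 0.0


model = SketchRewardModel([always_one, last_token_matches]).to("cpu").eval()
rewards = model([[1, 2, 3], [4, 5, 6]], gt_answer=[3, 9])
print(rewards)
```

Because to and eval return the object itself in this sketch, calls can be chained the same way they would be for a neural-network model.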

Usage

Use this model when training with verifiable rewards (e.g., math problem solving, code generation) where correctness can be determined programmatically rather than via a learned reward model.
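As an illustration of what "determined programmatically" can mean for math problems, the sketch below scores a decoded response by extracting its last \boxed{...} expression and comparing it with the ground truth. The names extract_boxed and math_accuracy_reward are hypothetical, the boxed-answer convention is an assumption about the prompt format, and the function works on already-decoded text rather than on the token-level arguments that RLVRRewardModel passes to its reward functions.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces (nested LaTeX such as \frac{1}{2}
# is deliberately not handled in this sketch).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def math_accuracy_reward(response_text: str, gt_answer: str) -> float:
    """Hypothetical rule: 1.0 for an exact string match with the ground truth."""
    answer = extract_boxed(response_text)
    if answer is None:
        return 0.0
    return 1.0 if answer == gt_answer.strip() else 0.0
```

Exact string matching is the simplest possible check; a real verifier would likely normalize equivalent forms (for example "0.5" and "1/2") before comparing.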

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/models/rlvr_reward_model.py
Lines: 1-50

Signature

from typing import Callable, List, Optional

import torch


class RLVRRewardModel:
    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None: ...

    def __call__(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        response_start: List = None,
        response_end: List = None,
        gt_answer: List = None,
    ) -> torch.Tensor: ...

    def to(self, device): ...

    def eval(self): ...

Import

from coati.models.rlvr_reward_model import RLVRRewardModel

I/O Contract

Inputs (__call__)

Name	Type	Required	Description
input_ids	torch.LongTensor	Yes	Input token IDs of shape (B, S)
attention_mask	torch.Tensor	No	Attention mask of shape (B, S)
response_start	List	No	Start positions of the response for each sample in the batch
response_end	List	No	End positions of the response for each sample in the batch
gt_answer	List	No	Ground truth answers for each sample in the batch

Outputs (__call__)

Name	Type	Description
rewards	torch.Tensor	Aggregated reward scores of shape (B,), one per sequence
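To make the contract concrete, the sketch below shows how a per-sample reward function might use response_start and response_end to isolate the response tokens, and how a batch of B sequences maps to B scores. Treating response_end as an inclusive index, the helper names response_tokens and exact_match_reward, and the toy scoring rule are all assumptions; plain lists stand in for tensors so the snippet runs without PyTorch.

```python
from typing import List, Sequence


def response_tokens(input_ids: Sequence[int], response_start: int, response_end: int) -> List[int]:
    # Assumption: response_end is an inclusive index, hence the "+ 1".
    return list(input_ids[response_start : response_end + 1])


def exact_match_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    """Toy rule: 1.0 when the response tokens equal the ground-truth tokens."""
    tokens = response_tokens(input_ids, response_start, response_end)
    return 1.0 if tokens == list(gt_answer) else 0.0


# A toy batch with B = 2 sequences of length S = 5 (0 is used as padding).
batch_input_ids = [[101, 7, 8, 9, 0], [101, 5, 6, 0, 0]]
batch_mask = [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
starts = [1, 1]
ends = [3, 2]
ground_truths = [[7, 8, 9], [5, 5]]

# One score per sequence, mirroring the (B,) output shape.
scores = [
    exact_match_reward(ids, mask, s, e, gt)
    for ids, mask, s, e, gt in zip(batch_input_ids, batch_mask, starts, ends, ground_truths)
]
print(scores)
```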

Usage Examples

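from coati.models.rlvr_reward_model import RLVRRewardModel

# Define a simple reward function; it is applied to each sample in the batch
def accuracy_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Custom reward logic goes here. Placeholder: replace this with a real
    # check of the response against gt_answer.
    answer_is_correct = False
    return 1.0 if answer_is_correct else 0.0

# Placeholder second reward function (e.g., a check on the response format)
def format_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    return 0.0

# Create RLVR reward model with multiple reward functions
reward_model = RLVRRewardModel(reward_fn_list=[accuracy_reward, format_reward])

# Compute rewards for a batch. input_ids and attention_mask are (B, S) tensors;
# response_starts, response_ends, and ground_truths are per-sample lists
# prepared earlier in the training pipeline.
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=response_starts,
    response_end=response_ends,
    gt_answer=ground_truths,
)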

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
