Implementation:Hpcaitech ColossalAI RLVRRewardModel Class

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, RLHF, Verifiable Reward
Last Updated: 2026-02-09 00:00 GMT

Overview

A verifiable reward model class that computes rewards by applying a list of callable reward functions to model outputs.

Description

RLVRRewardModel (Reinforcement Learning with Verifiable Reward) is a lightweight reward model class that does not use a neural network. Instead, it aggregates rewards from a configurable list of callable reward functions, each applied to every sample in the batch. The reward functions receive the input IDs, attention mask, response boundaries, and ground-truth answer, which allows rule-based or programmatic reward computation. The class implements to() and eval() as no-ops, so it is compatible with the standard model interface expected by the ColossalChat training pipeline.
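The aggregation pattern described above can be sketched without any framework dependency. This is a minimal illustration, not the ColossalAI source: SketchRewardModel and the two reward functions are hypothetical names, plain Python lists stand in for tensors, and the sketch assumes that each reward function is called once per sample and that the per-function rewards are summed.

```python
from typing import Callable, List, Optional, Sequence


def _pick(values: Optional[Sequence], i: int):
    """Return the i-th per-sample value, or None when the argument was omitted."""
    return values[i] if values is not None else None


class SketchRewardModel:
    """Hypothetical stand-in for a rule-based reward model (no neural network)."""

    def __init__(self, reward_fn_list: List[Callable[..., float]], **kwargs) -> None:
        self.reward_fn_list = reward_fn_list
        self.kwargs = kwargs  # forwarded unchanged to every reward function

    def __call__(self, input_ids, attention_mask=None, response_start=None,
                 response_end=None, gt_answer=None) -> List[float]:
        rewards = [0.0] * len(input_ids)
        for i in range(len(input_ids)):
            for reward_fn in self.reward_fn_list:
                # Assumption: rewards from all functions are summed per sample.
                rewards[i] += reward_fn(
                    input_ids[i],
                    _pick(attention_mask, i),
                    response_start=_pick(response_start, i),
                    response_end=_pick(response_end, i),
                    gt_answer=_pick(gt_answer, i),
                    **self.kwargs,
                )
        return rewards

    def to(self, device):
        # No-op: nothing to move. Returning self is an assumption of this sketch.
        return self

    def eval(self):
        # No-op: there is no train/eval mode to switch.
        return self


def length_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Illustrative rule: one point per response token (end treated as exclusive here).
    return float(response_end - response_start)


def match_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Illustrative rule: exact match between the response span and the ground truth.
    return 1.0 if input_ids[response_start:response_end] == gt_answer else 0.0


model = SketchRewardModel([length_reward, match_reward]).to("cpu").eval()
rewards = model(
    input_ids=[[1, 2, 3, 4], [1, 5, 6, 0]],
    attention_mask=[[1, 1, 1, 1], [1, 1, 1, 0]],
    response_start=[2, 1],
    response_end=[4, 3],
    gt_answer=[[3, 4], [9, 9]],
)
print(rewards)
```

Because to and eval simply return the object in this sketch, it can be chained like a regular model even though it holds no parameters.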

Usage

Use this model when training with verifiable rewards (e.g., math problem solving, code generation) where correctness can be determined programmatically rather than via a learned reward model.
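As an illustration of a programmatically checkable reward, the following sketch scores a math response by comparing the last number in the response text with the ground truth. It is an assumption-laden example rather than part of ColossalChat: extract_final_number and math_answer_reward are hypothetical helpers, they operate on already-decoded text rather than token IDs, and they assume the ground truth is a numeric string.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last integer or decimal number in the text, or None if absent."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def math_answer_reward(response_text: str, gt_answer: str) -> float:
    """Return 1.0 when the final number in the response equals gt_answer, else 0.0."""
    predicted = extract_final_number(response_text)
    if predicted is None:
        return 0.0  # nothing to verify, so no reward
    # Compare numerically so that "3.50" and "3.5" count as the same answer.
    return 1.0 if float(predicted) == float(gt_answer) else 0.0


print(math_answer_reward("2 + 2 = 4, so the answer is 4", "4"))
print(math_answer_reward("I do not know", "4"))
```

A rule like this needs no learned parameters, which is what makes it usable as one entry in a list of reward functions.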

Code Reference

Source Location

Signature

class RLVRRewardModel:
    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        ...

    def __call__(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        response_start: List = None,
        response_end: List = None,
        gt_answer: List = None,
    ) -> torch.Tensor:
        ...

    def to(self, device):
        ...

    def eval(self):
        ...

Import

from coati.models.rlvr_reward_model import RLVRRewardModel

I/O Contract

Inputs (__call__)

Name | Type | Required | Description
input_ids | torch.LongTensor | Yes | Input token IDs of shape (B, S)
attention_mask | torch.Tensor | No | Attention mask of shape (B, S)
response_start | List | No | Start positions of the response for each sample in the batch
response_end | List | No | End positions of the response for each sample in the batch
gt_answer | List | No | Ground truth answers for each sample in the batch
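The per-sample lists in this contract are expected to line up with the batch dimension B. A small pre-flight check, sketched below, can catch mismatches before rewards are computed. check_batch_inputs is a hypothetical helper and not part of the library; it assumes that each start position does not exceed its end position and that both fall within the sequence length S.

```python
from typing import Optional, Sequence


def check_batch_inputs(
    batch_size: int,
    seq_len: int,
    response_start: Optional[Sequence[int]] = None,
    response_end: Optional[Sequence[int]] = None,
    gt_answer: Optional[Sequence] = None,
) -> None:
    """Raise ValueError if per-sample lists disagree with the batch shape (B, S)."""
    per_sample = (
        ("response_start", response_start),
        ("response_end", response_end),
        ("gt_answer", gt_answer),
    )
    for name, values in per_sample:
        # Each optional list, when given, needs exactly one entry per sample.
        if values is not None and len(values) != batch_size:
            raise ValueError(f"{name} has {len(values)} entries, expected {batch_size}")
    if response_start is not None and response_end is not None:
        for i, (start, end) in enumerate(zip(response_start, response_end)):
            # Assumption: 0 <= start <= end <= S for every sample.
            if not 0 <= start <= end <= seq_len:
                raise ValueError(f"sample {i}: invalid response span ({start}, {end})")


# Passes silently for a consistent batch with B=2 and S=6.
check_batch_inputs(2, 6, response_start=[2, 2], response_end=[5, 4], gt_answer=["a", "b"])
```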

Outputs (__call__)

Name | Type | Description
rewards | torch.Tensor | Aggregated reward scores of shape (B,), one per sequence

Usage Examples

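import torch
from coati.models.rlvr_reward_model import RLVRRewardModel

# Define simple reward functions. The checks are illustrative placeholders and assume
# each function receives one sample's 1-D token IDs; replace them with task logic.
def accuracy_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    response_ids = input_ids[response_start:response_end].tolist()
    answer_is_correct = response_ids == list(gt_answer)
    return 1.0 if answer_is_correct else 0.0

def format_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Placeholder: reward any non-empty response span
    return 1.0 if response_end > response_start else 0.0

# Create RLVR reward model with multiple reward functions
reward_model = RLVRRewardModel(reward_fn_list=[accuracy_reward, format_reward])

# Illustrative batch with B=2, S=6 (0 is used as padding here)
input_ids = torch.tensor([[1, 2, 3, 4, 5, 0], [1, 2, 6, 7, 0, 0]])
attention_mask = (input_ids != 0).long()
response_starts = [2, 2]
response_ends = [5, 4]
ground_truths = [[3, 4, 5], [6, 8]]  # token-ID answers, assumed for this example

# Compute rewards for a batch
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=response_starts,
    response_end=response_ends,
    gt_answer=ground_truths,
)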
