Implementation:Hpcaitech ColossalAI RLVRRewardModel Class
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, Verifiable Reward |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A verifiable reward model class that computes rewards by applying a list of callable reward functions to model outputs.
Description
RLVRRewardModel (Reinforcement Learning with Verifiable Reward) is a lightweight reward model class that does not use a neural network. Instead, it aggregates rewards from a configurable list of callable reward functions, each applied to every sample in the batch. Each reward function receives the sample's input IDs, attention mask, response boundaries, and ground truth answer, which allows rule-based or programmatic reward computation. The class implements to() and eval() as no-ops, so it can be used wherever the ColossalChat training pipeline expects the standard model interface.
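The behavior described above can be sketched without ColossalChat or PyTorch. The stand-in class below (SketchRewardModel, a hypothetical name) uses plain Python lists in place of tensors. It assumes that aggregation means summing the outputs of all reward functions for each sample and that to() and eval() return the object itself; the actual class may differ on both points. The two toy reward functions are also illustrative only.

```python
from typing import Callable, List


class SketchRewardModel:
    """Hypothetical stand-in for RLVRRewardModel; lists replace tensors."""

    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
        self.reward_fn_list = reward_fn_list
        self.kwargs = kwargs  # forwarded to every reward function

    def __call__(self, input_ids, attention_mask=None, response_start=None,
                 response_end=None, gt_answer=None) -> List[float]:
        rewards = []
        for i in range(len(input_ids)):
            total = 0.0
            for reward_fn in self.reward_fn_list:
                # Assumption: per-sample rewards are summed across functions.
                total += reward_fn(
                    input_ids[i],
                    attention_mask[i] if attention_mask is not None else None,
                    response_start=response_start[i],
                    response_end=response_end[i],
                    gt_answer=gt_answer[i],
                    **self.kwargs,
                )
            rewards.append(total)
        return rewards

    def to(self, device):
        return self  # nothing to move; assumed to return self for chaining

    def eval(self):
        return self  # no train/eval mode to switch


def length_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Toy rule: reward responses of at most max_len tokens (inclusive end assumed).
    length = response_end - response_start + 1
    return 1.0 if length <= kwargs.get("max_len", 4) else 0.0


def match_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Toy rule: the response tokens must equal the ground-truth tokens.
    return 1.0 if input_ids[response_start:response_end + 1] == gt_answer else 0.0


model = SketchRewardModel([length_reward, match_reward], max_len=2).to("cpu").eval()
batch = [[5, 6, 7, 8], [5, 6, 9, 9]]
rewards = model(batch, response_start=[2, 2], response_end=[3, 3], gt_answer=[[7, 8], [1, 2]])
print(rewards)  # one summed score per sample
```

In this sketch, the response boundaries and ground truth must be supplied for every sample because each reward function indexes into them.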
Usage
Use this model when training with verifiable rewards (e.g., math problem solving, code generation) where correctness can be determined programmatically rather than via a learned reward model.
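As an illustration of programmatic correctness checking for a math task, the sketch below scores a response by comparing its last number with the ground truth. The helpers extract_final_number and math_accuracy_reward are hypothetical names, and the sketch operates on already-decoded text rather than on the token IDs that the class passes to its reward functions; a real reward function would first decode the response span.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last integer or decimal number in text, or None if absent."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def math_accuracy_reward(response_text: str, gt_answer: str) -> float:
    """Return 1.0 when the final number in the response equals gt_answer."""
    predicted = extract_final_number(response_text)
    if predicted is None:
        return 0.0
    # Compare numerically so that "4" and "4.0" count as the same answer.
    return 1.0 if float(predicted) == float(gt_answer) else 0.0


print(math_accuracy_reward("2 + 2 = 4, so the answer is 4.0", "4"))
print(math_accuracy_reward("I am not sure", "4"))
```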
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/models/rlvr_reward_model.py
- Lines: 1-50
Signature
class RLVRRewardModel:
    def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
    def __call__(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        response_start: List = None,
        response_end: List = None,
        gt_answer: List = None,
    ) -> torch.Tensor:
    def to(self, device):
    def eval(self):
Import
from coati.models.rlvr_reward_model import RLVRRewardModel
I/O Contract
Inputs (__call__)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | Yes | Input token IDs of shape (B, S) |
| attention_mask | torch.Tensor | No | Attention mask of shape (B, S) |
| response_start | List | No | Start positions of the response for each sample in the batch |
| response_end | List | No | End positions of the response for each sample in the batch |
| gt_answer | List | No | Ground truth answers for each sample in the batch |
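The relationship between the batched inputs can be pictured with a small sketch. It uses nested lists in place of tensors and assumes that response_end is an inclusive index; the source does not state whether the boundary is inclusive, so this should be checked against the actual data pipeline. The function name response_tokens is hypothetical.

```python
from typing import List


def response_tokens(input_ids: List[List[int]], response_start: List[int],
                    response_end: List[int]) -> List[List[int]]:
    """Slice the response span out of each sequence in a (B, S) batch."""
    spans = []
    for ids, start, end in zip(input_ids, response_start, response_end):
        spans.append(ids[start:end + 1])  # assumption: end is inclusive
    return spans


# Batch of B=2 sequences with S=6 tokens; 0 is used as the padding ID here.
input_ids = [[11, 12, 13, 21, 22, 0], [11, 12, 31, 32, 33, 34]]
response_start = [3, 2]  # one start position per sample
response_end = [4, 5]    # one end position per sample

spans = response_tokens(input_ids, response_start, response_end)
print(spans)
```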
Outputs (__call__)
| Name | Type | Description |
|---|---|---|
| rewards | torch.Tensor | Aggregated reward scores of shape (B,), one per sequence |
Usage Examples
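from coati.models.rlvr_reward_model import RLVRRewardModel

# Define a simple reward function; it is applied to one sample at a time
def accuracy_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    # Custom reward logic. Illustrative assumption: gt_answer holds the expected
    # response token IDs and response_end is an inclusive index.
    response = input_ids[response_start : response_end + 1].tolist()
    answer_is_correct = response == list(gt_answer)
    return 1.0 if answer_is_correct else 0.0

# A second reward function with the same signature (placeholder logic)
def format_reward(input_ids, attention_mask, response_start, response_end, gt_answer, **kwargs):
    return 1.0 if response_end >= response_start else 0.0

# Create RLVR reward model with multiple reward functions
reward_model = RLVRRewardModel(reward_fn_list=[accuracy_reward, format_reward])

# Compute rewards for a batch; input_ids, attention_mask, response_starts,
# response_ends, and ground_truths are prepared by the caller
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=response_starts,
    response_end=response_ends,
    gt_answer=ground_truths,
)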