Implementation:Hpcaitech ColossalAI Math Competition Reward
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, Math Reasoning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A verifiable reward function for math competition tasks that scores model responses based on format validity and answer correctness.
Description
This module implements math_competition_reward_fn, a reward function for RLHF training on math competition problems. The function uses extract_solution from the reward utilities to parse the model's answer from XML-style tags, then checks the response format with validate_response_structure. A correctly formatted response earns 1 point. If the extracted final answer also matches the ground truth answer, the response earns an additional 9 points, for a total of 10. Both strings are normalized before comparison by stripping surrounding whitespace, removing internal spaces, and lowercasing. If no ground truth answer is provided, the reward is zero.
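The scoring rule described above can be sketched in plain Python. This is a minimal illustration, not the module's code: the names normalize, extract_answer, and sketch_reward are hypothetical, the regex-based extraction is an assumption standing in for extract_solution, and the format check is passed in as a boolean rather than computed.

```python
import re
from typing import Optional


def normalize(answer: str) -> str:
    """Strip surrounding whitespace, drop internal spaces, and lowercase."""
    return answer.strip().replace(" ", "").lower()


def extract_answer(response: str) -> Optional[str]:
    """Hypothetical stand-in for extract_solution: take the last <answer> block."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1] if matches else None


def sketch_reward(response: str, gt_answer: Optional[str], format_valid: bool) -> float:
    """Return 0.0, 1.0, or 10.0 following the rule in the description."""
    if gt_answer is None:
        return 0.0  # no ground truth: nothing to verify
    if not format_valid:
        return 0.0  # badly formatted responses earn nothing
    reward = 1.0  # format point
    answer = extract_answer(response)
    if answer is not None and normalize(answer) == normalize(gt_answer):
        reward += 9.0  # correctness bonus, total 10
    return reward


response = "<think>half of one</think><answer> 1/2 </answer>"
print(sketch_reward(response, "1 / 2", format_valid=True))
print(sketch_reward(response, "3/4", format_valid=True))
print(sketch_reward(response, None, format_valid=True))
```

Because normalization is purely textual, mathematically equal answers written differently (for example 0.5 and 1/2) would not match under this rule.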
Usage
Pass this function to RLVRRewardModel (through reward_fn_list) when training models on math competition or reasoning tasks whose answers can be verified against a known ground truth.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/utils/reward_score/competition.py
- Lines: 1-26
Signature
def math_competition_reward_fn(input_ids, attention_mask, **kwargs):
Import
from coati.utils.reward_score.competition import math_competition_reward_fn
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Token IDs for a single sequence |
| attention_mask | torch.Tensor | Yes | Attention mask for the sequence |
| gt_answer | str or None | Yes (via kwargs) | Ground truth answer string; passing None results in zero reward |
| tokenizer | PreTrainedTokenizer | Yes (via kwargs) | Tokenizer for decoding token IDs |
| response_start | int | Yes (via kwargs) | Start index of the response in the token sequence |
| response_end | int | Yes (via kwargs) | End index of the response in the token sequence |
| tags | Dict | Yes (via kwargs) | Tag configuration for response structure validation |
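To make the tags argument concrete, the following sketch shows one way a structure check over such a configuration could work: each tag must appear exactly num_occur times, and the tags must appear in the order they are listed. The function check_tag_structure is hypothetical and its rules are an assumption for illustration; the actual behavior of validate_response_structure may differ.

```python
from typing import Dict


def check_tag_structure(text: str, tags: Dict[str, Dict]) -> bool:
    """Illustrative check: expected tag counts plus listed ordering (assumed rules)."""
    positions = []
    for spec in tags.values():
        # Each tag must occur exactly the configured number of times.
        if text.count(spec["text"]) != spec["num_occur"]:
            return False
        if spec["num_occur"] > 0:
            positions.append(text.find(spec["text"]))
    # First occurrences must follow the order in which the tags are listed.
    return positions == sorted(positions)


tags = {
    "think_start": {"text": "<think>", "num_occur": 1},
    "think_end": {"text": "</think>", "num_occur": 1},
    "answer_start": {"text": "<answer>", "num_occur": 1},
    "answer_end": {"text": "</answer>", "num_occur": 1},
}

print(check_tag_structure("<think>2*21</think><answer>42</answer>", tags))
print(check_tag_structure("<answer>42</answer><think>2*21</think>", tags))
```

The sketch relies on Python dictionaries preserving insertion order, so the order of the entries in tags doubles as the expected order in the response.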
Outputs
| Name | Type | Description |
|---|---|---|
| reward | torch.Tensor | Scalar reward: 0.0 (no ground truth answer or invalid format), 1.0 (valid format, answer does not match), or 10.0 (valid format and matching answer) |
Usage Examples
from coati.utils.reward_score.competition import math_competition_reward_fn
from coati.models.rlvr_reward_model import RLVRRewardModel

# Use as part of an RLVR reward model.
# tokenizer, input_ids, attention_mask, starts, ends, and ground_truths
# are assumed to be prepared by the surrounding training code.
reward_model = RLVRRewardModel(
    reward_fn_list=[math_competition_reward_fn],
    tokenizer=tokenizer,
    tags={
        "think_start": {"text": "<think>", "num_occur": 1},
        "think_end": {"text": "</think>", "num_occur": 1},
        "answer_start": {"text": "<answer>", "num_occur": 1},
        "answer_end": {"text": "</answer>", "num_occur": 1},
    },
)

# Compute rewards for a batch
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=starts,
    response_end=ends,
    gt_answer=ground_truths,
)