Implementation:Hpcaitech ColossalAI Math Competition Reward

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Reinforcement Learning, RLHF, Math Reasoning
Last Updated	2026-02-09 00:00 GMT

Overview

A verifiable reward function for math competition tasks that scores model responses based on format validity and answer correctness.

Description

This module implements math_competition_reward_fn, a verifiable reward function for RLHF training on math competition problems. It decodes the response tokens, uses extract_solution from the reward utilities to parse the model's final answer from XML-style tags, and checks the response format with validate_response_structure. A correctly formatted response earns 1 point, and a further 9 points (10 in total) are added if the extracted final answer matches the ground truth answer after both strings are normalized by stripping leading and trailing whitespace, removing spaces, and lowercasing. If no ground truth answer is provided, the reward is zero.
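The scoring rule described above can be sketched in plain Python. This is a minimal illustration, not the module's code: extract_last_answer, normalize_answer, and score_response are hypothetical helper names, the regex-based extraction only approximates what extract_solution might do, and the sketch assumes the format point and the answer points are added independently, which this page does not state explicitly.

```python
import re
from typing import Optional

# Hypothetical pattern: capture the text between <answer> and </answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_last_answer(response: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def normalize_answer(text: str) -> str:
    """Strip surrounding whitespace, remove spaces, and lowercase."""
    return text.strip().replace(" ", "").lower()


def score_response(
    format_valid: bool,
    final_answer: Optional[str],
    gt_answer: Optional[str],
) -> float:
    """Hypothetical scoring helper: 1 point for format, 9 more for a match."""
    if gt_answer is None:
        # No ground truth means nothing can be verified, so the reward is zero.
        return 0.0
    reward = 0.0
    if format_valid:
        reward += 1.0
    # Assumption: the answer check does not depend on the format check.
    if final_answer is not None and normalize_answer(final_answer) == normalize_answer(gt_answer):
        reward += 9.0
    return reward


response = "<think>3 of 4 parts</think><answer> 3 / 4 </answer>"
answer = extract_last_answer(response)
reward = score_response(format_valid=True, final_answer=answer, gt_answer="3/4")
```

Because both strings pass through the same normalization, answers that differ only in spacing or letter case are treated as equal, while mathematically equivalent forms such as 0.75 and 3/4 would still be scored as a mismatch under this sketch.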

Usage

Use this function as a reward function in the RLVRRewardModel when training models on math competition or reasoning tasks where answers can be verified against known ground truth.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/utils/reward_score/competition.py
Lines: 1-26

Signature

def math_competition_reward_fn(input_ids, attention_mask, **kwargs):

Import

from coati.utils.reward_score.competition import math_competition_reward_fn

I/O Contract

Inputs

Name	Type	Required	Description
input_ids	torch.Tensor	Yes	Token IDs for a single sequence
attention_mask	torch.Tensor	Yes	Attention mask for the sequence
gt_answer	str	Yes (via kwargs)	Ground truth answer string; None results in zero reward
tokenizer	PreTrainedTokenizer	Yes (via kwargs)	Tokenizer for decoding token IDs
response_start	int	Yes (via kwargs)	Start index of the response in the token sequence
response_end	int	Yes (via kwargs)	End index of the response in the token sequence
tags	Dict	Yes (via kwargs)	Tag configuration for response structure validation
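
To make the shape of the tags argument concrete, the following sketch checks that each configured tag appears exactly num_occur times in a decoded response. The function check_tag_counts is a hypothetical stand-in written for illustration; the real validate_response_structure may apply additional rules (for example, on tag ordering) that are not described on this page.

```python
from typing import Any, Dict


def check_tag_counts(response: str, tags: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if every tag text occurs exactly `num_occur` times."""
    for spec in tags.values():
        if response.count(spec["text"]) != spec["num_occur"]:
            return False
    return True


# Same tag layout as the configuration passed to the reward model.
tags = {
    "think_start": {"text": "<think>", "num_occur": 1},
    "think_end": {"text": "</think>", "num_occur": 1},
    "answer_start": {"text": "<answer>", "num_occur": 1},
    "answer_end": {"text": "</answer>", "num_occur": 1},
}

well_formed = check_tag_counts("<think>2+2</think><answer>4</answer>", tags)
missing_think = check_tag_counts("<answer>4</answer>", tags)
```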

Outputs

Name	Type	Description
reward	torch.Tensor	Scalar reward: 0.0 (no answer/bad format), 1.0 (correct format), or 10.0 (correct answer)

Usage Examples

from coati.utils.reward_score.competition import math_competition_reward_fn
from coati.models.rlvr_reward_model import RLVRRewardModel

# Use as part of an RLVR reward model.
# `tokenizer` is assumed to be an already-loaded tokenizer for the policy model.
reward_model = RLVRRewardModel(
    reward_fn_list=[math_competition_reward_fn],
    tokenizer=tokenizer,
    tags={
        "think_start": {"text": "<think>", "num_occur": 1},
        "think_end": {"text": "</think>", "num_occur": 1},
        "answer_start": {"text": "<answer>", "num_occur": 1},
        "answer_end": {"text": "</answer>", "num_occur": 1},
    },
)

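# Compute rewards for a batch.
# `input_ids`, `attention_mask`, `starts`, `ends`, and `ground_truths` are assumed
# to be prepared by the training loop, with one entry per sequence.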
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=starts,
    response_end=ends,
    gt_answer=ground_truths,
)

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
