Implementation:Hpcaitech ColossalAI Math Competition Reward

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, RLHF, Math Reasoning
Last Updated: 2026-02-09 00:00 GMT

Overview

A verifiable reward function for math competition tasks that scores model responses based on format validity and answer correctness.

Description

This module implements math_competition_reward_fn, a verifiable reward function for RLHF training on math competition problems. The function uses extract_solution from the reward utilities to parse the model's answer from XML-style tags, then scores the response in two steps. A response whose structure passes validate_response_structure earns 1 point. It earns 9 more points (10 in total) if the extracted final answer matches the ground truth answer after both strings are normalized by stripping surrounding whitespace, removing spaces, and lowercasing. If no ground truth answer is provided, the reward is zero.
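The scoring rule described above can be sketched in plain Python. This is an illustrative stand-in rather than the module's code: the names normalize_answer, sketch_reward, FORMAT_POINTS, and ANSWER_POINTS are hypothetical, the result is a float instead of a torch.Tensor, and the sketch assumes a badly formatted response scores zero even when its answer would match.

```python
from typing import Optional

FORMAT_POINTS = 1.0  # awarded for a well-formed response
ANSWER_POINTS = 9.0  # added on top when the answer also matches


def normalize_answer(text: str) -> str:
    """Strip surrounding whitespace, remove spaces, and lowercase."""
    return text.strip().replace(" ", "").lower()


def sketch_reward(
    format_valid: bool,
    final_answer: Optional[str],
    gt_answer: Optional[str],
) -> float:
    """Hypothetical stand-in for the two-step scoring rule."""
    # No ground truth: nothing to verify against, so no reward.
    if gt_answer is None:
        return 0.0
    # Assumption: a malformed response earns nothing, whatever its answer.
    if not format_valid:
        return 0.0
    reward = FORMAT_POINTS
    if final_answer is not None and normalize_answer(final_answer) == normalize_answer(gt_answer):
        reward += ANSWER_POINTS
    return reward
```

Under these assumptions, sketch_reward(True, " 3 / 4", "3/4") returns 10.0, while a well-formed response with a different answer returns 1.0.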

Usage

Use this function as a reward function in the RLVRRewardModel when training models on math competition or reasoning tasks where answers can be verified against known ground truth.

Code Reference

Source Location

Signature

def math_competition_reward_fn(input_ids, attention_mask, **kwargs):

Import

from coati.utils.reward_score.competition import math_competition_reward_fn

I/O Contract

Inputs

Name | Type | Required | Description
input_ids | torch.Tensor | Yes | Token IDs for a single sequence
attention_mask | torch.Tensor | Yes | Attention mask for the sequence
gt_answer | str | Yes (via kwargs) | Ground truth answer string; None results in zero reward
tokenizer | PreTrainedTokenizer | Yes (via kwargs) | Tokenizer for decoding token IDs
response_start | int | Yes (via kwargs) | Start index of the response in the token sequence
response_end | int | Yes (via kwargs) | End index of the response in the token sequence
tags | Dict | Yes (via kwargs) | Tag configuration for response structure validation
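To illustrate how these inputs fit together, the sketch below decodes only the response span of a token sequence and pulls the answer out of its tags. Everything in it is hypothetical: ToyTokenizer stands in for a real PreTrainedTokenizer, extract_answer is a simplified regex substitute for extract_solution, and treating response_start and response_end as a half-open slice is an assumption, not something the page confirms.

```python
import re
from typing import List, Optional


class ToyTokenizer:
    """Hypothetical whitespace tokenizer used only for this illustration."""

    def __init__(self, vocab: List[str]):
        self.vocab = vocab

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


vocab = ["What", "is", "2+2?", "<think>", "add", "</think>", "<answer>", "4", "</answer>"]
tokenizer = ToyTokenizer(vocab)

input_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # prompt tokens followed by response tokens
response_start, response_end = 3, 9       # assumed half-open span [start, end)

# Only the response span is decoded, so the prompt never reaches the parser.
response_text = tokenizer.decode(input_ids[response_start:response_end])
final_answer = extract_answer(response_text)
```

Restricting the decode to the response span keeps any tags that appear in the prompt from being mistaken for the model's answer.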

Outputs

Name | Type | Description
reward | torch.Tensor | Scalar reward: 0.0 (no ground truth answer or invalid format), 1.0 (valid format, answer does not match), or 10.0 (valid format and matching answer)

Usage Examples

from coati.utils.reward_score.competition import math_competition_reward_fn
from coati.models.rlvr_reward_model import RLVRRewardModel

# Use as part of an RLVR reward model.
# `tokenizer` is assumed to be an already-loaded PreTrainedTokenizer.
reward_model = RLVRRewardModel(
    reward_fn_list=[math_competition_reward_fn],
    tokenizer=tokenizer,
    tags={
        "think_start": {"text": "<think>", "num_occur": 1},
        "think_end": {"text": "</think>", "num_occur": 1},
        "answer_start": {"text": "<answer>", "num_occur": 1},
        "answer_end": {"text": "</answer>", "num_occur": 1},
    },
)

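# Compute rewards for a batch.
# `input_ids`, `attention_mask`, `starts`, `ends`, and `ground_truths` are
# assumed to be prepared by the caller, with one entry per sequence.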
rewards = reward_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    response_start=starts,
    response_end=ends,
    gt_answer=ground_truths,
)
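The tags dictionary in the example pairs each tag's text with the number of times it should occur. A minimal sketch of what such a check might look like follows. sketch_validate_structure is a hypothetical function, not the library's validate_response_structure, and the ordering rule (tags must appear in the order they are listed) is an assumption.

```python
from typing import Dict


def sketch_validate_structure(text: str, tags: Dict[str, Dict]) -> bool:
    """Hypothetical check: each tag occurs num_occur times, in listed order."""
    positions = []
    for spec in tags.values():
        # Reject the response if a tag is missing or repeated.
        if text.count(spec["text"]) != spec["num_occur"]:
            return False
        positions.append(text.find(spec["text"]))
    # Dicts preserve insertion order, so first positions should be increasing.
    return positions == sorted(positions)


tags = {
    "think_start": {"text": "<think>", "num_occur": 1},
    "think_end": {"text": "</think>", "num_occur": 1},
    "answer_start": {"text": "<answer>", "num_occur": 1},
    "answer_end": {"text": "</answer>", "num_occur": 1},
}

well_formed = sketch_validate_structure("<think>2+2</think><answer>4</answer>", tags)
missing_think = sketch_validate_structure("<answer>4</answer>", tags)
```

With this configuration, a response earns the format point only if it contains exactly one reasoning block followed by exactly one answer block.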
