Implementation:Hpcaitech ColossalAI RewardModel

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Reinforcement Learning, RLHF, Reward Modeling
Last Updated	2026-02-09 00:00 GMT

Overview

Reward model for RLHF training that produces a scalar reward score for each sequence in a batch.

Description

The RewardModel class extends BaseModel with a linear value head that maps the hidden state at the last non-padding token of each sequence to a scalar reward score. The value head weights are initialized from a normal distribution with mean 0 and standard deviation 1/(hidden_size + 1). The forward method derives the position of the last attended token in each sequence from the attention mask, extracts the hidden state at that position, and passes it through the value head to produce reward values with shape (B,). The model scores complete sequences both when the reward model itself is trained and when it supplies rewards during PPO training.
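The selection and scoring steps described above can be sketched without any dependencies. This is a minimal illustration, not the repository's code: plain Python lists stand in for tensors, the function names (`last_token_indices`, `init_value_head`, `score_sequences`) are hypothetical, and starting the bias at zero is a simplifying assumption.

```python
import random

def last_token_indices(attention_mask):
    """Return, per row, the largest position whose mask value is 1.

    Multiplying each mask value by its position and taking the maximum
    works for right-padded and left-padded rows alike.
    """
    return [max(m * pos for pos, m in enumerate(row)) for row in attention_mask]

def init_value_head(hidden_size, seed=0):
    """Draw weights from a normal distribution with std = 1 / (hidden_size + 1).

    Assumption: the bias starts at 0.0 here purely to keep the sketch simple.
    """
    rng = random.Random(seed)
    std = 1.0 / (hidden_size + 1)
    weights = [rng.gauss(0.0, std) for _ in range(hidden_size)]
    return weights, 0.0

def score_sequences(hidden_states, attention_mask, weights, bias):
    """Map B x S x H nested lists to a list of B scalar rewards."""
    rewards = []
    for row, idx in zip(hidden_states, last_token_indices(attention_mask)):
        last_hidden = row[idx]  # hidden state at the last non-padding token
        rewards.append(sum(w * x for w, x in zip(weights, last_hidden)) + bias)
    return rewards

mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(last_token_indices(mask))  # [2, 1]
```

Because only one hidden state per sequence reaches the value head, the output has one entry per sequence, matching the (B,) shape stated above.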

Usage

Use this model to train a reward model from human preference data or to provide reward signals during PPO-based RLHF training in the ColossalChat pipeline.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/models/reward_model.py
Lines: 1-47

Signature

class RewardModel(BaseModel):
    def __init__(self, pretrained: str = None, config: Optional[PretrainedConfig] = None, **kwargs) -> None:

    def forward(
        self, input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None, **kwargs
    ) -> torch.Tensor:

    def get_input_embeddings(self):

    def get_output_embeddings(self):

Import

from coati.models.reward_model import RewardModel

I/O Contract

Inputs (forward)

Name	Type	Required	Description
input_ids	torch.LongTensor	Yes	Input token IDs of shape (B, S)
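attention_mask	torch.Tensor	No	Attention mask of shape (B, S) used to find the last non-padding token. Optional in the signature, but because sequence lengths are computed from it, it should be supplied in practice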

Outputs (forward)

Name	Type	Description
values	torch.Tensor	Scalar reward values of shape (B,), one per sequence

Usage Examples

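import torch
from coati.models.reward_model import RewardModel

# Initialize the reward model from a pretrained backbone and switch to inference mode
reward_model = RewardModel(pretrained="meta-llama/Llama-2-7b-hf")
reward_model = reward_model.cuda().eval()

# Score a batch of 4 random sequences of length 256 (no padding, so the mask is all ones)
input_ids = torch.randint(0, 32000, (4, 256)).cuda()
attention_mask = torch.ones(4, 256, dtype=torch.long).cuda()  # integer mask, since it is used to derive token positions
with torch.no_grad():  # gradients are not needed for scoring
    rewards = reward_model(input_ids, attention_mask=attention_mask)
print(rewards.shape)  # torch.Size([4])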
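The example above uses an all-ones mask, so every sequence ends at the final position. For variable-length inputs, the mask has to mark the padding. The following is a minimal sketch of building a right-padded batch and its mask; the helper name `pad_batch` and the default `pad_token_id=0` are assumptions for illustration, and in practice a tokenizer would normally produce both outputs.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad token ID lists to a common length and build the matching mask.

    Returns (input_ids, attention_mask) as nested lists of shape (B, S),
    where the mask is 1 for real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Both lists can be converted with `torch.tensor(..., dtype=torch.long)` and moved to the model's device before being passed to `forward`.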

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
