Implementation:Hpcaitech ColossalAI RewardModel

From Leeroopedia
Revision as of 15:09, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hpcaitech_ColossalAI_RewardModel.md)


Knowledge Sources
Domains: Reinforcement Learning, RLHF, Reward Modeling
Last Updated: 2026-02-09 00:00 GMT

Overview

Reward model for RLHF training that produces a scalar reward score for each sequence in a batch.

Description

The RewardModel class extends BaseModel with a linear value head that maps the hidden state at the last non-padding token of each sequence to a scalar reward score. The value head weights are initialized from a normal distribution scaled by 1/(hidden_size + 1). The forward method derives each sequence's length from the attention mask, selects the hidden state at that position, and applies the value head to produce rewards of shape (B,), where B is the batch size. The model scores complete sequences both during reward model pre-training and during PPO training.
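The last-token selection described above can be sketched in plain PyTorch. This is a minimal illustration, not the coati source: `TinyRewardHead` is a hypothetical stand-in that takes precomputed hidden states instead of wrapping a pretrained backbone, and it assumes the 1/(hidden_size + 1) scale is the standard deviation of the normal initialization.

```python
import torch
import torch.nn as nn


class TinyRewardHead(nn.Module):
    """Hypothetical stand-in for the value head (not the coati class)."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)
        # Assumption: the 1/(hidden_size + 1) scale is used as the std of the init.
        self.value_head.weight.data.normal_(mean=0.0, std=1.0 / (hidden_size + 1))

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, _ = hidden_states.shape
        positions = torch.arange(seq_len, device=hidden_states.device)
        # Largest position whose mask entry is 1, i.e. the last non-padding token.
        last_index = (attention_mask.long() * positions).max(dim=1).values
        batch_index = torch.arange(batch_size, device=hidden_states.device)
        last_hidden = hidden_states[batch_index, last_index]  # (B, H)
        return self.value_head(last_hidden).squeeze(-1)  # (B,)


# Two sequences of length 5; the first has two padding positions at the end.
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
head = TinyRewardHead(hidden_size=8)
rewards = head(hidden, mask)
```

Because only the hidden state at the last unmasked position feeds the value head, the contents of padded positions do not affect the reward. A row whose mask is all zeros falls back to position 0 in this sketch.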

Usage

Use this model to train a reward model from human preference data or to provide reward signals during PPO-based RLHF training in the ColossalChat pipeline.
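For training on preference data, a common objective is a pairwise ranking loss over the scalar rewards of a preferred ("chosen") and a dispreferred ("rejected") response. The sketch below illustrates that general idea only. This page does not state which loss ColossalChat uses, so the function name `pairwise_preference_loss` and the log-sigmoid formulation are assumptions.

```python
import torch
import torch.nn.functional as F


def pairwise_preference_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mean of -log(sigmoid(r_chosen - r_rejected)) over the batch."""
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Rewards of shape (B,), as produced by a reward model for each side of a pair.
chosen = torch.tensor([2.0, 0.5, 1.0])
rejected = torch.tensor([0.0, 0.5, 3.0])
loss = pairwise_preference_loss(chosen, rejected)
```

The loss shrinks as the chosen response is scored further above the rejected one, and it equals log 2 when both receive the same reward.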

Code Reference

Source Location

Signature

class RewardModel(BaseModel):
    def __init__(self, pretrained: str = None, config: Optional[PretrainedConfig] = None, **kwargs) -> None:

    def forward(
        self, input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None, **kwargs
    ) -> torch.Tensor:

    def get_input_embeddings(self):

    def get_output_embeddings(self):

Import

from coati.models.reward_model import RewardModel

I/O Contract

Inputs (forward)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input_ids | torch.LongTensor | Yes | Input token IDs of shape (B, S) |
| attention_mask | torch.Tensor | No | Attention mask of shape (B, S) used to find the last non-padding token |

Outputs (forward)

| Name | Type | Description |
|------|------|-------------|
| values | torch.Tensor | Scalar reward values of shape (B,), one per sequence |
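To make the (B, S) input shapes concrete, here is a small sketch that right-pads variable-length token lists and builds the matching attention mask. The helper `pad_batch` and the default `pad_token_id=0` are illustrative assumptions. In practice a tokenizer would typically produce both tensors.

```python
import torch


def pad_batch(sequences: list[list[int]], pad_token_id: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: right-pad token lists to a common length S."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = torch.full((len(sequences), max_len), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
    for row, seq in enumerate(sequences):
        input_ids[row, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        attention_mask[row, : len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask


# B = 2 sequences, padded to S = 3.
input_ids, attention_mask = pad_batch([[5, 6, 7], [8, 9]])
```

Both tensors have shape (B, S), and the last 1 in each mask row marks the token whose hidden state is scored.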

Usage Examples

import torch
from coati.models.reward_model import RewardModel

# Initialize the reward model and switch to inference mode
reward_model = RewardModel(pretrained="meta-llama/Llama-2-7b-hf")
reward_model = reward_model.cuda().eval()

# Score a batch of 4 random sequences of length 256 (no padding)
input_ids = torch.randint(0, 32000, (4, 256)).cuda()
attention_mask = torch.ones(4, 256, dtype=torch.long).cuda()  # integer 0/1 mask
with torch.no_grad():  # no gradients needed when only scoring
    rewards = reward_model(input_ids, attention_mask=attention_mask)
print(rewards.shape)  # torch.Size([4])
