Implementation:Hpcaitech ColossalAI RewardModel
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, Reward Modeling |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Reward model for RLHF training that produces a scalar reward score for each sequence in a batch.
Description
The RewardModel class extends BaseModel with a linear value head that maps the hidden state at the last non-padding token to a scalar reward score. The value head weights are drawn from a zero-mean normal distribution with standard deviation 1/(hidden_size + 1). The forward method derives each sequence's last non-padding position from the attention mask, gathers the hidden state at that position, and applies the value head to return reward values of shape (B,). The model scores complete sequences both during reward model training and during PPO training.
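The last-token pooling described above can be sketched in plain PyTorch without loading a backbone. This is a minimal illustration, not the library's code: the random `last_hidden_states` tensor stands in for the backbone output, the mask is assumed to be right-padded, and 1/(hidden_size + 1) is assumed to be the standard deviation of the weight initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, hidden_size = 2, 5, 8

# Stand-in for the backbone's last hidden states, shape (B, S, H).
last_hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Right-padded attention mask: 1 marks real tokens, 0 marks padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Value head with normally distributed weights; the std value is an
# assumption based on the description above.
value_head = nn.Linear(hidden_size, 1)
value_head.weight.data.normal_(mean=0.0, std=1 / (hidden_size + 1))

# Index of the last position where the mask is 1, one per sequence.
positions = torch.arange(seq_len)
last_token_idx = (attention_mask * positions).max(dim=1).values  # (B,)

# Gather one hidden state per sequence, then project it to a scalar.
pooled = last_hidden_states[torch.arange(batch_size), last_token_idx]  # (B, H)
rewards = value_head(pooled).squeeze(-1)  # (B,)

print(last_token_idx.tolist())  # [2, 4]
print(rewards.shape)            # torch.Size([2])
```

Multiplying the mask by the position indices and taking the row-wise maximum yields the last attended position only when padding sits on the right; a row whose mask is all zeros falls back to index 0.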
Usage
Use this model to train a reward model from human preference data or to provide reward signals during PPO-based RLHF training in the ColossalChat pipeline.
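Training on human preference data usually means comparing the scores of a preferred and a rejected response to the same prompt. The sketch below shows a common pairwise (Bradley-Terry style) log-sigmoid loss over such scores. It is an assumption for illustration: the function name `pairwise_preference_loss` and the example scores are hypothetical, and the source does not confirm that the ColossalChat pipeline uses exactly this loss.

```python
import torch
import torch.nn.functional as F


def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Hypothetical scores of shape (B,), as a reward model would produce them.
chosen = torch.tensor([1.5, 0.2, -0.3])
rejected = torch.tensor([0.5, 0.4, -1.0])

loss = pairwise_preference_loss(chosen, rejected)

# Fraction of pairs where the preferred response scores higher.
accuracy = (chosen > rejected).float().mean()

print(loss.item(), accuracy.item())
```

The loss shrinks as the preferred response is scored further above the rejected one, and equals log 2 when the two scores are identical.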
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/models/reward_model.py
- Lines: 1-47
Signature
class RewardModel(BaseModel):
    def __init__(self, pretrained: str = None, config: Optional[PretrainedConfig] = None, **kwargs) -> None:
    def forward(
        self, input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None, **kwargs
    ) -> torch.Tensor:
    def get_input_embeddings(self):
    def get_output_embeddings(self):
Import
from coati.models.reward_model import RewardModel
I/O Contract
Inputs (forward)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | Yes | Input token IDs of shape (B, S) |
| attention_mask | torch.Tensor | No | Attention mask of shape (B, S) used to find the last non-padding token |
Outputs (forward)
| Name | Type | Description |
|---|---|---|
| values | torch.Tensor | Scalar reward values of shape (B,), one per sequence |
Usage Examples
import torch
from coati.models.reward_model import RewardModel

# Load the backbone with its value head and switch to eval mode for scoring
reward_model = RewardModel(pretrained="meta-llama/Llama-2-7b-hf")
reward_model = reward_model.cuda().eval()

# Score a batch of 4 unpadded sequences of 256 random token IDs
input_ids = torch.randint(0, 32000, (4, 256)).cuda()
attention_mask = torch.ones(4, 256, dtype=torch.long).cuda()
with torch.no_grad():  # gradients are not needed for inference
    rewards = reward_model(input_ids, attention_mask=attention_mask)
print(rewards.shape)  # torch.Size([4])