Implementation:Hpcaitech ColossalAI Model Utils
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, Model Utilities |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Model utility functions for ColossalChat covering reward computation, log probability calculation, masking, dropout control, and JSON I/O.
Description
This module collects the utility functions used throughout the ColossalChat training pipeline:
- compute_reward calculates per-token rewards by combining the extrinsic reward (clipped to a configurable range) with a KL divergence penalty between the policy and the reference model.
- calc_action_log_probs and calc_masked_log_probs compute log probabilities from model logits, for the action tokens or for masked positions respectively.
- masked_mean computes the mean of a tensor along a dimension while ignoring masked positions.
- get_model_numel counts model parameters.
- disable_dropout disables dropout for PPO training.
- repad_to_left converts right-padded sequences to left-padded ones.
- load_json and save_json handle JSON file I/O.
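To make the masking behaviour concrete, here is a minimal sketch of a masked mean in PyTorch. The function name `masked_mean_sketch` and the clamp used to avoid division by zero are assumptions of this example, not the coati implementation.

```python
import torch


def masked_mean_sketch(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:
    """Mean over `dim`, counting only positions where mask == 1 (illustrative sketch)."""
    mask = mask.to(tensor.dtype)
    total = (tensor * mask).sum(dim=dim)
    # Assumption: guard against rows whose mask is all zeros.
    count = mask.sum(dim=dim).clamp(min=1e-8)
    return total / count


values = torch.tensor([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
# Row 0 averages only its first two entries; row 1 averages all four.
means = masked_mean_sketch(values, mask)
```

Dividing by the mask sum rather than the sequence length keeps padded positions from diluting the average.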
Usage
Use these utilities during PPO training for reward shaping, advantage computation, and log probability extraction. They are integral to the ColossalChat RLHF training loop.
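The re-padding step mentioned in the description can be sketched as follows. This is a hedged illustration only: `repad_to_left_sketch` takes a plain `pad_token_id` integer instead of a tokenizer, assumes the pad id never occurs inside real content, and trims the batch to the longest unpadded row. None of these choices are confirmed by this page.

```python
import torch
import torch.nn.functional as F


def repad_to_left_sketch(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Move padding from the right of each row to the left (illustrative sketch)."""
    # Assumption: every occurrence of pad_token_id is padding, so it can be dropped.
    rows = [row[row != pad_token_id] for row in input_ids]
    width = max(r.size(0) for r in rows)
    # F.pad takes (left, right) amounts for the last dimension.
    padded = [F.pad(r, (width - r.size(0), 0), value=pad_token_id) for r in rows]
    return torch.stack(padded)


right_padded = torch.tensor([[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]])
left_padded = repad_to_left_sketch(right_padded, pad_token_id=0)
```

Left padding is commonly preferred for batched generation with decoder-only models, because the last position of every row is then a real token.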
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/models/utils.py
- Lines: 1-158
Signature
def get_model_numel(model: torch.nn.Module) -> int:

def compute_reward(
    r: Union[torch.Tensor, float],
    kl_coef: float,
    log_probs: torch.Tensor,
    log_probs_base: torch.Tensor,
    action_mask: Optional[torch.Tensor] = None,
    reward_eps=5,
) -> torch.Tensor:

def calc_action_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, num_actions: int
) -> torch.Tensor:

def masked_mean(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:

def calc_masked_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, mask: torch.Tensor,
    length_normalization: bool = False
) -> torch.Tensor:

def load_json(file_path: Union[str, os.PathLike]) -> Dict[str, Any]:

def save_json(data: Dict[str, Any], file_path: Union[str, os.PathLike]) -> None:

def disable_dropout(model: torch.nn.Module):

def repad_to_left(tensor, tokenizer):
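As an illustration of what extracting action log probabilities from logits typically involves, the sketch below shifts logits against the token sequence, gathers the log probability of each realised token, and keeps the last `num_actions` positions. The name `action_log_probs_sketch` and the exact shifting convention are assumptions based on standard causal language modelling practice, not a copy of the coati code.

```python
import torch
import torch.nn.functional as F


def action_log_probs_sketch(
    logits: torch.Tensor, sequences: torch.LongTensor, num_actions: int
) -> torch.Tensor:
    """Log probability of each generated token (illustrative sketch)."""
    # Assumption: logits[:, t] predicts token t + 1, so drop the last logit
    # and the first token to align predictions with their targets.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = sequences[:, 1:].unsqueeze(-1)
    per_token = log_probs.gather(dim=-1, index=targets).squeeze(-1)
    # Keep only the trailing positions that correspond to the response.
    return per_token[:, -num_actions:]


torch.manual_seed(0)
batch, seq_len, vocab = 2, 6, 11
logits = torch.randn(batch, seq_len, vocab)
sequences = torch.randint(0, vocab, (batch, seq_len))
out = action_log_probs_sketch(logits, sequences, num_actions=3)  # shape [2, 3]
```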
Import
from coati.models.utils import (
    compute_reward, calc_action_log_probs, masked_mean,
    calc_masked_log_probs, get_model_numel, disable_dropout,
    load_json, save_json, repad_to_left,
)
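The two JSON helpers in this import list follow the usual read/write pattern. A minimal standard-library sketch is shown below; the `_sketch` names, the UTF-8 encoding, and the `indent=4` setting are assumptions of this example rather than documented behaviour.

```python
import json
import os
import tempfile
from typing import Any, Dict, Union


def save_json_sketch(data: Dict[str, Any], file_path: Union[str, os.PathLike]) -> None:
    """Write a dictionary to disk as JSON (illustrative sketch)."""
    with open(file_path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False, indent=4)


def load_json_sketch(file_path: Union[str, os.PathLike]) -> Dict[str, Any]:
    """Read a JSON file back into a dictionary (illustrative sketch)."""
    with open(file_path, "r", encoding="utf-8") as fp:
        return json.load(fp)


# Round trip through a temporary directory; the file name is hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "run_config.json")
    save_json_sketch({"kl_coef": 0.1, "reward_eps": 5}, path)
    restored = load_json_sketch(path)
```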
I/O Contract
Inputs (compute_reward)
| Name | Type | Required | Description |
|---|---|---|---|
| r | Union[torch.Tensor, float] | Yes | Extrinsic reward signal |
| kl_coef | float | Yes | KL penalty coefficient |
| log_probs | torch.Tensor | Yes | Log probabilities from the policy model, shape [batch_size, response_length] |
| log_probs_base | torch.Tensor | Yes | Log probabilities from the reference model, shape [batch_size, response_length] |
| action_mask | torch.Tensor | No | Mask for valid actions, shape [batch_size, response_length] |
| reward_eps | float | No | Clipping bound for the extrinsic reward r; defaults to 5 (the signature leaves this parameter unannotated) |
Outputs (compute_reward)
| Name | Type | Description |
|---|---|---|
| reward | torch.Tensor | Per-token rewards with KL penalty, shape [batch_size, response_length] |
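| kl | torch.Tensor | Approximate KL divergence per token |

Note: the signature annotates a single torch.Tensor return value, while this table and the usage example treat the result as a (reward, kl) pair.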
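The contract above can be illustrated with a small worked example. This is a sketch under stated assumptions, not the coati implementation: the name `compute_reward_sketch` is hypothetical, the clipping range is assumed to be [-reward_eps, reward_eps], the clipped extrinsic reward is assumed to be added to every unmasked token, and the second return value is taken to be the masked per-token log ratio.

```python
import torch


def compute_reward_sketch(r, kl_coef, log_probs, log_probs_base, action_mask, reward_eps=5.0):
    """KL-penalised per-token reward (illustrative sketch, see assumptions above)."""
    mask = action_mask.to(log_probs.dtype)
    log_ratio = log_probs - log_probs_base
    kl_penalty = -kl_coef * log_ratio
    # Assumption: symmetric clipping of the extrinsic reward.
    r_clip = torch.clamp(torch.as_tensor(r, dtype=log_probs.dtype), -reward_eps, reward_eps)
    # Assumption: the clipped reward is broadcast to every valid response token.
    reward = (kl_penalty + r_clip.unsqueeze(-1)) * mask
    approx_kl = log_ratio * mask
    return reward, approx_kl


log_probs = torch.tensor([[-1.0, -2.0, -3.0]])
log_probs_base = torch.tensor([[-1.5, -2.0, -2.0]])
action_mask = torch.tensor([[1, 1, 0]])
reward, kl = compute_reward_sketch(
    r=torch.tensor([7.0]),  # exceeds reward_eps, so it is clipped to 5.0
    kl_coef=0.1,
    log_probs=log_probs,
    log_probs_base=log_probs_base,
    action_mask=action_mask,
)
# reward is approximately [[4.95, 5.0, 0.0]]; kl is [[0.5, 0.0, 0.0]]
```

In the first position the policy assigns a higher log probability than the reference (log ratio 0.5), so the penalty lowers the reward from 5.0 to 4.95. The masked third position contributes nothing.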
Usage Examples
from coati.models.utils import compute_reward, calc_action_log_probs, disable_dropout

# Compute reward with KL penalty.
# reward_scores, policy_log_probs, ref_log_probs, and action_mask are tensors
# produced earlier in the PPO loop.
reward, kl = compute_reward(
    r=reward_scores,
    kl_coef=0.1,
    log_probs=policy_log_probs,
    log_probs_base=ref_log_probs,
    action_mask=action_mask,
)

# Calculate action log probabilities for the last 128 tokens of each sequence
log_probs = calc_action_log_probs(logits, sequences, num_actions=128)

# Disable dropout for PPO training
disable_dropout(model)
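For readers who want to see what the last call amounts to, the sketch below shows one common way to switch off dropout and count parameters in plain PyTorch. The helper names are hypothetical, and setting `p = 0.0` on each `nn.Dropout` module is an assumption about the approach, not a statement of how coati does it.

```python
import torch.nn as nn


def disable_dropout_sketch(model: nn.Module) -> None:
    """Set the drop probability of every nn.Dropout submodule to zero (sketch)."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0


def count_parameters_sketch(model: nn.Module) -> int:
    """Total number of elements across all parameters (sketch)."""
    return sum(p.numel() for p in model.parameters())


# Toy model used only for illustration.
toy_model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.1), nn.Linear(8, 2))
disable_dropout_sketch(toy_model)
n_params = count_parameters_sketch(toy_model)  # 4*8 + 8 + 8*2 + 2 = 58
```

Zeroing the probability, unlike calling `model.eval()`, leaves other train-mode behaviour untouched, which matters when the policy must stay in training mode while producing deterministic log probabilities.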