Implementation:Hpcaitech ColossalAI Model Utils

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Reinforcement Learning, RLHF, Model Utilities
Last Updated	2026-02-09 00:00 GMT

Overview

Model utility functions for ColossalChat covering reward computation, log probability calculation, masking, dropout control, and JSON I/O.

Description

This module collects the utility functions shared across the ColossalChat training pipeline. compute_reward calculates per-token rewards by combining the extrinsic reward, clipped to a configurable range (reward_eps), with a KL divergence penalty between the policy and the reference model, scaled by kl_coef. calc_action_log_probs and calc_masked_log_probs compute log probabilities from model logits, for the action (response) tokens or for masked positions respectively. masked_mean averages a tensor along a dimension while ignoring masked-out positions. The remaining utilities are get_model_numel, which counts model parameters; disable_dropout, which turns off dropout for PPO training; repad_to_left, which converts right-padded sequences to left-padded ones; and load_json and save_json for JSON file I/O.
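To make the log probability and masking steps concrete, the following is a minimal sketch, not the module's actual code. The names log_probs_from_logits, action_log_probs_sketch, and masked_mean_sketch are hypothetical. The one-position shift between logits and tokens and the small epsilon in the denominator are assumptions about a typical causal language model setup.

```python
import torch
import torch.nn.functional as F


def log_probs_from_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Log probability of each label token under the given logits (hypothetical helper)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out the log probability assigned to each label token.
    return log_probs.gather(dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)


def action_log_probs_sketch(
    logits: torch.Tensor, sequences: torch.Tensor, num_actions: int
) -> torch.Tensor:
    # Assumption: logits at position t predict token t + 1, so shift by one.
    per_token = log_probs_from_logits(logits[:, :-1, :], sequences[:, 1:])
    # Keep only the trailing num_actions positions (the generated response).
    return per_token[:, -num_actions:]


def masked_mean_sketch(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:
    mask = mask.to(tensor.dtype)
    # Assumption: a small epsilon guards against rows with no valid position.
    return (tensor * mask).sum(dim=dim) / (mask.sum(dim=dim) + 1e-8)


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 6, 10)            # [batch, seq_len, vocab]
    sequences = torch.randint(0, 10, (2, 6))  # [batch, seq_len]
    print(action_log_probs_sketch(logits, sequences, num_actions=3).shape)  # torch.Size([2, 3])

    values = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    mask = torch.tensor([[1, 1, 0], [0, 1, 1]])
    print(masked_mean_sketch(values, mask))   # approximately [1.5, 5.5]
```

Each masked mean divides by the number of valid positions rather than the full length, so padding does not dilute the result.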

Usage

Use these utilities during PPO training for reward shaping, advantage computation, and log probability extraction. They are integral to the ColossalChat RLHF training loop.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/models/utils.py
Lines: 1-158

Signature

def get_model_numel(model: torch.nn.Module) -> int:

def compute_reward(
    r: Union[torch.Tensor, float],
    kl_coef: float,
    log_probs: torch.Tensor,
    log_probs_base: torch.Tensor,
    action_mask: Optional[torch.Tensor] = None,
    reward_eps=5,
) -> torch.Tensor:

def calc_action_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, num_actions: int
) -> torch.Tensor:

def masked_mean(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:

def calc_masked_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, mask: torch.Tensor,
    length_normalization: bool = False
) -> torch.Tensor:

def load_json(file_path: Union[str, os.PathLike]) -> Dict[str, Any]:

def save_json(data: Dict[str, Any], file_path: Union[str, os.PathLike]) -> None:

def disable_dropout(model: torch.nn.Module):

def repad_to_left(tensor, tokenizer):
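For orientation, the two model-level helpers can be approximated in a few lines. This is a sketch under assumptions rather than the repository's code: count_parameters and zero_dropout are hypothetical stand-ins for get_model_numel and disable_dropout, and the sketch assumes dropout is implemented with torch.nn.Dropout modules.

```python
import torch


def count_parameters(model: torch.nn.Module) -> int:
    """Total number of parameter elements in the model (hypothetical stand-in)."""
    return sum(p.numel() for p in model.parameters())


def zero_dropout(model: torch.nn.Module) -> None:
    """Set the drop probability of every nn.Dropout submodule to zero (hypothetical stand-in)."""
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0.0


if __name__ == "__main__":
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 3),   # 4 * 3 weights + 3 biases = 15
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(3, 1),   # 3 * 1 weights + 1 bias = 4
    )
    print(count_parameters(model))  # 19
    zero_dropout(model)
    print(model[1].p)               # 0.0
```

Setting p to zero keeps the module in place, so the model structure and state dict are unchanged while sampling becomes deterministic with respect to dropout.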

Import

from coati.models.utils import (
    compute_reward, calc_action_log_probs, masked_mean,
    calc_masked_log_probs, get_model_numel, disable_dropout,
    load_json, save_json, repad_to_left,
)
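Of the imported names, load_json and save_json are thin wrappers around file I/O. A minimal sketch of equivalent helpers, using only the standard library, might look like the following. The names load_json_sketch and save_json_sketch are hypothetical, and the UTF-8 encoding and indentation settings are assumptions.

```python
import json
import os
from typing import Any, Dict, Union


def load_json_sketch(file_path: Union[str, os.PathLike]) -> Dict[str, Any]:
    """Read a JSON file into a dictionary (hypothetical stand-in for load_json)."""
    with open(file_path, mode="r", encoding="utf-8") as fp:
        return json.load(fp)


def save_json_sketch(data: Dict[str, Any], file_path: Union[str, os.PathLike]) -> None:
    """Write a dictionary to a JSON file (hypothetical stand-in for save_json)."""
    with open(file_path, mode="w", encoding="utf-8") as fp:
        # Assumption: keep non-ASCII characters readable and indent for diffs.
        json.dump(data, fp, ensure_ascii=False, indent=4)
```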

I/O Contract

Inputs (compute_reward)

Name	Type	Required	Description
r	Union[torch.Tensor, float]	Yes	Extrinsic reward signal
kl_coef	float	Yes	KL penalty coefficient
log_probs	torch.Tensor	Yes	Log probabilities from the policy model, shape [batch_size, response_length]
log_probs_base	torch.Tensor	Yes	Log probabilities from the reference model, shape [batch_size, response_length]
action_mask	torch.Tensor	No	Mask for valid actions, shape [batch_size, response_length]
reward_eps	float	No	Clipping bound for the extrinsic reward, which is clipped to [-reward_eps, reward_eps]; defaults to 5

Outputs (compute_reward)

Name	Type	Description
reward	torch.Tensor	Per-token rewards with KL penalty, shape [batch_size, response_length]
kl	torch.Tensor	Approximate KL divergence per token
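The contract above can be illustrated with a small sketch of KL-penalized reward shaping. It is not the upstream implementation: compute_reward_sketch is a hypothetical name, the log-ratio estimator of the KL term is an assumption, and so is the choice to add the clipped reward to every valid response token and to report the KL as an average over valid tokens.

```python
from typing import Tuple, Union

import torch


def compute_reward_sketch(
    r: Union[torch.Tensor, float],
    kl_coef: float,
    log_probs: torch.Tensor,
    log_probs_base: torch.Tensor,
    action_mask: torch.Tensor,
    reward_eps: float = 5.0,
) -> Tuple[torch.Tensor, torch.Tensor]:
    mask = action_mask.to(log_probs.dtype)
    # Assumption: the per-token log ratio serves as the approximate KL term.
    log_ratio = (log_probs - log_probs_base) * mask
    # Clip the extrinsic reward to [-reward_eps, reward_eps].
    r_clipped = torch.clamp(torch.as_tensor(r, dtype=log_probs.dtype), -reward_eps, reward_eps)
    # Assumption: the clipped reward is broadcast to every valid token,
    # and padded positions receive zero reward.
    reward = (r_clipped.unsqueeze(-1) - kl_coef * log_ratio) * mask
    # Average log ratio over valid tokens, one value per sequence.
    kl = log_ratio.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return reward, kl.detach()


if __name__ == "__main__":
    log_probs = torch.tensor([[-1.0, -2.0, -3.0]])
    log_probs_base = torch.tensor([[-1.5, -2.0, -1.0]])
    action_mask = torch.tensor([[1, 1, 0]])
    reward, kl = compute_reward_sketch(
        torch.tensor([7.0]), 0.1, log_probs, log_probs_base, action_mask
    )
    print(reward)  # approximately [[4.95, 5.0, 0.0]]: 7.0 is clipped to 5.0
    print(kl)      # approximately [0.25]
```

In the example, the first token's log ratio of 0.5 costs 0.1 * 0.5 = 0.05 of reward, the second token matches the reference and keeps the full clipped reward, and the masked third token contributes nothing.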

Usage Examples

from coati.models.utils import compute_reward, calc_action_log_probs, disable_dropout

# Assumes reward_scores, policy_log_probs, ref_log_probs, action_mask, logits,
# sequences, and model are already defined by the surrounding training code.

# Compute per-token rewards with a KL penalty
reward, kl = compute_reward(
    r=reward_scores,
    kl_coef=0.1,
    log_probs=policy_log_probs,
    log_probs_base=ref_log_probs,
    action_mask=action_mask,
)

# Calculate action log probabilities for the num_actions response tokens
log_probs = calc_action_log_probs(logits, sequences, num_actions=128)

# Disable dropout for PPO training
disable_dropout(model)

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
