Implementation:Hpcaitech ColossalAI Model Utils

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, RLHF, Model Utilities
Last Updated: 2026-02-09 00:00 GMT

Overview

Model utility functions for ColossalChat covering reward computation, log probability calculation, masking, dropout control, and JSON I/O.

Description

This module provides utility functions used throughout the ColossalChat training pipeline. compute_reward calculates per-token rewards: it clips the extrinsic reward to a configurable range (reward_eps) and combines it with a KL divergence penalty between the policy and the reference model, scaled by kl_coef. calc_action_log_probs and calc_masked_log_probs compute log probabilities from model logits, for the action tokens of a sequence and for masked positions respectively. masked_mean computes the mean of a tensor along a dimension while ignoring masked positions.

The remaining utilities are get_model_numel, which counts model parameters; disable_dropout, which disables dropout during PPO training; repad_to_left, which converts right-padded sequences to left-padded ones; and load_json and save_json, which handle JSON file I/O.
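The log probability extraction and masked averaging described above can be sketched as follows. This is a minimal illustration, not the coati source: the `sketch_*` names are hypothetical, and the one-position shift, the "last `num_actions` tokens" convention, and the `1e-8` epsilon are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F


def sketch_log_probs_from_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Log probability of each label token under the given logits."""
    log_probs = F.log_softmax(logits, dim=-1)  # [batch, seq, vocab]
    # Pick the log probability of the label at every position.
    return log_probs.gather(dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)  # [batch, seq]


def sketch_action_log_probs(
    logits: torch.Tensor, sequences: torch.Tensor, num_actions: int
) -> torch.Tensor:
    # Logits at position t predict the token at position t + 1, so shift by one.
    per_token = sketch_log_probs_from_logits(logits[:, :-1, :], sequences[:, 1:])
    # Assumption: the actions are the last `num_actions` tokens of each sequence.
    return per_token[:, -num_actions:]


def sketch_masked_mean(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:
    mask = mask.to(tensor.dtype)
    # The epsilon (an assumption of this sketch) avoids division by zero
    # when a row is fully masked.
    return (tensor * mask).sum(dim=dim) / (mask.sum(dim=dim) + 1e-8)
```

With uniform logits, every token gets log probability -log(vocab_size), which makes a convenient sanity check for the gather-and-shift logic.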

Usage

Use these utilities during PPO training for reward shaping, advantage computation, and log probability extraction. They are integral to the ColossalChat RLHF training loop.

Code Reference

Source Location

Signature

def get_model_numel(model: torch.nn.Module) -> int:

def compute_reward(
    r: Union[torch.Tensor, float],
    kl_coef: float,
    log_probs: torch.Tensor,
    log_probs_base: torch.Tensor,
    action_mask: Optional[torch.Tensor] = None,
    reward_eps=5,
) -> torch.Tensor:

def calc_action_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, num_actions: int
) -> torch.Tensor:

def masked_mean(tensor: torch.Tensor, mask: torch.Tensor, dim: int = 1) -> torch.Tensor:

def calc_masked_log_probs(
    logits: torch.Tensor, sequences: torch.LongTensor, mask: torch.Tensor,
    length_normalization: bool = False
) -> torch.Tensor:

def load_json(file_path: Union[str, os.PathLike]) -> Dict[str, Any]:

def save_json(data: Dict[str, Any], file_path: Union[str, os.PathLike]) -> None:

def disable_dropout(model: torch.nn.Module):

def repad_to_left(tensor, tokenizer):
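The three helper signatures without type detail (get_model_numel, disable_dropout, repad_to_left) could plausibly behave like the sketch below. All `sketch_*` functions are hypothetical stand-ins written for illustration. In particular, the repadding sketch takes a `pad_token_id` integer instead of a tokenizer, and assumes the pad token never occurs inside the real content.

```python
import torch
import torch.nn as nn


def sketch_get_model_numel(model: nn.Module) -> int:
    # Total number of elements across all parameters.
    return sum(p.numel() for p in model.parameters())


def sketch_disable_dropout(model: nn.Module) -> None:
    # Setting p = 0 turns each nn.Dropout layer into a no-op, even in train mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0


def sketch_repad_to_left(tensor: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Move the pad tokens of each row to the front, keeping the row length.
    rows = []
    for row in tensor:
        content = row[row != pad_token_id]
        padding = row.new_full((row.numel() - content.numel(),), pad_token_id)
        rows.append(torch.cat([padding, content]))
    return torch.stack(rows)


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(4, 3), nn.Dropout(0.5), nn.Linear(3, 1))
    print(sketch_get_model_numel(model))
    sketch_disable_dropout(model)
    right_padded = torch.tensor([[5, 6, 0, 0], [7, 8, 9, 0]])
    print(sketch_repad_to_left(right_padded, pad_token_id=0))
```

Zeroing the dropout probability, instead of calling `model.eval()`, keeps other train-mode behavior intact, which is one reason a dedicated helper can be useful for PPO.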

Import

from coati.models.utils import (
    compute_reward, calc_action_log_probs, masked_mean,
    calc_masked_log_probs, get_model_numel, disable_dropout,
    load_json, save_json, repad_to_left,
)

I/O Contract

Inputs (compute_reward)

Name | Type | Required | Description
r | Union[torch.Tensor, float] | Yes | Extrinsic reward signal
kl_coef | float | Yes | KL penalty coefficient
log_probs | torch.Tensor | Yes | Log probabilities from the policy model, shape [batch_size, response_length]
log_probs_base | torch.Tensor | Yes | Log probabilities from the reference model, shape [batch_size, response_length]
action_mask | torch.Tensor | No | Mask for valid actions, shape [batch_size, response_length]
reward_eps | float | No | Clipping range for the reward, defaults to 5

Outputs (compute_reward)

Name | Type | Description
reward | torch.Tensor | Per-token rewards with KL penalty, shape [batch_size, response_length]
kl | torch.Tensor | Approximate KL divergence per token
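To make the contract concrete, here is a minimal sketch of a KL-penalized per-token reward that matches the inputs and outputs listed above. It is not the coati implementation. The signature annotates a single torch.Tensor return, while the Outputs table and the usage example show a (reward, kl) pair, so the sketch returns the pair. The sketch also assumes that the clipped reward is added to every valid token and that the KL value is the plain per-token log-ratio; the library may place the reward differently or use another estimator.

```python
from typing import Optional, Tuple, Union

import torch


def sketch_compute_reward(
    r: Union[torch.Tensor, float],
    kl_coef: float,
    log_probs: torch.Tensor,
    log_probs_base: torch.Tensor,
    action_mask: Optional[torch.Tensor] = None,
    reward_eps: float = 5.0,
) -> Tuple[torch.Tensor, torch.Tensor]:
    if action_mask is None:
        action_mask = torch.ones_like(log_probs)
    mask = action_mask.to(log_probs.dtype)

    # Per-token log-ratio between the policy and the reference model.
    log_ratio = log_probs - log_probs_base
    kl = log_ratio * mask  # naive estimator, an assumption of this sketch

    # Clip the extrinsic reward to [-reward_eps, reward_eps].
    r = torch.as_tensor(r, dtype=log_probs.dtype, device=log_probs.device)
    r_clip = torch.clamp(r, -reward_eps, reward_eps)
    if r_clip.dim() == 0:
        r_clip = r_clip.expand(log_probs.size(0))  # one reward per sequence

    # Assumption: the clipped reward is added to every valid token,
    # and masked positions receive zero reward.
    reward = (-kl_coef * log_ratio + r_clip.unsqueeze(-1)) * mask
    return reward, kl
```

For example, with a log-ratio of 0.5 on a valid token, kl_coef=0.1, and an extrinsic reward of 10 clipped to 5, that token receives 5 - 0.05 = 4.95.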

Usage Examples

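from coati.models.utils import compute_reward, calc_action_log_probs, disable_dropout

# Compute reward with KL penalty.
# reward_scores is the extrinsic reward (tensor or float); policy_log_probs,
# ref_log_probs, and action_mask have shape [batch_size, response_length].
reward, kl = compute_reward(
    r=reward_scores,
    kl_coef=0.1,
    log_probs=policy_log_probs,
    log_probs_base=ref_log_probs,
    action_mask=action_mask,
)

# Calculate action log probabilities from the model logits and token sequences
log_probs = calc_action_log_probs(logits, sequences, num_actions=128)

# Disable dropout for PPO training (model is a torch.nn.Module)
disable_dropout(model)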
