Implementation:Hpcaitech ColossalAI ExperienceMaker Base
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, PPO |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Base classes for the Experience dataclass and the ExperienceMaker abstract interface used in PPO-based RLHF training.
Description
This module defines two core components of the ColossalChat RLHF system. The Experience dataclass holds one batch of PPO experience data: sequences, action log probabilities, values, rewards, KL divergences, advantages, and optional attention and action masks. It provides to_device for moving the batch to another device and pin_memory for pinning it in host memory. The ExperienceMaker abstract base class defines the interface for generating experience data. It holds references to the actor model, critic model, reward model, and initial (reference) model, and declares a single abstract method, make_experience, that subclasses must implement.
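As a rough illustration of the container described above, the sketch below defines a small stand-in dataclass and shows one way device transfer and pinning could be applied field by field. It is not the source of coati.experience_maker.base: the name MiniExperience, the reduced field set, the iteration via dataclasses.fields, and the `return self` in pin_memory are all assumptions made for brevity.

```python
from dataclasses import dataclass, fields
from typing import Optional

import torch


@dataclass
class MiniExperience:
    """Hypothetical stand-in for Experience with a reduced set of fields."""

    sequences: torch.Tensor
    action_log_probs: torch.Tensor
    action_mask: Optional[torch.Tensor] = None

    @torch.no_grad()
    def to_device(self, device: torch.device) -> None:
        # Move every tensor field in place; optional fields left as None are skipped.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                setattr(self, f.name, value.to(device))

    def pin_memory(self) -> "MiniExperience":
        # Only CPU tensors can be pinned, so call this before moving to a GPU.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                setattr(self, f.name, value.pin_memory())
        return self


exp = MiniExperience(
    sequences=torch.zeros(2, 8, dtype=torch.long),
    action_log_probs=torch.zeros(2, 4),
)
exp.to_device(torch.device("cpu"))  # trivial move, shown only for the call shape
```

Iterating over the dataclass fields keeps the two methods in sync with the field list, at the cost of being less explicit than assigning each attribute by name.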
Usage
Use Experience as the standard data container throughout the PPO training pipeline. Subclass ExperienceMaker to implement custom experience generation strategies that coordinate the actor, critic, reward, and reference models.
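The subclassing pattern can be sketched without any model code. The example below uses only the Python standard library and hypothetical stand-ins (ToyExperience, ToyExperienceMaker, EchoExperienceMaker, with plain lists and callables in place of tensors and models), so it shows the shape of the interface rather than the coati classes themselves.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyExperience:
    """Hypothetical stand-in for Experience; plain lists replace tensors."""

    sequences: List[List[int]]
    reward: List[float]


class ToyExperienceMaker(ABC):
    """Mirrors the interface shape: four model references, one abstract method."""

    def __init__(self, actor: Any, critic: Any, reward_model: Any, initial_model: Any) -> None:
        self.actor = actor
        self.critic = critic
        self.reward_model = reward_model
        self.initial_model = initial_model

    @abstractmethod
    def make_experience(
        self, input_ids: List[List[int]], attention_mask: List[List[int]], **generate_kwargs: Any
    ) -> ToyExperience:
        ...


class EchoExperienceMaker(ToyExperienceMaker):
    """Toy strategy: the 'actor' extends each prompt, the 'reward model' scores the result."""

    def make_experience(self, input_ids, attention_mask, **generate_kwargs):
        # attention_mask is accepted to match the signature but unused in this toy.
        sequences = [self.actor(prompt, **generate_kwargs) for prompt in input_ids]
        rewards = [self.reward_model(seq) for seq in sequences]
        return ToyExperience(sequences=sequences, reward=rewards)


# Callables stand in for the real models; critic and initial_model are unused here.
maker = EchoExperienceMaker(
    actor=lambda prompt, max_new_tokens=1: prompt + [0] * max_new_tokens,
    critic=None,
    reward_model=lambda seq: float(len(seq)),
    initial_model=None,
)
batch = maker.make_experience([[5, 6], [7]], [[1, 1], [1]], max_new_tokens=2)
```

Because make_experience is abstract, instantiating the base class directly raises TypeError; only subclasses that implement the method can be constructed.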
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/experience_maker/base.py
- Lines: 1-90
Signature
@dataclass
class Experience:
    sequences: torch.Tensor
    action_log_probs: torch.Tensor
    values: torch.Tensor
    reward: torch.Tensor
    kl: torch.Tensor
    advantages: torch.Tensor
    attention_mask: Optional[torch.LongTensor]
    action_mask: Optional[torch.BoolTensor]

    @torch.no_grad()
    def to_device(self, device: torch.device) -> None:
        ...

    def pin_memory(self):
        ...


class ExperienceMaker(ABC):
    def __init__(
        self, actor: PreTrainedModel, critic: Critic,
        reward_model: RewardModel, initial_model: PreTrainedModel
    ) -> None:
        ...

    @abstractmethod
    def make_experience(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs
    ) -> Experience:
        ...
Import
from coati.experience_maker.base import Experience, ExperienceMaker
I/O Contract
Inputs (ExperienceMaker.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| actor | PreTrainedModel | Yes | The actor (policy) model for generating sequences |
| critic | Critic | Yes | The critic model for value estimation |
| reward_model | RewardModel | Yes | The reward model for computing rewards |
| initial_model | PreTrainedModel | Yes | The reference/initial model for KL divergence computation |
Inputs (make_experience)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Input token IDs (prompts) |
| attention_mask | torch.Tensor | Yes | Attention mask for the input |
| **generate_kwargs | dict | No | Additional generation parameters |
Outputs (make_experience)
| Name | Type | Description |
|---|---|---|
| return | Experience | A batch of experience data with sequences, log probs, values, rewards, KL, advantages, and masks |
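How the returned fields relate to one another is left to subclasses. The sketch below shows one common PPO-style recipe, offered as an assumption rather than a description of any particular coati subclass: a masked per-sequence KL estimate from actor and reference log probabilities, a KL-penalized reward, and an advantage computed as reward minus value. The function name kl_shaped_reward, the coefficient kl_coef, and the tensor shapes are hypothetical.

```python
import torch


def kl_shaped_reward(
    raw_reward: torch.Tensor,        # (B,) scores from the reward model
    action_log_probs: torch.Tensor,  # (B, A) log probs under the actor
    ref_log_probs: torch.Tensor,     # (B, A) log probs under the reference model
    action_mask: torch.Tensor,       # (B, A) True where a token is a real action
    kl_coef: float = 0.1,            # hypothetical penalty weight
):
    # Per-token log-ratio, averaged over the unmasked action positions.
    log_ratio = (action_log_probs - ref_log_probs) * action_mask
    kl = log_ratio.sum(dim=1) / action_mask.sum(dim=1).clamp(min=1)
    # Penalize sequences that drift away from the reference model.
    reward = raw_reward - kl_coef * kl
    return reward, kl


raw_reward = torch.tensor([1.0, 2.0])
action_log_probs = torch.tensor([[-1.0, -1.0], [-2.0, -2.0]])
ref_log_probs = torch.tensor([[-1.5, -1.5], [-2.0, -9.0]])
action_mask = torch.tensor([[True, True], [True, False]])
values = torch.tensor([0.5, 0.5])  # stand-in critic outputs

reward, kl = kl_shaped_reward(raw_reward, action_log_probs, ref_log_probs, action_mask)
advantages = reward - values  # simplest possible advantage estimate
```

The masked position in the second row does not contribute to the KL estimate, which is the reason an action mask travels with the batch.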
Usage Examples
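import torch
from coati.experience_maker.base import Experience, ExperienceMaker

# Placeholder tensors with illustrative shapes; the shapes a trainer expects may differ.
batch_size, seq_len, num_actions = 2, 8, 4

# Experience is used as the data container
experience = Experience(
    sequences=torch.randint(0, 100, (batch_size, seq_len)),
    action_log_probs=torch.zeros(batch_size, num_actions),
    values=torch.zeros(batch_size),
    reward=torch.zeros(batch_size),
    kl=torch.zeros(batch_size),
    advantages=torch.zeros(batch_size),
    attention_mask=torch.ones(batch_size, seq_len, dtype=torch.long),
    action_mask=torch.ones(batch_size, num_actions, dtype=torch.bool),
)

if torch.cuda.is_available():
    # Pin memory first: only CPU tensors can be pinned, and pinning speeds up
    # the host-to-device transfer that follows.
    experience.pin_memory()
    # Move the experience to a specific device
    experience.to_device(torch.device("cuda:0"))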