Implementation:Hpcaitech ColossalAI NaiveExperienceMaker
| Knowledge Sources | |
|---|---|
| Domains | RLHF, PPO, GRPO, Experience_Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
naive.py implements the NaiveExperienceMaker class, which generates rollout experiences for PPO and GRPO training by combining model generation, reward computation, and advantage calculation.
Description
NaiveExperienceMaker extends the ExperienceMaker base class to produce Experience objects for reinforcement learning from human feedback (RLHF). Its make_experience method generates text sequences with the actor model, computes action log probabilities under both the policy and the initial (reference) model, scores the sequences with a reward model, and calculates advantages.

In PPO mode, a critic model estimates values and the calculate_advantage method computes Generalized Advantage Estimation (GAE). In GRPO mode, the class generates multiple responses per prompt, normalizes rewards within each group by the group mean and standard deviation, and uses the resulting group-relative advantage as the training signal, together with KL divergence penalties.

The class also handles left-padding, stop token detection, and conversion between the left-padded format used for generation and the right-padded format used by the reward and critic models. Inference runs in configurable mini-batches to limit GPU memory use.
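The group-wise reward normalization described for GRPO mode can be illustrated with a minimal sketch in plain Python. This is not the code from naive.py: the function name `group_relative_advantages`, the `eps` value, the use of the population standard deviation, and the assumption that the responses for one prompt sit next to each other in the reward list are all illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float],
    num_generation: int,
    eps: float = 1e-4,  # assumed stabilizer; the real value may differ
) -> List[float]:
    """Normalize each reward against the other responses to the same prompt.

    Assumes rewards are laid out as consecutive groups of `num_generation`
    responses per prompt (hypothetical layout for this sketch).
    """
    if num_generation <= 0 or len(rewards) % num_generation != 0:
        raise ValueError("rewards must split evenly into groups of num_generation")

    advantages: List[float] = []
    for start in range(0, len(rewards), num_generation):
        group = rewards[start:start + num_generation]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption of this sketch
        # eps keeps the division finite when every response gets the same reward
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

A response that scores above its group mean gets a positive advantage and one below it gets a negative advantage, so no critic is needed to provide a baseline.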
Usage
Use this class in PPO or GRPO training pipelines to generate experience tuples (sequences, action log probs, values, rewards, KL divergence, advantages, attention masks, action masks) from prompts. It is typically used within a PPO trainer loop that alternates between experience generation and policy optimization.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/experience_maker/naive.py
- Lines: 1-308
Signature
class NaiveExperienceMaker(ExperienceMaker):
    def __init__(
        self,
        actor: PreTrainedModel,
        critic: Critic,
        reward_model: RewardModel,
        initial_model: PreTrainedModel,
        tokenizer: PreTrainedTokenizer,
        kl_coef: float = 0.01,
        gamma: float = 1.0,
        lam: float = 0.95,
        use_grpo: bool = False,
        num_generation: int = 8,
        inference_batch_size: int = None,
        logits_forward_batch_size: int = 2,
    ) -> None
Key Methods
@torch.inference_mode()
def calculate_advantage(
    self,
    value: torch.Tensor,
    reward: torch.Tensor,
    num_actions: int,
) -> torch.Tensor

@torch.no_grad()
def make_experience(
    self,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    gt_answer: Any = None,
    **generate_kwargs,
) -> Experience
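For readers unfamiliar with GAE, the recursion that calculate_advantage is described as computing can be sketched for a single response in plain Python. This is a simplified illustration rather than the tensor code in naive.py: the function name `gae_advantages` is hypothetical, and the sketch assumes the value after the final action is zero.

```python
from typing import List


def gae_advantages(
    values: List[float],
    rewards: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over the actions of one response.

    values[t] is the critic estimate at action t and rewards[t] is the
    per-action reward. The value after the last action is taken to be 0
    (an assumption of this sketch).
    """
    if len(values) != len(rewards):
        raise ValueError("values and rewards must have the same length")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With lam=0 the result reduces to the one-step TD residuals, and with lam=1 (and gamma=1) it becomes the sum of future rewards minus the current value estimate. The defaults gamma=1.0 and lam=0.95 match those in the constructor signature.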
Import
from coati.experience_maker.naive import NaiveExperienceMaker
I/O Contract
Inputs (__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| actor | PreTrainedModel | Yes | The policy model used for text generation |
| critic | Critic | Conditional | The critic model for value estimation (required for PPO, None for GRPO) |
| reward_model | RewardModel | Yes | Model for computing reward scores |
| initial_model | PreTrainedModel | Yes | Reference model for KL divergence computation |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer for decoding and padding |
| kl_coef | float | No | KL divergence penalty coefficient (default: 0.01) |
| gamma | float | No | Discount factor for GAE (default: 1.0) |
| lam | float | No | Lambda for GAE (default: 0.95) |
| use_grpo | bool | No | Use GRPO instead of PPO advantage calculation (default: False) |
| num_generation | int | No | Number of generations per prompt for GRPO (default: 8) |
| inference_batch_size | int | No | Batch size for inference mini-batches (default: None, which processes the full batch at once) |
| logits_forward_batch_size | int | No | Batch size for logits forward passes (default: 2) |
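The effect of the two batch-size parameters can be pictured as splitting a batch into fixed-size chunks. The helper below is a hypothetical sketch in plain Python, not a function from coati; it treats a batch size of None as "use the whole batch", matching the documented default for inference_batch_size.

```python
from typing import Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_mini_batches(
    items: Sequence[T],
    batch_size: Optional[int] = None,
) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    batch_size=None yields the whole sequence as a single chunk.
    """
    if not items:
        return
    if batch_size is None:
        batch_size = len(items)
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Smaller chunks lower peak memory at the cost of more forward passes, which is presumably why the logits forward pass defaults to a small batch size of 2.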
Inputs (make_experience)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Tokenized input prompts [batch_size, seq_len] |
| attention_mask | torch.Tensor | Yes | Attention mask for input prompts |
| gt_answer | Any | No | Ground truth answers for reward computation |
| **generate_kwargs | dict | No | Generation parameters (max_length, stop_token_ids, use_cache, etc.) |
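Because stop_token_ids is passed as a list of token-id sequences (for example `[[tokenizer.eos_token_id]]`), stop detection amounts to finding the earliest position where any of those sequences is complete. The following plain-Python sketch illustrates the idea for one generated response. The function names are hypothetical, and whether the real implementation keeps the stop tokens inside the action mask is an assumption here.

```python
from typing import List, Optional, Sequence


def find_first_stop(
    tokens: Sequence[int],
    stop_token_ids: Sequence[Sequence[int]],
) -> Optional[int]:
    """Return the index just past the earliest completed stop sequence, or None."""
    best: Optional[int] = None
    for stop in stop_token_ids:
        n = len(stop)
        if n == 0:
            continue  # ignore empty stop sequences
        for i in range(len(tokens) - n + 1):
            if list(tokens[i:i + n]) == list(stop):
                end = i + n
                if best is None or end < best:
                    best = end
                break  # only the first match of this stop sequence matters
    return best


def action_mask_from_stop(
    tokens: Sequence[int],
    stop_token_ids: Sequence[Sequence[int]],
) -> List[int]:
    """Mark tokens up to and including the stop sequence with 1, the rest with 0.

    Including the stop tokens in the mask is an assumption of this sketch.
    """
    end = find_first_stop(tokens, stop_token_ids)
    if end is None:
        end = len(tokens)  # no stop found: every generated token counts
    return [1 if i < end else 0 for i in range(len(tokens))]
```

For example, with generated tokens `[5, 6, 2, 0, 0]` and `stop_token_ids=[[2]]`, the mask covers the first three tokens and excludes the trailing padding.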
Outputs (make_experience)
| Name | Type | Description |
|---|---|---|
| Experience | Experience | Named tuple containing sequences, action_log_probs, value, reward, kl, advantages, attention_mask, action_mask (all on CPU) |
Usage Examples
from coati.experience_maker.naive import NaiveExperienceMaker

# actor_model, critic_model, reward_model, ref_model, tokenizer, input_ids,
# and attention_mask are assumed to be loaded and prepared beforehand.

# PPO mode
experience_maker = NaiveExperienceMaker(
    actor=actor_model,
    critic=critic_model,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    kl_coef=0.01,
    gamma=1.0,
    lam=0.95,
    use_grpo=False,
)

experience = experience_maker.make_experience(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=2048,
    stop_token_ids=[[tokenizer.eos_token_id]],
)

# GRPO mode (no critic is needed)
experience_maker_grpo = NaiveExperienceMaker(
    actor=actor_model,
    critic=None,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    use_grpo=True,
    num_generation=8,
)