Implementation:Hpcaitech ColossalAI NaiveExperienceMaker

Knowledge Sources	Hpcaitech_ColossalAI
Domains	RLHF, PPO, GRPO, Experience_Generation
Last Updated	2026-02-09 00:00 GMT

Overview

naive.py implements the NaiveExperienceMaker class, which generates rollout experiences for PPO and GRPO training by combining model generation, reward computation, and advantage calculation.

Description

NaiveExperienceMaker extends the ExperienceMaker base class to produce the Experience objects used in reinforcement learning from human feedback (RLHF). Its make_experience method generates text sequences from the actor model, computes action log probabilities under both the policy and the initial (reference) model, scores the sequences with the reward model, and calculates advantages.

In PPO mode, a critic model estimates values and the calculate_advantage method computes Generalized Advantage Estimation (GAE). In GRPO mode, the class generates multiple responses per prompt, normalizes rewards within each group (subtracting the group mean and dividing by the group standard deviation), and uses the resulting group-relative advantage as the training signal, together with a KL divergence penalty.

The class also handles left-padding, stop-token detection, and conversion between the left-padded format used for generation and the right-padded format used by the reward and critic models. Inference runs in configurable mini-batches to limit GPU memory use.
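The group-wise reward normalization described for GRPO mode can be sketched in plain Python. This is an illustration under stated assumptions, not the code in naive.py: the function name group_relative_advantages is hypothetical, rewards are assumed to be laid out so that the num_generation responses for one prompt are contiguous, and the use of the population standard deviation and the eps value are both assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float], num_generation: int, eps: float = 1e-4
) -> List[float]:
    """Normalize each reward against the other responses to the same prompt.

    Hypothetical helper: assumes rewards[i * num_generation:(i + 1) * num_generation]
    all belong to prompt i.
    """
    if num_generation <= 0 or len(rewards) % num_generation != 0:
        raise ValueError("len(rewards) must be a positive multiple of num_generation")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generation):
        group = rewards[start:start + num_generation]
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population std; eps avoids division by zero
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, four sampled responses each.
example = group_relative_advantages(
    [1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 0.0, 2.0], num_generation=4
)
```

When every response in a group receives the same reward, as in the first group of the example, all of its advantages are zero, so that prompt contributes no policy-gradient signal.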

Usage

Use this class in PPO or GRPO training pipelines to generate experience tuples (sequences, action log probs, values, rewards, KL divergence, advantages, attention masks, action masks) from prompts. It is typically used within a PPO trainer loop that alternates between experience generation and policy optimization.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/experience_maker/naive.py
Lines: 1-308

Signature

class NaiveExperienceMaker(ExperienceMaker):
    def __init__(
        self,
        actor: PreTrainedModel,
        critic: Critic,
        reward_model: RewardModel,
        initial_model: PreTrainedModel,
        tokenizer: PreTrainedTokenizer,
        kl_coef: float = 0.01,
        gamma: float = 1.0,
        lam: float = 0.95,
        use_grpo: bool = False,
        num_generation: int = 8,
        inference_batch_size: int = None,
        logits_forward_batch_size: int = 2,
    ) -> None

Key Methods

@torch.inference_mode()
def calculate_advantage(
    self,
    value: torch.Tensor,
    reward: torch.Tensor,
    num_actions: int,
) -> torch.Tensor

@torch.no_grad()
def make_experience(
    self,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    gt_answer: Any = None,
    **generate_kwargs,
) -> Experience
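For intuition about what calculate_advantage computes in PPO mode, the following is a minimal sketch of the standard GAE recursion for a single sequence, written with plain lists rather than tensors. It illustrates the technique only and is not copied from the repository: gae_advantages is a hypothetical name, and treating the value after the final action as zero is an assumption. The recursion is delta_t = r_t + gamma * V_{t+1} - V_t and A_t = delta_t + gamma * lam * A_{t+1}.

```python
from typing import List


def gae_advantages(
    values: List[float],
    rewards: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one sequence of actions.

    Hypothetical helper: values[t] and rewards[t] refer to the t-th action.
    """
    if len(values) != len(rewards):
        raise ValueError("values and rewards must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0  # A_{t+1}, zero past the end of the sequence
    for t in reversed(range(len(rewards))):
        # Assumption: the value after the last action is zero.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A terminal reward of 1.0 with a constant value estimate of 0.5.
example = gae_advantages([0.5, 0.5], [0.0, 1.0], gamma=1.0, lam=0.5)
```

With lam below 1, the advantage of the terminal reward decays as it is propagated back to earlier actions, which trades variance for bias in the estimate.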

Import

from coati.experience_maker.naive import NaiveExperienceMaker

I/O Contract

Inputs (init)

Name	Type	Required	Description
actor	PreTrainedModel	Yes	The policy model used for text generation
critic	Critic	Conditional	The critic model for value estimation (required for PPO, None for GRPO)
reward_model	RewardModel	Yes	Model for computing reward scores
initial_model	PreTrainedModel	Yes	Reference model for KL divergence computation
tokenizer	PreTrainedTokenizer	Yes	Tokenizer for decoding and padding
kl_coef	float	No	KL divergence penalty coefficient (default: 0.01)
gamma	float	No	Discount factor for GAE (default: 1.0)
lam	float	No	Lambda for GAE (default: 0.95)
use_grpo	bool	No	Use GRPO instead of PPO advantage calculation (default: False)
num_generation	int	No	Number of generations per prompt for GRPO (default: 8)
inference_batch_size	int	No	Batch size for inference mini-batches (default: None, which processes the full batch at once)
logits_forward_batch_size	int	No	Batch size for logits forward passes (default: 2)
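The two batch-size parameters bound how many sequences pass through a model at once. A minimal, framework-free sketch of that chunking follows. iter_mini_batches is a hypothetical helper and not a function from coati; the only behavior taken from the table is that an unset batch size means the full batch.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_mini_batches(
    items: Sequence[T], batch_size: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size is None:
        # Per the table above: no batch size means a single full batch.
        batch_size = max(len(items), 1)
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = ["p0", "p1", "p2", "p3", "p4"]
chunks = list(iter_mini_batches(prompts, batch_size=2))  # last chunk is smaller
full = list(iter_mini_batches(prompts))                  # one chunk with all prompts
```

Smaller chunks lower peak GPU memory at the cost of more forward passes, which is the trade-off these parameters expose.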

Inputs (make_experience)

Name	Type	Required	Description
input_ids	torch.Tensor	Yes	Tokenized input prompts [batch_size, seq_len]
attention_mask	torch.Tensor	Yes	Attention mask for input prompts
gt_answer	Any	No	Ground truth answers for reward computation
**generate_kwargs	dict	No	Generation parameters (max_length, stop_token_ids, use_cache, etc.)
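The usage example on this page passes stop_token_ids as a list of token-id sequences. One way stop-token detection could translate into an action mask is sketched below with plain lists. The function names are hypothetical, and two behaviors are assumptions rather than documented facts about naive.py: the stop sequence itself is kept inside the mask, and everything after it is masked out.

```python
from typing import List, Sequence


def find_stop_end(generated: Sequence[int], stop_token_ids: Sequence[Sequence[int]]) -> int:
    """Return the index just past the earliest-ending stop sequence.

    Returns len(generated) when no stop sequence occurs (hypothetical helper).
    """
    for end in range(1, len(generated) + 1):
        for stop in stop_token_ids:
            n = len(stop)
            if n and end >= n and list(generated[end - n:end]) == list(stop):
                return end
    return len(generated)


def action_mask_from_stop(
    generated: Sequence[int], stop_token_ids: Sequence[Sequence[int]]
) -> List[int]:
    """1 for tokens up to and including the stop sequence, 0 afterwards (assumption)."""
    end = find_stop_end(generated, stop_token_ids)
    return [1] * end + [0] * (len(generated) - end)


# Token 2 plays the role of an end-of-sequence token here.
mask = action_mask_from_stop([4, 9, 2, 7, 7], stop_token_ids=[[2]])
```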

Outputs (make_experience)

Name	Type	Description
Experience	Experience	Object holding sequences, action_log_probs, value, reward, kl, advantages, attention_mask, and action_mask (all on CPU)
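The Description mentions converting sequences from the left-padded layout used for generation to the right-padded layout used by the reward and critic models. A per-row sketch of that conversion with plain lists is shown here. left_to_right_padded is a hypothetical helper, and locating the padding through the attention mask (rather than by comparing against the pad token id) is an assumption made so that a pad id that also appears as a real token is handled correctly.

```python
from typing import List, Tuple


def left_to_right_padded(
    input_ids: List[int], attention_mask: List[int], pad_token_id: int
) -> Tuple[List[int], List[int]]:
    """Move leading padding to the end of one row (hypothetical helper)."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")
    # Count leading positions that the attention mask marks as padding.
    n_pad = 0
    for m in attention_mask:
        if m != 0:
            break
        n_pad += 1
    ids = input_ids[n_pad:] + [pad_token_id] * n_pad
    mask = attention_mask[n_pad:] + [0] * n_pad
    return ids, mask


ids, mask = left_to_right_padded([0, 0, 5, 6, 7], [0, 0, 1, 1, 1], pad_token_id=0)
```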

Usage Examples

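from coati.experience_maker.naive import NaiveExperienceMaker

# actor_model, critic_model, reward_model, ref_model, tokenizer, input_ids, and
# attention_mask are assumed to be created elsewhere before this snippet runs.

# PPO mode: a critic is required for value estimation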
experience_maker = NaiveExperienceMaker(
    actor=actor_model,
    critic=critic_model,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    kl_coef=0.01,
    gamma=1.0,
    lam=0.95,
    use_grpo=False,
)

experience = experience_maker.make_experience(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=2048,
    stop_token_ids=[[tokenizer.eos_token_id]],
)

# GRPO mode: no critic; num_generation responses are sampled per prompt
experience_maker_grpo = NaiveExperienceMaker(
    actor=actor_model,
    critic=None,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    use_grpo=True,
    num_generation=8,
)
