Implementation:Hpcaitech ColossalAI NaiveExperienceMaker

From Leeroopedia


Knowledge Sources
Domains: RLHF, PPO, GRPO, Experience_Generation
Last Updated: 2026-02-09 00:00 GMT

Overview

naive.py implements the NaiveExperienceMaker class, which generates rollout experiences for PPO and GRPO training by combining model generation, reward computation, and advantage calculation.

Description

NaiveExperienceMaker extends the ExperienceMaker base class to produce the Experience objects used in reinforcement learning from human feedback (RLHF). Its make_experience method generates text sequences with the actor model, computes action log probabilities under both the policy and the initial (reference) model, scores the sequences with a reward model, and calculates advantages.

The advantage calculation depends on the mode. In PPO mode, a critic model estimates values and the calculate_advantage method computes Generalized Advantage Estimation (GAE). In GRPO mode, the class generates multiple responses per prompt, normalizes rewards within each group by the group mean and standard deviation, and uses the resulting group-relative advantage as the training signal, together with KL divergence penalties.

The class also handles left-padding, stop-token detection, and conversion between the left-padded format used for generation and the right-padded format used by the reward and critic models. Inference runs in configurable mini-batches to limit GPU memory use.
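The group-relative normalization described for GRPO mode can be illustrated with a minimal sketch. It uses plain Python lists rather than tensors so that it runs without PyTorch. The function name group_relative_advantages, the eps stabilizer, and the choice of population standard deviation are assumptions made for illustration and are not taken from naive.py.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float], num_generation: int, eps: float = 1e-4
) -> List[float]:
    """Normalize rewards within each group of `num_generation` responses.

    `rewards` is flat and ordered so that all responses to one prompt are
    contiguous. `eps` is a hypothetical stabilizer that avoids division by
    zero when every response in a group receives the same reward.
    """
    if len(rewards) % num_generation != 0:
        raise ValueError("len(rewards) must be a multiple of num_generation")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generation):
        group = rewards[start : start + num_generation]
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population std; real code may differ
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts with two sampled responses each.
grpo_adv = group_relative_advantages([1.0, 3.0, 2.0, 2.0], num_generation=2)
```

Within the first group the better response gets a positive advantage and the worse one a negative advantage of equal magnitude. In the second group both rewards are identical, so both advantages are zero and that group contributes no learning signal.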

Usage

Use this class in PPO or GRPO training pipelines to generate experience tuples (sequences, action log probs, values, rewards, KL divergence, advantages, attention masks, action masks) from prompts. It is typically used within a PPO trainer loop that alternates between experience generation and policy optimization.

Code Reference

Source Location

Signature

class NaiveExperienceMaker(ExperienceMaker):
    def __init__(
        self,
        actor: PreTrainedModel,
        critic: Critic,
        reward_model: RewardModel,
        initial_model: PreTrainedModel,
        tokenizer: PreTrainedTokenizer,
        kl_coef: float = 0.01,
        gamma: float = 1.0,
        lam: float = 0.95,
        use_grpo: bool = False,
        num_generation: int = 8,
        inference_batch_size: int = None,
        logits_forward_batch_size: int = 2,
    ) -> None

Key Methods

@torch.inference_mode()
def calculate_advantage(
    self,
    value: torch.Tensor,
    reward: torch.Tensor,
    num_actions: int,
) -> torch.Tensor

@torch.no_grad()
def make_experience(
    self,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    gt_answer: Any = None,
    **generate_kwargs,
) -> Experience
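As a rough illustration of what calculate_advantage computes in PPO mode, the following sketch applies the standard GAE recursion to a single response. It works on plain Python lists instead of tensors, takes the number of actions from the list length instead of a num_actions argument, and assumes the value after the final token is zero. The function name gae_advantages is hypothetical, and the real method may differ in details such as batching and masking.

```python
from typing import List


def gae_advantages(
    values: List[float],
    rewards: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one response's action tokens.

    values[t] is the critic estimate at step t and rewards[t] the reward at
    step t. The value after the last step is assumed to be 0.
    """
    if len(values) != len(rewards):
        raise ValueError("values and rewards must have the same length")
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages


# A reward of 1.0 arrives only on the final token.
gae_adv = gae_advantages(values=[0.5, 0.5, 0.5], rewards=[0.0, 0.0, 1.0])
```

With gamma = 1.0 and lam = 0.95, the final-token advantage is 0.5 and each earlier token receives that amount decayed by a further factor of 0.95, which shows how lam controls how far credit propagates back through the response.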

Import

from coati.experience_maker.naive import NaiveExperienceMaker

I/O Contract

Inputs (__init__)

Name | Type | Required | Description
actor | PreTrainedModel | Yes | The policy model used for text generation
critic | Critic | Conditional | The critic model for value estimation (required for PPO, None for GRPO)
reward_model | RewardModel | Yes | Model for computing reward scores
initial_model | PreTrainedModel | Yes | Reference model for KL divergence computation
tokenizer | PreTrainedTokenizer | Yes | Tokenizer for decoding and padding
kl_coef | float | No | KL divergence penalty coefficient (default: 0.01)
gamma | float | No | Discount factor for GAE (default: 1.0)
lam | float | No | Lambda for GAE (default: 0.95)
use_grpo | bool | No | Use GRPO instead of PPO advantage calculation (default: False)
num_generation | int | No | Number of generations per prompt for GRPO (default: 8)
inference_batch_size | int | No | Batch size for inference mini-batches (default: None, which uses the full batch)
logits_forward_batch_size | int | No | Batch size for logits forward passes (default: 2)
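The effect of inference_batch_size can be pictured as slicing the prompt batch into consecutive chunks and processing them one at a time. The sketch below is a generic chunking helper under that assumption. The name iter_mini_batches is hypothetical and the helper is not part of the coati API.

```python
from typing import Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_mini_batches(
    items: Sequence[T], batch_size: Optional[int] = None
) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    A `batch_size` of None yields the whole input as a single batch, which
    mirrors the documented default of inference_batch_size.
    """
    size = max(len(items), 1) if batch_size is None else batch_size
    if size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


prompts = ["p0", "p1", "p2", "p3", "p4"]
chunks = [list(c) for c in iter_mini_batches(prompts, batch_size=2)]
full = [list(c) for c in iter_mini_batches(prompts)]
```

A smaller batch size lowers peak memory at the cost of more forward passes. The last chunk may be shorter than the others when the batch does not divide evenly.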

Inputs (make_experience)

Name | Type | Required | Description
input_ids | torch.Tensor | Yes | Tokenized input prompts [batch_size, seq_len]
attention_mask | torch.Tensor | Yes | Attention mask for input prompts
gt_answer | Any | No | Ground truth answers for reward computation
**generate_kwargs | dict | No | Generation parameters (max_length, stop_token_ids, use_cache, etc.)
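The Description notes that sequences are left-padded for generation and right-padded for the reward and critic models. A minimal sketch of that conversion on plain token-id lists follows. The function name left_to_right_padding is hypothetical, and the sketch assumes the pad token never appears as a genuine leading token. An implementation working on tensors would more likely rely on the attention mask.

```python
from typing import List


def left_to_right_padding(
    sequences: List[List[int]], pad_token_id: int
) -> List[List[int]]:
    """Move the leading pad tokens of each row to the end of that row.

    Row lengths are unchanged, so the result is still a rectangular batch.
    """
    converted: List[List[int]] = []
    for row in sequences:
        n_pad = 0
        # Count the run of pad tokens at the start of the row.
        while n_pad < len(row) and row[n_pad] == pad_token_id:
            n_pad += 1
        converted.append(row[n_pad:] + [pad_token_id] * n_pad)
    return converted


batch = [[0, 0, 5, 6], [7, 8, 9, 4]]
right_padded = left_to_right_padding(batch, pad_token_id=0)
```

The first row has its two leading pads moved to the end, while the second row, which has no padding, is returned unchanged.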

Outputs (make_experience)

Name | Type | Description
Experience | Experience | Container holding sequences, action_log_probs, value, reward, kl, advantages, attention_mask, and action_mask (all on CPU)
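To make the kl field and the kl_coef parameter more concrete, the sketch below shows one simple way a KL estimate can be formed from the policy and reference log probabilities and then used to penalize a reward. It is an assumption for illustration only: the estimator (masked mean of the log-ratio), the function names, and the way the penalty is applied are not taken from naive.py, which may use a different estimator.

```python
from typing import List


def masked_kl_estimate(
    policy_log_probs: List[float],
    ref_log_probs: List[float],
    action_mask: List[bool],
) -> float:
    """Mean of (log pi - log pi_ref) over the tokens kept by `action_mask`."""
    terms = [
        p - r
        for p, r, keep in zip(policy_log_probs, ref_log_probs, action_mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


def kl_penalized_reward(reward: float, kl: float, kl_coef: float = 0.01) -> float:
    """Subtract a KL penalty scaled by `kl_coef` from a scalar reward."""
    return reward - kl_coef * kl


# The third token is masked out (for example, padding after the stop token).
kl_value = masked_kl_estimate(
    policy_log_probs=[-1.0, -2.0, -3.0],
    ref_log_probs=[-1.5, -2.5, -9.0],
    action_mask=[True, True, False],
)
shaped_reward = kl_penalized_reward(reward=1.0, kl=kl_value, kl_coef=0.01)
```

Only the two unmasked tokens contribute to the estimate, so the large gap on the masked third token has no effect. A larger kl_coef pulls the policy more strongly toward the reference model.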

Usage Examples

from coati.experience_maker.naive import NaiveExperienceMaker

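# Assumes actor_model, critic_model, reward_model, ref_model, tokenizer,
# input_ids, and attention_mask have already been created.

# PPO mode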
experience_maker = NaiveExperienceMaker(
    actor=actor_model,
    critic=critic_model,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    kl_coef=0.01,
    gamma=1.0,
    lam=0.95,
    use_grpo=False,
)

experience = experience_maker.make_experience(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=2048,
    stop_token_ids=[[tokenizer.eos_token_id]],
)

# GRPO mode
experience_maker_grpo = NaiveExperienceMaker(
    actor=actor_model,
    critic=None,
    reward_model=reward_model,
    initial_model=ref_model,
    tokenizer=tokenizer,
    use_grpo=True,
    num_generation=8,
)
