
Principle:CarperAI Trlx Online RL Training

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, NLP, Training
Last Updated 2026-02-07 16:00 GMT

Overview

A training principle for optimizing language models via on-policy reinforcement learning using PPO with a live reward function.

Description

Online RL training is the core RLHF loop: the language model generates completions from prompts, a reward function scores them, and PPO updates the model parameters to maximize expected reward while staying close to a reference policy. This is "online" because the model generates fresh samples at each step and immediately uses them for optimization (on-policy learning).

The training loop involves: (1) rollout — generate completions and score with reward function, (2) advantage estimation — compute GAE advantages from rewards and value estimates, (3) PPO update — multiple optimization epochs over the rollout batch with clipped surrogate objective, and (4) reference model KL penalty — prevent reward hacking by penalizing divergence from the initial policy.
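Step (2) above can be sketched in isolation. The function below is a minimal pure-Python GAE computation; the names and the per-token reward layout are illustrative, not trlx's actual internals.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one completion.

    rewards: per-token rewards (with the sequence-level score typically
             added to the final token).
    values:  value-head estimates, one per token.
    """
    advantages = []
    last_adv = 0.0
    next_value = 0.0  # no bootstrapping past the final token
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v         # TD residual
        last_adv = delta + gamma * lam * last_adv  # discounted sum of residuals
        advantages.append(last_adv)
        next_value = v
    advantages.reverse()
    # Returns serve as regression targets for the value head.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma = lam = 1.0` this reduces to the Monte Carlo advantage, i.e. the sum of future rewards minus the current value estimate.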

Usage

Use online RL training when you have a live reward function that can score generated text in real time. This is the standard approach for RLHF when: a trained reward model is available, a rule-based quality measure exists (sentiment, toxicity), or an automated evaluation system can provide scores. Online RL is preferred over offline RL when fresh on-policy samples are important for exploration.
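A rule-based quality measure can be as simple as a callable that maps a batch of generated strings to scalar rewards. The toy scorer below is purely illustrative (the word list and length penalty are made up); a real setup would use a trained reward model or a sentiment classifier.

```python
# Toy rule-based reward: positive-word count minus a length penalty.
# Illustrative only -- not a production reward function.
POSITIVE = {"good", "great", "helpful", "clear"}

def reward_fn(samples):
    """Map a batch of generated strings to scalar rewards."""
    rewards = []
    for text in samples:
        words = text.lower().split()
        score = sum(w.strip(".,!?") in POSITIVE for w in words)
        score -= 0.01 * max(len(words) - 50, 0)  # discourage rambling
        rewards.append(float(score))
    return rewards
```

Because the scorer runs live on each rollout, it fits the online setting: every fresh batch of completions is scored immediately and used for the next PPO update.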

Theoretical Basis

The PPO training objective for language models:

\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\left[ R(x, y) - \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right]
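In practice the KL term is usually folded into the per-token reward signal rather than computed as a separate loss. A minimal sketch of that shaping, with illustrative variable names:

```python
def kl_shaped_rewards(logprobs, ref_logprobs, score, beta=0.1):
    """Fold the KL penalty into per-token rewards.

    logprobs / ref_logprobs: per-token log-probabilities of the sampled
    completion under the current and reference policies.
    score: sequence-level reward from the reward function, added to the
    final token.
    """
    per_token_kl = [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in per_token_kl]
    rewards[-1] += score  # sequence reward lands on the last token
    return rewards
```

Each token is penalized in proportion to how far the policy's log-probability drifts from the reference model, which keeps the optimized policy close to its starting point.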

The training loop per batch:

Pseudo-code:

# Abstract PPO training loop (not a real implementation)
for batch in prompts:
    # 1. Rollout: generate and score
    completions = model.generate(batch)
    rewards = reward_fn(completions)

    # 2. Compute advantages (KL penalty folded into per-token rewards)
    values = value_head(hidden_states)
    kl = logp_old - logp_ref                  # per-token divergence
    shaped_rewards = rewards - kl_coef * kl
    advantages = gae(shaped_rewards, values, gamma, lam)

    # 3. PPO update (multiple epochs over the same rollout batch)
    for epoch in range(ppo_epochs):
        ratio = exp(logp_new - logp_old)      # importance ratio
        clipped = clip(ratio, 1 - eps, 1 + eps) * advantages
        loss = -min(ratio * advantages, clipped) + vf_coef * value_loss
        optimizer.step(loss)

    # 4. Adapt the KL coefficient toward a target divergence
    kl_coef = update_kl_coef(mean(kl), target_kl)
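The adaptive update in step (4) can be realized with a proportional controller in the style of Ziegler et al.'s PPO-for-LM setup. The sketch below is one such controller; the horizon, batch size, and clip range are assumed defaults, not trlx's exact values.

```python
def update_kl_coef(kl_coef, current_kl, target_kl, horizon=10000, n_steps=256):
    """Proportional controller: raise the coefficient when measured KL
    exceeds the target, lower it when below (Ziegler-style)."""
    error = min(max(current_kl / target_kl - 1.0, -0.2), 0.2)  # clipped error
    return kl_coef * (1.0 + error * n_steps / horizon)
```

Clipping the relative error keeps any single update small, so the coefficient adapts smoothly instead of oscillating when the measured KL spikes.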

Related Pages

Implemented By

Uses Heuristic
