Principle:CarperAI Trlx Online RL Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, NLP, Training |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A training principle for optimizing language models via on-policy reinforcement learning using PPO with a live reward function.
Description
Online RL training is the core RLHF loop: the language model generates completions from prompts, a reward function scores them, and PPO updates the model parameters to maximize expected reward while staying close to a reference policy. This is "online" because the model generates fresh samples at each step and immediately uses them for optimization (on-policy learning).
The training loop involves: (1) rollout — generate completions and score with reward function, (2) advantage estimation — compute GAE advantages from rewards and value estimates, (3) PPO update — multiple optimization epochs over the rollout batch with clipped surrogate objective, and (4) reference model KL penalty — prevent reward hacking by penalizing divergence from the initial policy.
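Step (2), advantage estimation, can be sketched as a backward recursion over one rollout. This is a minimal dependency-free sketch; the function name and the terminal-value handling (bootstrapping zero past the last token) are illustrative, not trlx's internal API:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout.

    rewards: per-token rewards r_t; values: value-head estimates V(s_t).
    Computes delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), then the
    backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # Assume V = 0 beyond the final token (illustrative choice).
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```

With gamma = lam = 1 this reduces to reward-to-go minus the value baseline, which is a quick sanity check for the recursion.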
Usage
Use online RL training when you have a live reward function that can score generated text in real time. This is the standard approach for RLHF when: a trained reward model is available, a rule-based quality measure exists (sentiment, toxicity), or an automated evaluation system can provide scores. Online RL is preferred over offline RL when fresh on-policy samples are important for exploration.
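A rule-based reward function of the kind described above can be as simple as a callable mapping generated strings to scalar scores. The scoring rule below is a toy stand-in for a sentiment classifier or trained reward model; the word lists and scoring logic are purely illustrative:

```python
def reward_fn(samples):
    """Score each completion: +1 per positive word, -1 per negative word.

    A toy rule-based quality measure; a real setup would call a trained
    reward model or an automated evaluator here.
    """
    positive = {"good", "great", "excellent", "helpful"}
    negative = {"bad", "awful", "useless"}
    scores = []
    for text in samples:
        words = text.lower().split()
        score = sum(w in positive for w in words) - sum(w in negative for w in words)
        scores.append(float(score))
    return scores
```

The key contract is that the function scores a batch of generated texts in real time, so it can be called inside every rollout step of the online loop.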
Theoretical Basis
The PPO training objective for language models is the clipped surrogate loss:

L_CLIP(θ) = E_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1 − ε, 1 + ε) · Â_t ) ],  where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

with per-token rewards augmented by a KL penalty against the frozen reference policy, R_t = r_t − β · KL(π_θ ‖ π_ref), and advantages Â_t estimated with GAE.
The training loop per batch, in pseudo-code:
# Abstract PPO training loop (not a real implementation)
for batch in prompts:
    # 1. Rollout: generate completions and score them
    completions = model.generate(batch)
    rewards = reward_fn(completions)

    # 2. Advantage estimation from rewards and value-head estimates
    values = value_head(hidden_states)
    advantages = gae(rewards, values, gamma, lam)

    # 3. PPO update: multiple optimization epochs over the rollout batch
    for epoch in range(ppo_epochs):
        ratio = pi_new / pi_old
        clipped = clip(ratio, 1 - eps, 1 + eps) * advantages
        loss = -min(ratio * advantages, clipped) + vf_coef * value_loss

    # 4. Adaptive KL penalty against the reference policy
    kl = compute_kl(pi_new, pi_ref)
    kl_coef = update_kl_coef(kl, target_kl)
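Steps (3) and (4) of the pseudo-code above can be made concrete as follows. This is a sketch under stated assumptions: the loss is computed from per-token log-probabilities rather than raw probability ratios, and the adaptive KL controller follows the standard proportional scheme with illustrative constants (horizon, clamp range); neither function is trlx's actual implementation:

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate policy loss over a batch of tokens."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new / pi_old via log-probs
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + eps), 1 - eps) * adv
        total += -min(unclipped, clipped)  # maximize the surrogate objective
    return total / len(advantages)

def update_kl_coef(kl_coef, observed_kl, target_kl, horizon=10000, n_steps=1):
    """Adaptive KL controller: nudge the coefficient toward the target KL.

    If observed KL exceeds the target, the coefficient grows, pushing the
    policy back toward the reference; if below, it shrinks. The clamp and
    horizon values are illustrative.
    """
    proportional_error = max(min(observed_kl / target_kl - 1.0, 0.2), -0.2)
    mult = 1.0 + proportional_error * n_steps / horizon
    return kl_coef * mult
```

Note that when the policy has not yet moved (logp_new == logp_old), the ratio is 1 and clipping is inactive; the clip only bites once the ratio leaves the [1 − ε, 1 + ε] band, which is what keeps each PPO epoch's update bounded.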
Related Pages
Implemented By
- Implementation:CarperAI_Trlx_Trlx_Train_Online
- Implementation:CarperAI_Trlx_NeMo_PPO_Model
- Implementation:CarperAI_Trlx_NeMo_PPO_Trainer