Workflow: CarperAI trlX PPO Sentiment Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, RLHF |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
End-to-end process for online reinforcement learning fine-tuning of language models using Proximal Policy Optimization (PPO) with a live reward function.
Description
This workflow demonstrates the canonical trlX use case: training a language model to generate text that maximizes a reward signal using PPO. The model generates text completions from prompts, a reward function scores them in real time, and PPO updates the model weights to increase expected reward while maintaining proximity to the original model via KL divergence penalties. The process uses the unified trlx.train() API with a reward function callback, handling configuration, prompt pipeline setup, rollout generation, and iterative policy optimization automatically.
Usage
Execute this workflow when you have a pretrained language model and a computable reward function (e.g., a sentiment classifier, toxicity detector, or any scoring model) and want the model to generate outputs that maximize that reward. This is the standard approach for online RLHF where rewards can be computed programmatically at training time.
Execution Steps
Step 1: Configure training
Set up the training configuration by loading a default PPO config and optionally overriding hyperparameters. The configuration specifies the base model, tokenizer, optimizer settings, learning rate schedule, and PPO-specific parameters (KL coefficient, clipping range, number of rollouts, generation settings). Configuration can be loaded from a default factory function, a YAML file, or constructed programmatically.
Key considerations:
- Choose an appropriate base model (e.g., GPT-2 for prototyping, GPT-J-6B for production)
- Set sequence length and generation parameters (max_new_tokens, top_k, top_p)
- Tune PPO hyperparameters: init_kl_coef controls reward/KL trade-off, num_rollouts sets samples per batch
- Reduce batch_size and num_layers_unfrozen if memory-constrained
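A programmatic configuration sketch, assuming the `default_ppo_config` factory in `trlx.data.default_configs` (present in recent trlX releases); the specific override values are illustrative, not recommended settings:

```python
# Sketch of programmatic PPO configuration for trlX. The import lives
# inside the function so the module loads without trlX installed.
def make_config():
    from trlx.data.default_configs import default_ppo_config

    config = default_ppo_config()
    config.model.model_path = "gpt2"           # base model; swap for GPT-J-6B etc.
    config.tokenizer.tokenizer_path = "gpt2"
    config.train.seq_length = 64               # prompt + completion token budget
    config.train.batch_size = 16               # reduce if memory-constrained
    config.model.num_layers_unfrozen = 2       # freeze lower layers to save memory
    config.method.init_kl_coef = 0.05          # reward/KL trade-off
    config.method.num_rollouts = 128           # samples collected per PPO phase
    config.method.gen_kwargs["max_new_tokens"] = 40
    return config
```

The same settings can equivalently be expressed in a YAML file and loaded via the config class, which is convenient for sweeping hyperparameters.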
Step 2: Define the reward function
Implement a callable that takes generated text samples and returns a list of scalar reward values. The reward function signature accepts samples (generated texts) and optional keyword arguments, returning one float per sample. Common implementations use a pretrained classifier pipeline, a reward model, or a rule-based scoring function.
Key considerations:
- The reward function runs on every batch during training, so it must be efficient
- Use batched inference for reward model evaluation
- Consider placing the reward model on a separate GPU to avoid memory contention with the policy model
- Reward normalization (running mean/std) can stabilize training
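A minimal reward-function sketch matching the signature described above, using a rule-based scorer as a stand-in for a sentiment classifier; the `RunningNorm` helper is a hypothetical illustration of the normalization bullet (Welford's algorithm), not part of trlX:

```python
import math
from typing import List

# Stand-in vocabulary scorer; a real workflow would run a batched
# sentiment-classifier forward pass here instead.
POSITIVE = {"good", "great", "love", "wonderful", "enjoyed"}

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    """The shape trlx.train() expects: generated texts in, one float per sample."""
    return [float(sum(tok in POSITIVE for tok in s.lower().split()))
            for s in samples]

class RunningNorm:
    """Running mean/std reward normalization via Welford's online algorithm."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> float:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return (x - self.mean) / (std + 1e-8)
```

Because this callable runs on every rollout batch, a classifier-based version should tokenize and score all samples in a single batched forward pass rather than looping per sample.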
Step 3: Prepare prompts
Load or construct the prompt dataset that will be used to generate completions during training. Prompts are short text prefixes that the model will complete. An evaluation prompt set is also prepared for periodic validation. Prompts are tokenized and loaded into a PromptPipeline that handles batching and distribution across processes.
Key considerations:
- Prompts should be representative of the target distribution
- Evaluation prompts should be fixed across training for consistent metrics
- Prompts longer than max_prompt_length (seq_length minus max_new_tokens) will be truncated
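The truncation rule in the last bullet reduces to simple arithmetic over token ids; in this sketch `seq_length` and `max_new_tokens` come from the training config, and the helper name is illustrative rather than a trlX API:

```python
from typing import List

def truncate_prompt(token_ids: List[int],
                    seq_length: int,
                    max_new_tokens: int) -> List[int]:
    # The completion must fit inside the sequence budget, so the prompt is
    # capped at seq_length - max_new_tokens tokens (keeping the head, which
    # matches the tokenizer's default right-side truncation).
    max_prompt_length = seq_length - max_new_tokens
    return token_ids[:max_prompt_length]

# Training prompts drive rollout generation; a fixed evaluation set gives
# metrics that are comparable across the whole run.
prompts = ["I rate this movie as"] * 256
eval_prompts = ["I rate this movie as"] * 32
```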
Step 4: Launch PPO training
Call the unified trlx.train() entry point with the reward function, prompts, evaluation prompts, and configuration. This dispatches to the AcceleratePPOTrainer, which orchestrates the training loop: generating rollouts from the current policy, scoring them with the reward function, computing PPO loss (policy gradient with clipping, value function loss, KL penalty), and updating model weights. Training progress is logged to Weights & Biases by default.
Key considerations:
- Training alternates between generation (rollout) and optimization (PPO update) phases
- A frozen reference model copy is maintained for KL divergence computation
- The trainer supports DeepSpeed ZeRO for memory-efficient distributed training
- Checkpoints are saved at configurable intervals
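Putting the pieces together, a hedged end-to-end launch sketch: the toy `reward_fn` stands in for a sentiment classifier, and the `trlx.train()` call assumes a recent trlX release with the unified entry point described above.

```python
from typing import List

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    # Toy stand-in reward; the sentiment workflow would instead return a
    # classifier's positive-class score per generated sample.
    positive = {"good", "great", "love", "wonderful"}
    return [float(sum(tok in positive for tok in s.lower().split()))
            for s in samples]

if __name__ == "__main__":
    import trlx
    from trlx.data.default_configs import default_ppo_config

    # Dispatches to AcceleratePPOTrainer: rollout generation, reward
    # scoring, and PPO updates alternate until train.total_steps.
    trainer = trlx.train(
        reward_fn=reward_fn,
        prompts=["I rate this movie as"] * 256,
        eval_prompts=["I rate this movie as"] * 32,
        config=default_ppo_config(),
    )
```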
Step 5: Evaluate and save
After training completes, trlx.train() returns a trainer object that wraps the fine-tuned model. The model can be used for generation directly or saved to disk in HuggingFace format for later use. Evaluation metrics (reward scores on the eval prompts) are logged throughout training.
Key considerations:
- Use trainer.generate() to test the model interactively
- Save with trainer.save_pretrained() for HuggingFace-compatible output
- The saved model can be uploaded to HuggingFace Hub directly
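A hypothetical post-training helper tying the bullets above together; it assumes the trainer object returned by `trlx.train()` and a HuggingFace tokenizer, and `export_and_smoke_test` is an illustrative name, not a trlX API:

```python
def export_and_smoke_test(trainer, tokenizer,
                          output_dir: str = "ppo-sentiment-gpt2") -> None:
    # Quick interactive check: trainer.generate operates on token ids.
    input_ids = tokenizer("I rate this movie as", return_tensors="pt").input_ids
    sample = trainer.generate(input_ids)
    print(tokenizer.decode(sample[0], skip_special_tokens=True))

    # HuggingFace-format save: the directory can be reloaded with
    # AutoModelForCausalLM.from_pretrained or pushed to the Hub.
    trainer.save_pretrained(output_dir)
```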