Workflow: CarperAI trlX PPO Sentiment Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, RLHF |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
End-to-end process for online reinforcement learning fine-tuning of language models using Proximal Policy Optimization (PPO) with a live reward function.
Description
This workflow demonstrates the canonical trlX use case: training a language model to generate text that maximizes a reward signal using PPO. The model generates text completions from prompts, a reward function scores them in real time, and PPO updates the model weights to increase expected reward while maintaining proximity to the original model via KL divergence penalties. The process uses the unified trlx.train() API with a reward function callback, handling configuration, prompt pipeline setup, rollout generation, and iterative policy optimization automatically.
Usage
Execute this workflow when you have a pretrained language model and a computable reward function (e.g., a sentiment classifier, toxicity detector, or any scoring model) and want the model to generate outputs that maximize that reward. This is the standard approach for online RLHF where rewards can be computed programmatically at training time.
Execution Steps
Step 1: Configure training
Set up the training configuration by loading a default PPO config and optionally overriding hyperparameters. The configuration specifies the base model, tokenizer, optimizer settings, learning rate schedule, and PPO-specific parameters (KL coefficient, clipping range, number of rollouts, generation settings). Configuration can be loaded from a default factory function, a YAML file, or constructed programmatically.
Key considerations:
- Choose an appropriate base model (e.g., GPT-2 for prototyping, GPT-J-6B for production)
- Set sequence length and generation parameters (max_new_tokens, top_k, top_p)
- Tune PPO hyperparameters: init_kl_coef controls reward/KL trade-off, num_rollouts sets samples per batch
- Reduce batch_size and num_layers_unfrozen if memory-constrained
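A programmatic configuration sketch, assuming the `default_ppo_config` factory in `trlx.data.default_configs` (present in recent trlX releases); the specific override values are illustrative, not recommended settings:

```python
# Sketch of programmatic PPO configuration for trlX. The import lives
# inside the function so the module loads without trlX installed.
def make_config():
    from trlx.data.default_configs import default_ppo_config

    config = default_ppo_config()
    config.model.model_path = "gpt2"           # base model; swap for GPT-J-6B etc.
    config.tokenizer.tokenizer_path = "gpt2"
    config.train.seq_length = 64               # prompt + completion token budget
    config.train.batch_size = 16               # reduce if memory-constrained
    config.model.num_layers_unfrozen = 2       # freeze lower layers to save memory
    config.method.init_kl_coef = 0.05          # reward/KL trade-off
    config.method.num_rollouts = 128           # samples collected per PPO phase
    config.method.gen_kwargs["max_new_tokens"] = 40
    return config
```

The same settings can equivalently be expressed in a YAML file and loaded via the config class, which is convenient for sweeping hyperparameters.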
Step 2: Define the reward function
Implement a callable that takes generated text samples and returns a list of scalar reward values. The reward function signature accepts samples (generated texts) and optional keyword arguments, returning one float per sample. Common implementations use a pretrained classifier pipeline, a reward model, or a rule-based scoring function.
Key considerations:
- The reward function runs on every batch during training, so it must be efficient
- Use batched inference for reward model evaluation
- Consider placing the reward model on a separate GPU to avoid memory contention with the policy model
- Reward normalization (running mean/std) can stabilize training
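A minimal reward-function sketch matching the signature described above, using a rule-based scorer as a stand-in for a sentiment classifier; the `RunningNorm` helper is a hypothetical illustration of the normalization bullet (Welford's algorithm), not part of trlX:

```python
import math
from typing import List

# Stand-in vocabulary scorer; a real workflow would run a batched
# sentiment-classifier forward pass here instead.
POSITIVE = {"good", "great", "love", "wonderful", "enjoyed"}

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    """The shape trlx.train() expects: generated texts in, one float per sample."""
    return [float(sum(tok in POSITIVE for tok in s.lower().split()))
            for s in samples]

class RunningNorm:
    """Running mean/std reward normalization via Welford's online algorithm."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> float:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return (x - self.mean) / (std + 1e-8)
```

Because this callable runs on every rollout batch, a classifier-based version should tokenize and score all samples in a single batched forward pass rather than looping per sample.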
Step 3: Prepare prompts
Load or construct the prompt dataset that will be used to generate completions during training. Prompts are short text prefixes that the model will complete. An evaluation prompt set is also prepared for periodic validation. Prompts are tokenized and loaded into a PromptPipeline that handles batching and distribution across processes.
Key considerations:
- Prompts should be representative of the target distribution
- Evaluation prompts should be fixed across training for consistent metrics
- Prompts longer than max_prompt_length (seq_length minus max_new_tokens) will be truncated
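The truncation rule in the last bullet reduces to simple arithmetic over token ids; in this sketch `seq_length` and `max_new_tokens` come from the training config, and the helper name is illustrative rather than a trlX API:

```python
from typing import List

def truncate_prompt(token_ids: List[int],
                    seq_length: int,
                    max_new_tokens: int) -> List[int]:
    # The completion must fit inside the sequence budget, so the prompt is
    # capped at seq_length - max_new_tokens tokens (keeping the head, which
    # matches the tokenizer's default right-side truncation).
    max_prompt_length = seq_length - max_new_tokens
    return token_ids[:max_prompt_length]

# Training prompts drive rollout generation; a fixed evaluation set gives
# metrics that are comparable across the whole run.
prompts = ["I rate this movie as"] * 256
eval_prompts = ["I rate this movie as"] * 32
```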
Step 4: Launch PPO training
Call the unified trlx.train() entry point with the reward function, prompts, evaluation prompts, and configuration. This dispatches to the AcceleratePPOTrainer, which orchestrates the training loop: generating rollouts from the current policy, scoring them with the reward function, computing PPO loss (policy gradient with clipping, value function loss, KL penalty), and updating model weights. Training progress is logged to Weights & Biases by default.
Key considerations:
- Training alternates between generation (rollout) and optimization (PPO update) phases
- A frozen reference model copy is maintained for KL divergence computation
- The trainer supports DeepSpeed ZeRO for memory-efficient distributed training
- Checkpoints are saved at configurable intervals
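Putting the pieces together, a hedged end-to-end launch sketch: the toy `reward_fn` stands in for a sentiment classifier, and the `trlx.train()` call assumes a recent trlX release with the unified entry point described above.

```python
from typing import List

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    # Toy stand-in reward; the sentiment workflow would instead return a
    # classifier's positive-class score per generated sample.
    positive = {"good", "great", "love", "wonderful"}
    return [float(sum(tok in positive for tok in s.lower().split()))
            for s in samples]

if __name__ == "__main__":
    import trlx
    from trlx.data.default_configs import default_ppo_config

    # Dispatches to AcceleratePPOTrainer: rollout generation, reward
    # scoring, and PPO updates alternate until train.total_steps.
    trainer = trlx.train(
        reward_fn=reward_fn,
        prompts=["I rate this movie as"] * 256,
        eval_prompts=["I rate this movie as"] * 32,
        config=default_ppo_config(),
    )
```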
Step 5: Evaluate and save
After training completes, trlx.train() returns a trainer object that wraps the fine-tuned model. The model can be used for generation directly or saved to disk in HuggingFace format for later use. Evaluation metrics (reward scores on the eval prompts) are logged throughout training.
Key considerations:
- Use trainer.generate() to test the model interactively
- Save with trainer.save_pretrained() for HuggingFace-compatible output
- The saved model can be uploaded to HuggingFace Hub directly
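A hypothetical post-training helper tying the bullets above together; it assumes the trainer object returned by `trlx.train()` and a HuggingFace tokenizer, and `export_and_smoke_test` is an illustrative name, not a trlX API:

```python
def export_and_smoke_test(trainer, tokenizer,
                          output_dir: str = "ppo-sentiment-gpt2") -> None:
    # Quick interactive check: trainer.generate operates on token ids.
    input_ids = tokenizer("I rate this movie as", return_tensors="pt").input_ids
    sample = trainer.generate(input_ids)
    print(tokenizer.decode(sample[0], skip_special_tokens=True))

    # HuggingFace-format save: the directory can be reloaded with
    # AutoModelForCausalLM.from_pretrained or pushed to the Hub.
    trainer.save_pretrained(output_dir)
```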