
Workflow:CarperAI Trlx RLHF Summarization Pipeline

From Leeroopedia


Knowledge Sources
Domains LLMs, RLHF, Summarization, Reward_Modeling
Last Updated 2026-02-07 16:00 GMT

Overview

End-to-end three-stage Reinforcement Learning from Human Feedback (RLHF) pipeline for training a summarization model, following the approach from "Learning to Summarize from Human Feedback."

Description

This workflow implements the complete RLHF training pipeline for text summarization using GPT-J-6B as the base model. It follows three sequential stages: (1) supervised fine-tuning on the TL;DR summarization dataset, (2) training a reward model on human comparison data that predicts which summary a human would prefer, and (3) PPO training that uses the reward model to further optimize the SFT model for generating summaries humans prefer. This is the most comprehensive example in the trlX repository, demonstrating the full RLHF workflow from raw data to aligned model.

Usage

Execute this workflow when you want to train a text summarization model using the full RLHF pipeline. It requires the CarperAI TL;DR summarization dataset, the human comparison data, and significant compute resources (at least 55GB of VRAM across two GPUs for the PPO stage) due to the GPT-J-6B model size.

Execution Steps

Step 1: Supervised fine-tuning on TL;DR

Fine-tune the base GPT-J-6B model on the TL;DR summarization dataset using standard supervised learning. This stage uses the HuggingFace Trainer directly (not trlX) with DeepSpeed for memory efficiency. The model learns to generate summaries given Reddit posts with a "TL;DR:" prompt format. Training uses gradient checkpointing and FP16 mixed precision.

Key considerations:

  • Uses the CarperAI/openai_summarize_tldr dataset
  • Input format is post text followed by "TL;DR:" separator
  • Max input length is 550 tokens to fit prompt and summary
  • DeepSpeed with FP16 is required due to GPT-J-6B model size
  • ROUGE metrics are computed during evaluation
  • The resulting checkpoint serves as the base for both reward model and PPO stages
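The input format and token budget above can be sketched in plain Python. This is a minimal illustration with hypothetical helper names; the real pipeline tokenizes with the GPT-J tokenizer, approximated here by whitespace splits.

```python
# Sketch of the SFT input format: post text, a "TL;DR:" separator, then the
# summary, truncated to a 550-token budget. Helper names are illustrative;
# whitespace tokens stand in for the model's BPE tokens.

MAX_INPUT_TOKENS = 550  # prompt plus summary must fit in this budget

def format_sft_example(post: str, summary: str) -> str:
    """Join a Reddit post and its summary with the TL;DR: separator."""
    return f"{post}\nTL;DR: {summary}"

def truncate_to_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Crude token-budget truncation (whitespace tokens, not real BPE)."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

example = format_sft_example("A long Reddit post about something...",
                             "A short summary.")
print(truncate_to_budget(example))
```

In the actual training script the truncation is done on token IDs after encoding, so that the 550-token limit matches what the model sees.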

Step 2: Train the reward model

Initialize a reward model from the SFT checkpoint and train it on human comparison data. The reward model adds a scalar value head on top of the transformer and is trained with a pairwise ranking loss: given a chosen and rejected summary for the same prompt, it learns to assign higher scores to the preferred summary. The bottom 70% of transformer layers are frozen to prevent catastrophic forgetting.

Key considerations:

  • Uses CarperAI/openai_summarize_comparisons dataset with chosen/rejected pairs
  • The GPTRewardModel architecture adds a linear value head to GPT-J
  • Pairwise training: each batch contains both chosen and rejected examples
  • Accuracy metric measures how often the model ranks chosen above rejected
  • Only the top 30% of layers plus the value head are trainable
  • DeepSpeed is used for memory-efficient training
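The pairwise ranking objective can be written down concisely: for each (chosen, rejected) pair the loss is `-log(sigmoid(r_chosen - r_rejected))`, which is small when the chosen summary scores higher. A minimal sketch with illustrative function names (the actual GPTRewardModel produces the scalar scores from a value head on GPT-J):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Low loss when the chosen summary outscores the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def batch_accuracy(pairs) -> float:
    """Fraction of pairs where the chosen score beats the rejected score."""
    return sum(rc > rr for rc, rr in pairs) / len(pairs)

# Correctly ranked scores yield a low loss; inverted scores a high one:
print(pairwise_ranking_loss(2.0, -1.0))  # ≈ 0.049
print(pairwise_ranking_loss(-1.0, 2.0))  # ≈ 3.049
```

The accuracy metric in the bullet list is exactly `batch_accuracy` computed over the held-out comparison pairs.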

Step 3: Prepare the PPO training environment

Load the trained reward model checkpoint and wrap it in a scoring function that trlX can call during PPO training. Prepare the prompt dataset by extracting and formatting prompts from the TL;DR dataset with proper truncation. Build a lookup dictionary mapping prompts to their original reference summaries for delta-reward computation (the reward is the improvement over the original summary score).

Key considerations:

  • The reward model is loaded on a separate GPU (cuda:1) from the policy model
  • Delta-reward (score minus original summary score) is used instead of absolute reward
  • Prompts must be carefully truncated to leave room for the "TL;DR:" suffix
  • A post-to-summary dictionary enables computing baseline scores for normalization
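The delta-reward wrapping can be sketched as follows, with illustrative names: a raw scoring function is closed over together with the post-to-reference dictionary, and each generated summary is rewarded by its score minus the baseline score of the original reference summary.

```python
# Sketch of delta-reward wrapping (hypothetical helper names): the reward
# handed to PPO is score(generated) - score(reference), so the policy is
# rewarded for *improving on* the original summary rather than for raw score.

def make_delta_reward_fn(score_fn, post_to_reference):
    """Wrap a raw scorer into a batch reward function for PPO training."""
    def reward_fn(prompts, summaries):
        rewards = []
        for prompt, summary in zip(prompts, summaries):
            baseline = score_fn(prompt, post_to_reference[prompt])
            rewards.append(score_fn(prompt, summary) - baseline)
        return rewards
    return reward_fn

# Toy scorer standing in for the reward model: longer summaries score higher.
toy_score = lambda prompt, summary: float(len(summary))
reward_fn = make_delta_reward_fn(toy_score, {"post": "short"})
print(reward_fn(["post"], ["a longer summary"]))  # positive delta
```

In the real pipeline `score_fn` runs the trained reward model on cuda:1 so it never competes with the policy model for memory.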

Step 4: Run PPO optimization

Launch PPO training using the trlX framework with the SFT model as the starting policy, the reward model scoring function, and the prepared prompts. The AcceleratePPOTrainer generates summary completions, scores them with the reward model, computes PPO loss with KL penalty against the SFT reference model, and iteratively updates the policy to maximize reward.

Key considerations:

  • Requires Accelerate multi-GPU config (at least 2 GPUs with 55GB+ VRAM total)
  • 8 layers are unfrozen in the policy model for efficient training
  • PPO uses 128 rollouts per batch with chunk_size=16 for generation
  • KL coefficient (init_kl_coef=0.1) prevents the model from diverging too far from SFT
  • Cosine annealing schedule with 100K total steps
  • 1000 validation prompts are sampled for evaluation speed
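The role of the KL coefficient above can be illustrated with a small sketch (illustrative names; trlX applies this penalty per token inside the PPO trainer): the environment reward is reduced by `init_kl_coef` times the sampled log-probability gap between the policy and the frozen SFT reference.

```python
# Sketch of the KL-penalized reward used in PPO: subtracting the policy/
# reference log-prob gap discourages the policy from drifting far from SFT.

INIT_KL_COEF = 0.1  # matches init_kl_coef in the PPO config

def kl_penalized_reward(env_reward, policy_logprobs, ref_logprobs,
                        kl_coef=INIT_KL_COEF):
    """Reward minus kl_coef times the sampled KL, summed over tokens."""
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return env_reward - kl_coef * kl

# A policy that assigns higher log-probs than the reference pays a penalty:
print(kl_penalized_reward(1.0, [-0.5, -0.4], [-1.0, -1.0]))  # 1.0 - 0.1*1.1
```

With `kl_coef=0.1` the penalty is gentle early in training but grows as the policy's distribution moves away from the SFT model, which is what keeps generations fluent.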

Step 5: Evaluate with ROUGE and reward scores

After PPO training, evaluate the model by generating summaries on the test set and computing ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries, as well as reward model scores. Compare PPO-trained model metrics against the SFT baseline to measure improvement.

Key considerations:

  • PPO models typically show improved reward scores but may have slightly lower ROUGE
  • The reward-ROUGE trade-off reflects the model optimizing for human preference rather than n-gram overlap
  • A separate inference script handles batched evaluation on the full test set
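To make the reward-ROUGE trade-off concrete, here is a toy ROUGE-1 F1 computation (unigram overlap only; the actual evaluation uses a full ROUGE implementation with ROUGE-2 and ROUGE-L as well, so treat this as a sketch):

```python
# Minimal ROUGE-1 F1 sketch: harmonic mean of unigram precision and recall
# between a candidate summary and a reference summary.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat sat on the mat"))  # ≈ 0.667
```

A PPO-trained model can raise its reward-model score by rephrasing in ways humans prefer while matching fewer reference n-grams, which is why its ROUGE can dip slightly below the SFT baseline.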

Execution Diagram

GitHub URL

Workflow Repository