Workflow:Hiyouga LLaMA Factory PPO RLHF Training

Knowledge Sources	LLaMA-Factory LLaMA-Factory Docs RLHF Paper
Domains	LLMs, Fine_Tuning, RLHF, PPO
Last Updated	2026-02-06 19:00 GMT

Overview

End-to-end process for aligning a language model with human preferences using Proximal Policy Optimization (PPO) with a trained reward model in the full RLHF pipeline.

Description

This workflow implements the complete Reinforcement Learning from Human Feedback pipeline using PPO. Unlike DPO which optimizes preferences directly, PPO requires a separate trained reward model to score generated responses. The training loop alternates between generating responses from the policy model, scoring them with the reward model, computing advantages, and updating the policy using the PPO objective with KL divergence regularization against a reference model. This is the most complex but also the most flexible alignment approach, allowing for arbitrary reward signals.

Usage

Execute this workflow when you have a trained reward model and need fine-grained control over the alignment objective. PPO is preferred when the reward signal is complex (e.g., combining multiple criteria), when the reward model captures nuances that paired preference data cannot, or when dynamic response generation during training is important for exploration. Requires more compute and careful hyperparameter tuning than DPO.

Execution Steps

Step 1: Configuration

Define the PPO training job specifying the SFT-tuned actor model, the trained reward model path, PPO-specific hyperparameters, and the unsupervised prompt dataset. The configuration must set stage: ppo and provide the reward model path.

Key considerations:

The actor model should be a previously SFT-tuned model
Set reward_model to the path of a trained reward model (from the RM training stage)
The dataset should contain prompts only (no responses), as responses are generated during training
PPO hyperparameters include: learning rate, KL penalty coefficient, number of PPO epochs per batch, clip range, and value function coefficient
A value head is attached to the actor model for advantage estimation

Step 2: Argument Parsing and Validation

Parse and validate the PPO-specific configuration. The parser ensures the reward model path is valid, the dataset format matches the PPO stage requirements (prompts only), and the actor model is compatible with value head attachment.

What happens:

Arguments are parsed with stage set to "ppo"
The unsupervised data processor is selected (processes prompts without responses)
Reward model path is validated
Value head configuration is prepared for the actor model

Step 3: Prompt Data Loading

Load the prompt dataset for PPO training. Unlike SFT or DPO, PPO only needs prompts (without responses) since the model generates its own responses during training. The prompts are formatted using the model's chat template.

Key considerations:

Data format is unsupervised: only prompts are provided, no target responses
The unsupervised processor formats prompts with the chat template
Prompts should be representative of the target use case
Batch sizes affect the diversity of generated responses per update

Step 4: Model Loading (Actor, Reference, Reward)

Load three models: the actor model (policy to optimize) with a value head, a frozen reference model for KL regularization, and the reward model for scoring generated responses. This is the most memory-intensive step as three full models must be held simultaneously.

What happens:

Actor model: Loaded with LoRA adapters and a value head for advantage estimation
Reference model: A frozen copy of the actor for computing KL divergence penalty
Reward model: Loaded separately with its own value head for scoring responses
Value head parameters are loaded from the reward model checkpoint
All models share the same tokenizer

Step 5: PPO Training Loop

Execute the PPO training loop which alternates between response generation, reward scoring, advantage computation, and policy updates. Each iteration generates a batch of responses, scores them, computes per-token advantages, and performs multiple PPO update epochs on the collected experience.

What happens:

Generation phase: The actor model generates responses to prompt batches using sampling
Scoring phase: The reward model scores each generated response
Advantage computation: Per-token advantages are computed using Generalized Advantage Estimation (GAE)
KL penalty: KL divergence between actor and reference model log-probabilities is computed as a regularization term
Policy update: Multiple PPO epochs update the actor model using the clipped surrogate objective
Value update: The value head is updated to better predict future rewards
Reward statistics (mean, variance) are tracked for monitoring alignment progress

Step 6: Save Aligned Model

Save the PPO-aligned actor model weights (without the value head) and the training state. The value head parameters are discarded as they are only needed during PPO training.

Key considerations:

The value head is removed before saving the final model
For LoRA PPO: only the updated adapter weights are saved
The aligned model represents the final policy: SFT-tuned + reward-aligned
The model can be used directly for inference or further processed through the export workflow

Execution Diagram

GitHub URL

Workflow Repository