Principle: Huggingface TRL PPO Multi Model Loading
| Property | Value |
|---|---|
| Principle Name | PPO Multi Model Loading |
| Technology | Huggingface TRL |
| Category | Model Architecture |
| Workflow | PPO RLHF Training |
| Implementation | Implementation:Huggingface_Trl_PPO_Model_Loading_Pattern |
Overview
Description
PPO-based RLHF training requires loading four distinct neural network models that work together during training. The policy model generates responses and is the model being optimized. The reference policy provides a frozen baseline for computing the KL divergence penalty. The reward model scores generated responses to provide the training signal. The value model estimates the expected future reward at each token position for advantage computation.
This multi-model architecture makes PPO RLHF significantly more memory-intensive than supervised fine-tuning, requiring careful memory management through techniques like DeepSpeed ZeRO, PEFT/LoRA, and quantization.
Usage
All four models are loaded explicitly in the PPO training script and passed to PPOTrainer. The policy and reference models are loaded from the SFT model path, while the reward and value models are loaded from the reward model path.
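The loading step described above can be sketched as follows. The helper name `load_ppo_models` and the paths are hypothetical; the auto-classes are the real transformers entry points:

```python
def load_ppo_models(sft_path: str, rm_path: str):
    """Load the four models PPO needs (sketch; paths are placeholders)."""
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSequenceClassification,
    )

    # Policy (trainable) and frozen reference both start from the SFT checkpoint.
    policy = AutoModelForCausalLM.from_pretrained(sft_path)
    ref_policy = AutoModelForCausalLM.from_pretrained(sft_path)

    # Reward (frozen) and value (trainable) models start from the reward
    # checkpoint; num_labels=1 gives a single scalar head.
    reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)
    value_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)

    return policy, ref_policy, reward_model, value_model
```

In recent TRL releases these are passed to PPOTrainer via its model, ref_model, reward_model, and value_model arguments; verify against the installed version, since the PPO API has changed across releases.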
Theoretical Basis
Actor-Critic Architecture
PPO implements an actor-critic reinforcement learning framework:
- Actor (Policy): A causal language model (AutoModelForCausalLM) that generates text responses. This model's parameters are updated through the PPO objective.
- Critic (Value Model): A sequence classifier (AutoModelForSequenceClassification with num_labels=1) that estimates the value function V(s) at each state (token position). It predicts the expected cumulative reward from that position onward.
The actor and critic are combined into a single PolicyAndValueWrapper that performs joint forward passes, sharing the input processing while producing both text generation logits and value predictions.
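A minimal sketch of such a wrapper, assuming PyTorch modules. TRL's actual PolicyAndValueWrapper additionally reuses the value model's backbone hidden states rather than running two independent full forward passes; this simplified version only shows the joint-output structure:

```python
import torch.nn as nn


class PolicyAndValueWrapper(nn.Module):
    """Simplified sketch of a joint actor-critic wrapper."""

    def __init__(self, policy: nn.Module, value_model: nn.Module):
        super().__init__()
        self.policy = policy
        self.value_model = value_model

    def forward(self, input_ids, attention_mask=None):
        # Actor: per-token vocabulary logits for generation / logprobs.
        logits = self.policy(input_ids, attention_mask=attention_mask)
        # Critic: one scalar value estimate per token position.
        values = self.value_model(input_ids, attention_mask=attention_mask)
        return logits, values
```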
Separate Reward Model
The reward model is a distinct AutoModelForSequenceClassification that provides the external reward signal. It is loaded from the trained reward model checkpoint (produced by the reward training workflow) and remains frozen during PPO training. Key properties:
- Frozen weights: The reward model is not updated during PPO training.
- Scalar output: Produces a single reward value per sequence (num_labels=1).
- Full sequence scoring: Evaluates the complete prompt+response to assign a quality score.
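The full-sequence-scoring point can be illustrated with a plain-Python sketch: given per-token scores and an attention mask, the sequence reward is taken at the last non-padding position. This is a simplification of what TRL does internally, not its actual implementation:

```python
def sequence_reward(token_scores, attention_mask):
    """Return the score at the last non-padding position.

    token_scores: per-token scalar scores for one prompt+response sequence.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    last_real = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_scores[last_real]
```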
Reference Policy for KL Constraint
The reference policy is a frozen copy of the initial SFT model that serves as a baseline for the KL divergence penalty:
KL_penalty = -kl_coef * KL(policy || ref_policy)
This penalty prevents the policy from deviating too far from the reference, maintaining coherent language generation while optimizing for reward. Without this constraint, the policy tends to exploit reward model weaknesses (reward hacking).
When PEFT/LoRA is used, the reference policy is implemented implicitly by disabling the LoRA adapter rather than loading a separate model, saving significant memory.
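A plain-Python sketch of how this penalty is applied per token, using the standard sampled-token estimate KL ≈ log π − log π_ref. The kl_coef default here is illustrative, not TRL's:

```python
def kl_penalty_per_token(logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token KL penalty folded into the reward signal.

    Uses the sampled-token estimate KL ~= logprob - ref_logprob, so the
    penalty at each position is -kl_coef * (logprob - ref_logprob).
    """
    return [-kl_coef * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
```

Positions where the policy assigns higher log-probability than the reference receive a negative penalty, pulling the policy back toward the reference distribution.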
Memory Considerations
Loading four full models simultaneously requires substantial GPU memory:
| Model | Type | Trainable | Purpose |
|---|---|---|---|
| Policy | AutoModelForCausalLM | Yes | Generates responses, parameters updated by PPO |
| Reference Policy | AutoModelForCausalLM | No | Frozen baseline for KL divergence computation |
| Reward Model | AutoModelForSequenceClassification | No | Scores generated responses |
| Value Model | AutoModelForSequenceClassification | Yes | Estimates state values for advantage computation |
Memory optimization strategies include:
- PEFT/LoRA: Trains only adapter parameters for the policy; eliminates the need for a separate reference model.
- DeepSpeed ZeRO-3: Shards model parameters across GPUs.
- Quantization: Loads models in 4-bit or 8-bit precision.
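A rough back-of-the-envelope sketch of the weight-only footprint. This ignores optimizer states, gradients, activations, and KV caches, which add substantially more for the two trainable models; the 7B figure is illustrative:

```python
GiB = 1024 ** 3


def model_memory_gib(n_params_billion, bytes_per_param=2):
    """Weight-only memory for one model; 2 bytes/param = fp16/bf16."""
    return n_params_billion * 1e9 * bytes_per_param / GiB


# Illustrative: four 7B-parameter models in bf16, weights alone.
total_gib = 4 * model_memory_gib(7)  # ~52 GiB before optimizer states etc.
```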