Principle: Huggingface TRL PPO Multi Model Loading
| Property | Value |
|---|---|
| Principle Name | PPO Multi Model Loading |
| Technology | Huggingface TRL |
| Category | Model Architecture |
| Workflow | PPO RLHF Training |
| Implementation | Implementation:Huggingface_Trl_PPO_Model_Loading_Pattern |
Overview
Description
PPO-based RLHF training requires loading four distinct neural network models that work together during training. The policy model generates responses and is the model being optimized. The reference policy provides a frozen baseline for computing the KL divergence penalty. The reward model scores generated responses to provide the training signal. The value model estimates the expected future reward at each token position for advantage computation.
This multi-model architecture makes PPO RLHF significantly more memory-intensive than supervised fine-tuning, requiring careful memory management through techniques like DeepSpeed ZeRO, PEFT/LoRA, and quantization.
Usage
All four models are loaded explicitly in the PPO training script and passed to PPOTrainer. The policy and reference models are loaded from the SFT model path, while the reward and value models are loaded from the reward model path.
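The loading step described above can be sketched as follows. The helper name `load_ppo_models` and the paths are hypothetical; the auto-classes are the real transformers entry points:

```python
def load_ppo_models(sft_path: str, rm_path: str):
    """Load the four models PPO needs (sketch; paths are placeholders)."""
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSequenceClassification,
    )

    # Policy (trainable) and frozen reference both start from the SFT checkpoint.
    policy = AutoModelForCausalLM.from_pretrained(sft_path)
    ref_policy = AutoModelForCausalLM.from_pretrained(sft_path)

    # Reward (frozen) and value (trainable) models start from the reward
    # checkpoint; num_labels=1 gives a single scalar head.
    reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)
    value_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)

    return policy, ref_policy, reward_model, value_model
```

In recent TRL releases these are passed to PPOTrainer via its model, ref_model, reward_model, and value_model arguments; verify against the installed version, since the PPO API has changed across releases.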
Theoretical Basis
Actor-Critic Architecture
PPO implements an actor-critic reinforcement learning framework:
- Actor (Policy): A causal language model (AutoModelForCausalLM) that generates text responses. This model's parameters are updated through the PPO objective.
- Critic (Value Model): A sequence classifier (AutoModelForSequenceClassification with num_labels=1) that estimates the value function V(s) at each state (token position). It predicts the expected cumulative reward from that position onward.
The actor and critic are combined into a single PolicyAndValueWrapper that performs joint forward passes, sharing the input processing while producing both text generation logits and value predictions.
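A minimal sketch of such a wrapper, assuming PyTorch modules. TRL's actual PolicyAndValueWrapper additionally reuses the value model's backbone hidden states rather than running two independent full forward passes; this simplified version only shows the joint-output structure:

```python
import torch.nn as nn


class PolicyAndValueWrapper(nn.Module):
    """Simplified sketch of a joint actor-critic wrapper."""

    def __init__(self, policy: nn.Module, value_model: nn.Module):
        super().__init__()
        self.policy = policy
        self.value_model = value_model

    def forward(self, input_ids, attention_mask=None):
        # Actor: per-token vocabulary logits for generation / logprobs.
        logits = self.policy(input_ids, attention_mask=attention_mask)
        # Critic: one scalar value estimate per token position.
        values = self.value_model(input_ids, attention_mask=attention_mask)
        return logits, values
```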
Separate Reward Model
The reward model is a distinct AutoModelForSequenceClassification that provides the external reward signal. It is loaded from the trained reward model checkpoint (produced by the reward training workflow) and remains frozen during PPO training. Key properties:
- Frozen weights: The reward model is not updated during PPO training.
- Scalar output: Produces a single reward value per sequence (num_labels=1).
- Full sequence scoring: Evaluates the complete prompt+response to assign a quality score.
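The full-sequence-scoring point can be illustrated with a plain-Python sketch: given per-token scores and an attention mask, the sequence reward is taken at the last non-padding position. This is a simplification of what TRL does internally, not its actual implementation:

```python
def sequence_reward(token_scores, attention_mask):
    """Return the score at the last non-padding position.

    token_scores: per-token scalar scores for one prompt+response sequence.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    last_real = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_scores[last_real]
```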
Reference Policy for KL Constraint
The reference policy is a frozen copy of the initial SFT model that serves as a baseline for the KL divergence penalty:
KL_penalty = -kl_coef * KL(policy || ref_policy)
This penalty prevents the policy from deviating too far from the reference, maintaining coherent language generation while optimizing for reward. Without this constraint, the policy tends to exploit reward model weaknesses (reward hacking).
When PEFT/LoRA is used, the reference policy is implemented implicitly by disabling the LoRA adapter rather than loading a separate model, saving significant memory.
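A plain-Python sketch of how this penalty is applied per token, using the standard sampled-token estimate KL ≈ log π − log π_ref. The kl_coef default here is illustrative, not TRL's:

```python
def kl_penalty_per_token(logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token KL penalty folded into the reward signal.

    Uses the sampled-token estimate KL ~= logprob - ref_logprob, so the
    penalty at each position is -kl_coef * (logprob - ref_logprob).
    """
    return [-kl_coef * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
```

Positions where the policy assigns higher log-probability than the reference receive a negative penalty, pulling the policy back toward the reference distribution.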
Memory Considerations
Loading four full models simultaneously requires substantial GPU memory:
| Model | Type | Trainable | Purpose |
|---|---|---|---|
| Policy | AutoModelForCausalLM | Yes | Generates responses, parameters updated by PPO |
| Reference Policy | AutoModelForCausalLM | No | Frozen baseline for KL divergence computation |
| Reward Model | AutoModelForSequenceClassification | No | Scores generated responses |
| Value Model | AutoModelForSequenceClassification | Yes | Estimates state values for advantage computation |
Memory optimization strategies include:
- PEFT/LoRA: Trains only adapter parameters for the policy; eliminates the need for a separate reference model.
- DeepSpeed ZeRO-3: Shards model parameters across GPUs.
- Quantization: Loads models in 4-bit or 8-bit precision.
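A rough back-of-the-envelope sketch of the weight-only footprint. This ignores optimizer states, gradients, activations, and KV caches, which add substantially more for the two trainable models; the 7B figure is illustrative:

```python
GiB = 1024 ** 3


def model_memory_gib(n_params_billion, bytes_per_param=2):
    """Weight-only memory for one model; 2 bytes/param = fp16/bf16."""
    return n_params_billion * 1e9 * bytes_per_param / GiB


# Illustrative: four 7B-parameter models in bf16, weights alone.
total_gib = 4 * model_memory_gib(7)  # ~52 GiB before optimizer states etc.
```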