Principle:Huggingface Trl PPO Argument Configuration
| Property | Value |
|---|---|
| Principle Name | PPO Argument Configuration |
| Technology | Huggingface TRL |
| Category | Configuration |
| Workflow | PPO RLHF Training |
| Paper | PPO (https://arxiv.org/abs/1707.06347) |
| Implementation | Implementation:Huggingface_Trl_HfArgumentParser_PPOConfig |
Overview
Description
The PPOConfig dataclass provides all hyperparameters for Proximal Policy Optimization (PPO) training in the RLHF pipeline. It extends transformers.TrainingArguments with PPO-specific parameters including the clipped surrogate objective settings, Generalized Advantage Estimation (GAE) parameters, KL divergence constraints, value function coefficients, and generation settings for online rollouts.
PPO training is fundamentally different from supervised training because the training data is generated online by the policy model itself. This requires additional configuration for response generation (temperature, response length, stop tokens), model paths (SFT policy and reward model), and the complex batch size hierarchy (micro, local, global, mini).
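As an illustration, a PPOConfig can be constructed directly with these PPO-specific fields alongside the fields inherited from TrainingArguments. The sketch below uses field names from recent TRL releases and purely illustrative values; exact names and defaults may differ between versions.
```python
from trl import PPOConfig

# Minimal sketch; values are illustrative, field names follow recent TRL releases.
config = PPOConfig(
    output_dir="ppo-model",
    # Model paths produced by earlier RLHF stages
    sft_model_path="EleutherAI/pythia-1b-deduped",
    reward_model_path="EleutherAI/pythia-1b-deduped",
    # Online rollout / generation settings
    response_length=53,
    temperature=0.7,
    stop_token="eos",
    # Core PPO hyperparameters
    num_ppo_epochs=4,
    kl_coef=0.05,
    cliprange=0.2,
    vf_coef=0.1,
)
```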
Usage
PPOConfig is instantiated either directly or parsed from command-line arguments using HfArgumentParser. It is passed as the args parameter to PPOTrainer.
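A typical script-level sketch is shown below, assuming the policy, reference, reward, and value models plus a tokenized prompt dataset have been prepared elsewhere; the lowercase names are placeholders, not part of the TRL API.
```python
from transformers import HfArgumentParser
from trl import PPOConfig, PPOTrainer

# Parse PPOConfig from command-line flags, e.g.
#   python train.py --output_dir ppo-model --kl_coef 0.05 --cliprange 0.2
parser = HfArgumentParser(PPOConfig)
(training_args,) = parser.parse_args_into_dataclasses()

# `tokenizer`, `policy`, `ref_policy`, `reward_model`, `value_model`, and
# `train_dataset` are placeholders assumed to be loaded elsewhere.
trainer = PPOTrainer(
    args=training_args,          # the parsed PPOConfig
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=train_dataset,
)
trainer.train()
```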
Theoretical Basis
PPO Hyperparameters
The Proximal Policy Optimization algorithm uses a clipped surrogate objective to keep policy updates stable. Key hyperparameters, illustrated in the loss sketch after this list, include:
- num_ppo_epochs (default: 4): Number of optimization passes over each batch of rollout data. Multiple epochs extract more signal from each expensive rollout but risk overfitting to stale advantages.
- cliprange (default: 0.2): The clipping parameter epsilon for the surrogate objective. The policy ratio is clipped to [1-epsilon, 1+epsilon], preventing excessively large updates. The standard value of 0.2 balances exploration and stability.
- vf_coef (default: 0.1): Coefficient for the value function loss in the total loss. A lower coefficient (compared to the policy loss) prevents the value function from dominating gradient updates.
- cliprange_value (default: 0.2): Clipping range for the value function predictions, analogous to the policy clip range. Prevents value function estimates from changing too rapidly.
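A minimal sketch of how cliprange, cliprange_value, and vf_coef enter the loss is given below. This is not TRL's exact implementation, which additionally applies padding masks, advantage whitening, and logging bookkeeping.
```python
import torch

def ppo_losses(new_logprobs, old_logprobs, advantages,
               values, old_values, returns,
               cliprange=0.2, cliprange_value=0.2, vf_coef=0.1):
    """Schematic clipped PPO objective over flattened token tensors."""
    # Policy loss: clip the probability ratio to [1 - eps, 1 + eps]
    ratio = torch.exp(new_logprobs - old_logprobs)
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss: clip value predictions around the old estimates
    values_clipped = torch.clamp(values, old_values - cliprange_value, old_values + cliprange_value)
    vf_loss = 0.5 * torch.max((values - returns) ** 2, (values_clipped - returns) ** 2).mean()

    # Total loss: the value loss is scaled down by vf_coef
    return pg_loss + vf_coef * vf_loss
```
During training, num_ppo_epochs controls how many times this loss is optimized over the same batch of rollouts before fresh responses are generated.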
GAE Lambda
Generalized Advantage Estimation parameters control the bias-variance tradeoff in advantage computation (a sketch of the recursion follows this list):
- gamma (default: 1.0): Discount factor for future rewards. Setting gamma=1.0 means no discounting, treating all future rewards equally, which is appropriate for episodic text generation tasks.
- lam (default: 0.95): GAE lambda parameter. Higher values (closer to 1.0) produce lower-bias but higher-variance advantage estimates. The default 0.95 provides a good balance for language model RLHF.
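A schematic, non-vectorized version of the GAE recursion these parameters control, assuming per-token rewards and value estimates for a single response:
```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Schematic GAE for one response; `rewards` and `values` are 1-D tensors of equal length."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        # Bootstrap with 0 after the final token of the episode
        next_value = values[t + 1] if t + 1 < T else 0.0
        # One-step TD residual
        delta = rewards[t] + gamma * next_value - values[t]
        # lambda-weighted sum of residuals: lam closer to 1 lowers bias, raises variance
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```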
KL Coefficient
- kl_coef (default: 0.05): Coefficient for the KL divergence penalty added to the reward signal. This penalizes the policy for deviating too far from the reference policy, preventing reward hacking and catastrophic forgetting. The KL penalty is computed per-token and added to the reward before advantage computation.
- kl_estimator (default: "k1"): Which KL divergence estimator to use. "k1" is the plain log-ratio estimator, unbiased but relatively high variance; "k3" is also unbiased and has lower variance. Both are sketched below.
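The sketch below shows the two estimators as a per-token penalty on the reward, assuming per-token log-probabilities of the sampled tokens under the policy and the frozen reference model; it mirrors the usual k1/k3 formulas rather than quoting TRL's source.
```python
import torch

def per_token_kl_penalty(logprobs, ref_logprobs, kl_coef=0.05, kl_estimator="k1"):
    """Schematic per-token KL penalty added to the reward before GAE."""
    logr = ref_logprobs - logprobs      # log(pi_ref / pi) for the sampled tokens
    if kl_estimator == "k1":
        kl = -logr                      # k1: plain log-ratio, unbiased, higher variance
    else:
        kl = logr.exp() - 1 - logr      # k3: (r - 1) - log r, unbiased, lower variance
    # The (negative) penalty enters the per-token reward before advantages are computed
    return -kl_coef * kl
```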
Batch Size Hierarchy
PPO training uses a multi-level batch size hierarchy; a worked example follows the table:
| Level | Formula | Description |
|---|---|---|
| micro_batch_size | per_device_train_batch_size * world_size | Batch across all GPUs in one forward pass |
| local_batch_size | per_device_train_batch_size * gradient_accumulation_steps | Batch per GPU including accumulation |
| batch_size | local_batch_size * world_size | Total batch across all GPUs and accumulation steps |
| mini_batch_size | batch_size / num_mini_batches | Size of each PPO optimization minibatch |
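As a worked example of the formulas above (the values are made up, and TRL's internal bookkeeping may differ slightly between versions):
```python
# Illustrative arithmetic following the table above
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
world_size = 2        # number of GPUs
num_mini_batches = 1

micro_batch_size = per_device_train_batch_size * world_size                   # 8
local_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 32
batch_size = local_batch_size * world_size                                    # 64
mini_batch_size = batch_size // num_mini_batches                              # 64
```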