
Principle:Huggingface Trl PPO Training Loop

From Leeroopedia


Property Value
Principle Name: PPO Training Loop
Technology: Huggingface TRL
Category: Training Algorithm
Workflow: PPO RLHF Training
Papers: PPO (https://arxiv.org/abs/1707.06347), GAE (https://arxiv.org/abs/1506.02438)
Implementation: Implementation:Huggingface_Trl_PPOTrainer_Train

Overview

Description

The PPO training loop implements the complete Proximal Policy Optimization algorithm for RLHF. Each iteration consists of four phases: (1) rollout generation where the policy produces responses, (2) reward computation with KL penalties, (3) advantage estimation using GAE, and (4) multiple epochs of policy and value function optimization with clipped objectives.

This is the most computationally intensive part of the RLHF pipeline, requiring careful memory management with explicit cache clearing between phases.

Usage

The training loop is invoked by calling trainer.train() on an initialized PPOTrainer instance. The loop runs for num_total_batches iterations, with each iteration processing a full batch of prompts through the complete PPO pipeline.

Theoretical Basis

Phase 1: Rollout Generation

In the rollout phase, the policy model generates responses for a batch of prompts. Generation uses sampling (not greedy) with the configured temperature and top-k/top-p settings. Key steps:

  • Batch generation: Queries are processed in sub-batches of size local_rollout_forward_batch_size to manage memory.
  • Log-probability computation: For each generated token, the log-probability under the current policy and reference policy is computed.
  • Response truncation: Responses are truncated at the first occurrence of the stop token, with remaining positions filled with pad tokens.
  • Reward scoring: The reward model scores the truncated responses.
  • Value estimation: The value model estimates state values at each token position.

Phase 2: KL-Penalized Rewards

The raw reward from the reward model is augmented with a per-token KL divergence penalty:

reward_total[t] = -kl_coef * KL[t]          (for all tokens t)
reward_total[last_token] += score           (add the sequence-level reward at the final position)

The KL divergence is computed per token using the selected estimator:

  • k1 estimator: KL[t] = -logr = policy_logprob - ref_logprob, where logr = ref_logprob - policy_logprob (unbiased, but can be negative and has higher variance)
  • k3 estimator: KL[t] = (exp(logr) - 1) - logr (always non-negative, lower variance)

An optional missing_eos_penalty is applied to responses that fail to generate an EOS token, encouraging the model to produce complete responses.
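The reward assembly described in this phase can be sketched as follows. This is a minimal pure-Python illustration; the function names, the list-based "tensors", and the treatment of missing_eos_penalty as a subtraction from the score are assumptions, not TRL's exact internals:

```python
import math

def per_token_kl(policy_logprob, ref_logprob, estimator="k1"):
    # logr = log(ref_prob / policy_prob)
    logr = ref_logprob - policy_logprob
    if estimator == "k1":
        return -logr                        # unbiased, higher variance
    return (math.exp(logr) - 1.0) - logr    # k3: non-negative, lower variance

def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, kl_coef,
                         estimator="k1", missing_eos_penalty=None, has_eos=True):
    """Per-token rewards: -kl_coef * KL at every token, plus the
    sequence-level score at the final position. (Illustrative sketch.)"""
    if missing_eos_penalty is not None and not has_eos:
        score -= missing_eos_penalty        # assumed handling of the EOS penalty
    rewards = [-kl_coef * per_token_kl(lp, rlp, estimator)
               for lp, rlp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score                    # sequence-level reward at the last token
    return rewards
```

For example, with kl_coef = 0.1 and a single token where the policy assigns a lower log-probability than the reference, the k1 term yields a small positive per-token reward, and the score is added only at the final position.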

Phase 3: GAE Advantage Estimation

Generalized Advantage Estimation computes the advantage function using a recursive formula:

delta[t] = reward[t] + gamma * V[t+1] - V[t]
A[t] = delta[t] + gamma * lambda * A[t+1]

where:

  • delta[t] is the temporal difference error at step t.
  • gamma is the discount factor (default 1.0 for text generation).
  • lambda is the GAE parameter (default 0.95) controlling the bias-variance tradeoff.

The returns (targets for the value function) are computed as:

returns = advantages + values

Advantages are whitened (normalized to zero mean and unit variance) across the non-padded positions to stabilize training.
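The backward recursion, returns, and whitening can be sketched together. This list-based version is illustrative only (TRL operates on batched tensors and masks padded positions); V beyond the last token is taken as 0, an assumption consistent with the recursion above:

```python
import math

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages by the backward recursion
    delta[t] = r[t] + gamma*V[t+1] - V[t];  A[t] = delta[t] + gamma*lam*A[t+1]."""
    T = len(rewards)
    advantages = [0.0] * T
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # V past the end = 0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]  # value targets
    return advantages, returns

def whiten(xs, eps=1e-8):
    # Normalize to zero mean and unit variance to stabilize training.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]
```

With gamma = 1 and lambda = 1 this reduces to Monte-Carlo advantages (sum of future rewards minus the current value), which makes the bias-variance tradeoff controlled by lambda easy to see.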

Phase 4: Clipped Policy and Value Optimization

For each batch of rollout data, num_ppo_epochs optimization passes are performed. In each pass, the data is shuffled and split into minibatches. For each minibatch:

Clipped Policy Loss:

ratio = exp(new_logprob - old_logprob)
pg_loss1 = -advantage * ratio
pg_loss2 = -advantage * clip(ratio, 1 - epsilon, 1 + epsilon)
policy_loss = max(pg_loss1, pg_loss2)

The clipping prevents the policy from moving too far from its behavior during rollout generation.
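A per-token sketch of the clipped surrogate (illustrative; TRL computes this over masked batched tensors, and epsilon corresponds to its clip-range hyperparameter):

```python
import math

def clipped_policy_loss(new_logprob, old_logprob, advantage, epsilon=0.2):
    """PPO clipped surrogate for a single token.

    Taking the max of the two negated surrogates is the pessimistic
    (clipped) objective: the policy gains nothing from pushing the
    ratio outside [1 - epsilon, 1 + epsilon]."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    pg_loss1 = -advantage * ratio
    pg_loss2 = -advantage * clipped_ratio
    return max(pg_loss1, pg_loss2)
```

When the new policy equals the old one the ratio is 1 and the loss is simply -advantage; once the ratio exceeds 1 + epsilon with a positive advantage, the clipped branch dominates and the gradient through the ratio vanishes.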

Clipped Value Loss:

vpred_clipped = clip(vpred, old_values - epsilon_v, old_values + epsilon_v)
vf_loss = 0.5 * max((vpred - returns)^2, (vpred_clipped - returns)^2)

Total Loss:

total_loss = policy_loss + vf_coef * vf_loss
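The value-function side and the combined loss can be sketched the same way (per-position and illustrative; epsilon_v stands in for TRL's value clip-range, and vf_coef for its value-loss coefficient):

```python
def clipped_value_loss(vpred, old_value, ret, epsilon_v=0.2):
    """Clipped value loss for a single position: the new prediction is
    clipped to stay within epsilon_v of the rollout-time value, and the
    worse (max) of the two squared errors is penalized."""
    vpred_clipped = max(old_value - epsilon_v, min(vpred, old_value + epsilon_v))
    vf_loss1 = (vpred - ret) ** 2
    vf_loss2 = (vpred_clipped - ret) ** 2
    return 0.5 * max(vf_loss1, vf_loss2)

def total_loss(policy_loss, vf_loss, vf_coef=0.1):
    # Combined objective minimized at each minibatch step.
    return policy_loss + vf_coef * vf_loss
```

Clipping the value prediction mirrors the policy-side clipping: the value function is also discouraged from moving too far from its rollout-time estimates within a single batch of data.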

Training Metrics

Category   Metric            Description
Objective  kl                Mean KL divergence from the reference policy
Objective  entropy           Mean entropy of the policy distribution
Objective  non_score_reward  Mean per-token KL penalty
Objective  rlhf_reward       Combined reward (non_score_reward + scores)
Objective  scores            Mean reward-model scores
Policy     approxkl_avg      Approximate KL between old and new policy (within PPO epochs)
Policy     clipfrac_avg      Fraction of policy ratios clipped
Loss       policy_avg        Mean policy-gradient loss
Loss       value_avg         Mean value-function loss
Value      clipfrac_avg      Fraction of value predictions clipped
Value      ratio             Mean importance-sampling ratio
Value      num_eos_tokens    Number of responses containing an EOS token
