Implementation:Huggingface Trl HfArgumentParser PPOConfig
Appearance
| Property | Value |
|---|---|
| Implementation Name | HfArgumentParser PPOConfig |
| Technology | Huggingface TRL |
| Type | API Doc |
| Workflow | PPO RLHF Training |
| Paper | PPO (https://arxiv.org/abs/1707.06347) |
| Principle | Principle:Huggingface_Trl_PPO_Argument_Configuration |
Overview
Description
The PPOConfig dataclass defines all hyperparameters for PPO-based RLHF training. It extends transformers.TrainingArguments with PPO-specific fields including clipping ranges, GAE parameters, KL coefficients, generation settings, and model paths for the SFT policy and reward model. The configuration is parsed from command-line arguments using HfArgumentParser.
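Among the fields above, `gamma` and `lam` parameterize Generalized Advantage Estimation (GAE). As an illustration only (toy numbers, not the TRL implementation), a minimal sketch of how GAE folds per-token rewards and value estimates into advantages:

```python
# Sketch (not TRL source): how the `gamma` and `lam` config fields drive
# Generalized Advantage Estimation over a rollout.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages; `values` carries one extra bootstrap entry."""
    advantages = []
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual for step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, controlled by lam
        last_adv = delta + gamma * lam * last_adv
        advantages.append(last_adv)
    return advantages[::-1]

rewards = [0.0, 0.0, 1.0]        # toy rollout: reward only at the final token
values = [0.2, 0.4, 0.6, 0.0]    # toy value estimates plus a 0.0 bootstrap
advs = gae_advantages(rewards, values, gamma=1.0, lam=0.95)
```

With `lam=0.95` (the config default), advantages at earlier tokens blend TD residuals from all later tokens; `lam=0` would reduce to one-step TD errors.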
Usage
PPOConfig is typically parsed from command-line arguments in the PPO training script and passed as the args parameter to PPOTrainer.
Code Reference
Source Location
- PPOConfig: trl/experimental/ppo/ppo_config.py, lines 22-305
- Script usage: examples/scripts/ppo/ppo.py, lines 77-78
Signature
@dataclass
class PPOConfig(TrainingArguments):
# Overridden TrainingArguments defaults
logging_steps: float = 10
gradient_checkpointing: bool = True
bf16: bool | None = None
# Dataset parameters
dataset_num_proc: int | None = None
# Batch structure parameters
num_mini_batches: int = 1
total_episodes: int | None = None
local_rollout_forward_batch_size: int = 64
# Generation parameters
num_sample_generations: int = 10
response_length: int = 53
stop_token: Literal["eos"] | None = None
stop_token_id: int | None = None
temperature: float = 0.7
missing_eos_penalty: float | None = None
# Model paths
sft_model_path: str = "EleutherAI/pythia-160m"
reward_model_path: str = "EleutherAI/pythia-160m"
# PEFT adapter names
model_adapter_name: str | None = None
ref_adapter_name: str | None = None
# PPO algorithm parameters
num_ppo_epochs: int = 4
whiten_rewards: bool = False
kl_coef: float = 0.05
kl_estimator: Literal["k1", "k3"] = "k1"
cliprange: float = 0.2
vf_coef: float = 0.1
cliprange_value: float = 0.2
gamma: float = 1.0
lam: float = 0.95
# DeepSpeed settings
ds3_gather_for_generation: bool = True
# Computed fields (set during __init__ of PPOTrainer)
world_size: int | None = None
num_total_batches: int | None = None
micro_batch_size: int | None = None
local_batch_size: int | None = None
batch_size: int | None = None
local_mini_batch_size: int | None = None
mini_batch_size: int | None = None
# Hub
push_to_hub: bool = False
Import
from trl.experimental.ppo import PPOConfig
from transformers import HfArgumentParser
I/O Contract
Inputs (Key Parameters)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_ppo_epochs | int | 4 | Number of PPO optimization epochs per batch of rollout data |
| kl_coef | float | 0.05 | KL divergence penalty coefficient |
| kl_estimator | str | "k1" | KL estimator variant: "k1" (unbiased) or "k3" (lower variance) |
| cliprange | float | 0.2 | Policy ratio clipping range [1-eps, 1+eps] |
| vf_coef | float | 0.1 | Value function loss coefficient |
| cliprange_value | float | 0.2 | Value prediction clipping range |
| gamma | float | 1.0 | Discount factor for future rewards |
| lam | float | 0.95 | GAE lambda parameter |
| response_length | int | 53 | Maximum length of generated responses |
| temperature | float | 0.7 | Sampling temperature for response generation |
| sft_model_path | str | "EleutherAI/pythia-160m" | Path to the supervised fine-tuned model |
| reward_model_path | str | "EleutherAI/pythia-160m" | Path to the trained reward model |
| total_episodes | int or None | None | Total training episodes; computed from num_train_epochs if None |
| missing_eos_penalty | float or None | None | Penalty for responses without EOS tokens |
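To make the roles of `cliprange`, `kl_coef`, and `kl_estimator` concrete, here is a per-token sketch of the standard PPO clipped objective and the two KL estimator variants. This is an illustrative reconstruction of the textbook formulas, not the TRL source:

```python
import math

# Sketch (not TRL source) of how cliprange, kl_coef, and kl_estimator
# enter the per-token PPO objective.

def k1_kl(logp, ref_logp):
    # "k1": plain log-ratio; unbiased but higher variance
    return logp - ref_logp

def k3_kl(logp, ref_logp):
    # "k3": (r - 1) - log r with r = ref_prob / prob; biased low-variance,
    # always non-negative
    log_r = ref_logp - logp
    return (math.exp(log_r) - 1.0) - log_r

def clipped_policy_loss(logp_new, logp_old, advantage, cliprange=0.2):
    # Standard PPO surrogate: ratio is clipped to [1 - eps, 1 + eps]
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + cliprange), 1.0 - cliprange) * advantage
    return -min(unclipped, clipped)

# The KL term is scaled by kl_coef (default 0.05) and subtracted from the
# reward-model score before advantages are computed.
kl_penalty = 0.05 * k1_kl(-1.0, -1.2)
```

The clipping keeps the policy update small even when the probability ratio moves far from 1; the k3 estimator trades a small bias for much lower variance than k1.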
Outputs
| Output | Type | Description |
|---|---|---|
| PPOConfig instance | PPOConfig | Fully configured PPO training arguments |
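The signature above also lists computed fields (`batch_size`, `mini_batch_size`, etc.) that are left as `None` and filled in by PPOTrainer. The sketch below shows one plausible set of relationships between the user-set and derived values; the names mirror the config fields, but the exact formulas live in the TRL trainer and may differ:

```python
# Sketch (assumption, not TRL source): plausible derivation of the
# computed batch-size fields from the user-set configuration values.
def derive_batch_sizes(per_device_train_batch_size,
                       gradient_accumulation_steps,
                       num_mini_batches,
                       world_size):
    # Rollout batch gathered on one process before the PPO update
    local_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_mini_batches)
    # Global rollout batch across all processes
    batch_size = local_batch_size * world_size
    # One optimizer step's worth of samples across all processes
    micro_batch_size = per_device_train_batch_size * world_size
    # Each PPO epoch splits the batch into num_mini_batches chunks
    mini_batch_size = batch_size // num_mini_batches
    local_mini_batch_size = local_batch_size // num_mini_batches
    return {
        "local_batch_size": local_batch_size,
        "batch_size": batch_size,
        "micro_batch_size": micro_batch_size,
        "mini_batch_size": mini_batch_size,
        "local_mini_batch_size": local_mini_batch_size,
    }

sizes = derive_batch_sizes(per_device_train_batch_size=64,
                           gradient_accumulation_steps=1,
                           num_mini_batches=1,
                           world_size=8)
```

With the defaults (`num_mini_batches=1`), each PPO epoch runs over the whole rollout batch as a single mini-batch.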
Usage Examples
Command-Line Parsing
from transformers import HfArgumentParser
from trl import ScriptArguments, ModelConfig
from trl.experimental.ppo import PPOConfig
parser = HfArgumentParser((ScriptArguments, PPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_into_dataclasses()
Direct Instantiation
from trl.experimental.ppo import PPOConfig
config = PPOConfig(
output_dir="ppo-output",
per_device_train_batch_size=64,
gradient_accumulation_steps=1,
total_episodes=10000,
learning_rate=3e-6,
num_ppo_epochs=4,
kl_coef=0.05,
cliprange=0.2,
vf_coef=0.1,
gamma=1.0,
lam=0.95,
response_length=53,
temperature=0.7,
sft_model_path="EleutherAI/pythia-1b-deduped",
reward_model_path="my-reward-model",
missing_eos_penalty=1.0,
)
Command-Line Invocation
python examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--output_dir ppo-output \
--per_device_train_batch_size 64 \
--total_episodes 10000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path my-reward-model \
--missing_eos_penalty 1.0
Related Pages
- Principle:Huggingface_Trl_PPO_Argument_Configuration