Implementation:Huggingface Trl HfArgumentParser PPOConfig
Appearance
| Property | Value |
|---|---|
| Implementation Name | HfArgumentParser PPOConfig |
| Technology | Huggingface TRL |
| Type | API Doc |
| Workflow | PPO RLHF Training |
| Paper | PPO (https://arxiv.org/abs/1707.06347) |
| Principle | Principle:Huggingface_Trl_PPO_Argument_Configuration |
Overview
Description
The PPOConfig dataclass defines all hyperparameters for PPO-based RLHF training. It extends transformers.TrainingArguments with PPO-specific fields including clipping ranges, GAE parameters, KL coefficients, generation settings, and model paths for the SFT policy and reward model. The configuration is parsed from command-line arguments using HfArgumentParser.
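Among the fields above, `gamma` and `lam` parameterize Generalized Advantage Estimation (GAE). As an illustration only (toy numbers, not the TRL implementation), a minimal sketch of how GAE folds per-token rewards and value estimates into advantages:

```python
# Sketch (not TRL source): how the `gamma` and `lam` config fields drive
# Generalized Advantage Estimation over a rollout.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages; `values` carries one extra bootstrap entry."""
    advantages = []
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual for step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, controlled by lam
        last_adv = delta + gamma * lam * last_adv
        advantages.append(last_adv)
    return advantages[::-1]

rewards = [0.0, 0.0, 1.0]        # toy rollout: reward only at the final token
values = [0.2, 0.4, 0.6, 0.0]    # toy value estimates plus a 0.0 bootstrap
advs = gae_advantages(rewards, values, gamma=1.0, lam=0.95)
```

With `lam=0.95` (the config default), advantages at earlier tokens blend TD residuals from all later tokens; `lam=0` would reduce to one-step TD errors.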
Usage
PPOConfig is typically parsed from command-line arguments in the PPO training script and passed as the args parameter to PPOTrainer.
Code Reference
Source Location
- PPOConfig: trl/experimental/ppo/ppo_config.py, lines 22-305
- Script usage: examples/scripts/ppo/ppo.py, lines 77-78
Signature
@dataclass
class PPOConfig(TrainingArguments):
# Overridden TrainingArguments defaults
logging_steps: float = 10
gradient_checkpointing: bool = True
bf16: bool | None = None
# Dataset parameters
dataset_num_proc: int | None = None
# Batch structure parameters
num_mini_batches: int = 1
total_episodes: int | None = None
local_rollout_forward_batch_size: int = 64
# Generation parameters
num_sample_generations: int = 10
response_length: int = 53
stop_token: Literal["eos"] | None = None
stop_token_id: int | None = None
temperature: float = 0.7
missing_eos_penalty: float | None = None
# Model paths
sft_model_path: str = "EleutherAI/pythia-160m"
reward_model_path: str = "EleutherAI/pythia-160m"
# PEFT adapter names
model_adapter_name: str | None = None
ref_adapter_name: str | None = None
# PPO algorithm parameters
num_ppo_epochs: int = 4
whiten_rewards: bool = False
kl_coef: float = 0.05
kl_estimator: Literal["k1", "k3"] = "k1"
cliprange: float = 0.2
vf_coef: float = 0.1
cliprange_value: float = 0.2
gamma: float = 1.0
lam: float = 0.95
# DeepSpeed settings
ds3_gather_for_generation: bool = True
# Computed fields (set during __init__ of PPOTrainer)
world_size: int | None = None
num_total_batches: int | None = None
micro_batch_size: int | None = None
local_batch_size: int | None = None
batch_size: int | None = None
local_mini_batch_size: int | None = None
mini_batch_size: int | None = None
# Hub
push_to_hub: bool = False
Import
from trl.experimental.ppo import PPOConfig
from transformers import HfArgumentParser
I/O Contract
Inputs (Key Parameters)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_ppo_epochs | int | 4 | Number of PPO optimization epochs per batch of rollout data |
| kl_coef | float | 0.05 | KL divergence penalty coefficient |
| kl_estimator | str | "k1" | KL estimator variant: "k1" (unbiased) or "k3" (lower variance) |
| cliprange | float | 0.2 | Policy ratio clipping range [1-eps, 1+eps] |
| vf_coef | float | 0.1 | Value function loss coefficient |
| cliprange_value | float | 0.2 | Value prediction clipping range |
| gamma | float | 1.0 | Discount factor for future rewards |
| lam | float | 0.95 | GAE lambda parameter |
| response_length | int | 53 | Maximum length of generated responses |
| temperature | float | 0.7 | Sampling temperature for response generation |
| sft_model_path | str | "EleutherAI/pythia-160m" | Path to the supervised fine-tuned model |
| reward_model_path | str | "EleutherAI/pythia-160m" | Path to the trained reward model |
| total_episodes | int or None | None | Total training episodes; computed from num_train_epochs if None |
| missing_eos_penalty | float or None | None | Penalty for responses without EOS tokens |
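To make the roles of `cliprange`, `kl_coef`, and `kl_estimator` concrete, here is a per-token sketch of the standard PPO clipped objective and the two KL estimator variants. This is an illustrative reconstruction of the textbook formulas, not the TRL source:

```python
import math

# Sketch (not TRL source) of how cliprange, kl_coef, and kl_estimator
# enter the per-token PPO objective.

def k1_kl(logp, ref_logp):
    # "k1": plain log-ratio; unbiased but higher variance
    return logp - ref_logp

def k3_kl(logp, ref_logp):
    # "k3": (r - 1) - log r with r = ref_prob / prob; biased low-variance,
    # always non-negative
    log_r = ref_logp - logp
    return (math.exp(log_r) - 1.0) - log_r

def clipped_policy_loss(logp_new, logp_old, advantage, cliprange=0.2):
    # Standard PPO surrogate: ratio is clipped to [1 - eps, 1 + eps]
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + cliprange), 1.0 - cliprange) * advantage
    return -min(unclipped, clipped)

# The KL term is scaled by kl_coef (default 0.05) and subtracted from the
# reward-model score before advantages are computed.
kl_penalty = 0.05 * k1_kl(-1.0, -1.2)
```

The clipping keeps the policy update small even when the probability ratio moves far from 1; the k3 estimator trades a small bias for much lower variance than k1.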
Outputs
| Output | Type | Description |
|---|---|---|
| PPOConfig instance | PPOConfig | Fully configured PPO training arguments |
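The signature above also lists computed fields (`batch_size`, `mini_batch_size`, etc.) that are left as `None` and filled in by PPOTrainer. The sketch below shows one plausible set of relationships between the user-set and derived values; the names mirror the config fields, but the exact formulas live in the TRL trainer and may differ:

```python
# Sketch (assumption, not TRL source): plausible derivation of the
# computed batch-size fields from the user-set configuration values.
def derive_batch_sizes(per_device_train_batch_size,
                       gradient_accumulation_steps,
                       num_mini_batches,
                       world_size):
    # Rollout batch gathered on one process before the PPO update
    local_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_mini_batches)
    # Global rollout batch across all processes
    batch_size = local_batch_size * world_size
    # One optimizer step's worth of samples across all processes
    micro_batch_size = per_device_train_batch_size * world_size
    # Each PPO epoch splits the batch into num_mini_batches chunks
    mini_batch_size = batch_size // num_mini_batches
    local_mini_batch_size = local_batch_size // num_mini_batches
    return {
        "local_batch_size": local_batch_size,
        "batch_size": batch_size,
        "micro_batch_size": micro_batch_size,
        "mini_batch_size": mini_batch_size,
        "local_mini_batch_size": local_mini_batch_size,
    }

sizes = derive_batch_sizes(per_device_train_batch_size=64,
                           gradient_accumulation_steps=1,
                           num_mini_batches=1,
                           world_size=8)
```

With the defaults (`num_mini_batches=1`), each PPO epoch runs over the whole rollout batch as a single mini-batch.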
Usage Examples
Command-Line Parsing
from transformers import HfArgumentParser
from trl import ScriptArguments, ModelConfig
from trl.experimental.ppo import PPOConfig
parser = HfArgumentParser((ScriptArguments, PPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_into_dataclasses()
Direct Instantiation
from trl.experimental.ppo import PPOConfig
config = PPOConfig(
output_dir="ppo-output",
per_device_train_batch_size=64,
gradient_accumulation_steps=1,
total_episodes=10000,
learning_rate=3e-6,
num_ppo_epochs=4,
kl_coef=0.05,
cliprange=0.2,
vf_coef=0.1,
gamma=1.0,
lam=0.95,
response_length=53,
temperature=0.7,
sft_model_path="EleutherAI/pythia-1b-deduped",
reward_model_path="my-reward-model",
missing_eos_penalty=1.0,
)
Command-Line Invocation
python examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--output_dir ppo-output \
--per_device_train_batch_size 64 \
--total_episodes 10000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path my-reward-model \
--missing_eos_penalty 1.0
Related Pages
- Principle:Huggingface_Trl_PPO_Argument_Configuration