Implementation:Hiyouga LLaMA Factory PPO Workflow

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Reinforcement Learning, RLHF, Training Workflow
Last Updated	2026-02-06 19:00 GMT

Overview

run_ppo is the end-to-end orchestrator function that assembles and executes the complete PPO/RLHF training pipeline.

Description

The run_ppo function first loads the tokenizer, the template, the dataset prepared for the "ppo" stage, and the policy model with a value head attached. The tokenizer is set to left-padding so that batched generation continues directly from the end of each prompt. The function then creates the reference model and the reward model, initializes a CustomPPOTrainer, and runs the PPO training loop. Once training finishes, it saves the model checkpoint and trainer state, fixes the saved value-head checkpoint via fix_valuehead_checkpoint, and optionally plots the loss. It is the single entry point that wires together every component required for proximal policy optimization training.
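The sequence described above can be sketched as an order-of-operations trace. This is a minimal illustration, not the library's code: run_ppo_sketch and its inner step helper are hypothetical stand-ins that only record step names, and the exact ordering of the post-training steps is an assumption.

```python
def run_ppo_sketch(plot_loss: bool = False) -> list[str]:
    """Record the order of the workflow steps without doing any real work."""
    trace: list[str] = []

    def step(name: str) -> None:
        # Hypothetical helper: a real workflow would call a loader or trainer here.
        trace.append(name)

    # 1. Load everything the policy needs.
    step("load tokenizer (left-padding for generation)")
    step("resolve template")
    step("load dataset for stage 'ppo'")
    step("load policy model with value head")

    # 2. Build the auxiliary models used to score and regularize rollouts.
    step("create reference model")
    step("create reward model")

    # 3. Wire the pieces into the trainer and train.
    step("initialize CustomPPOTrainer")
    step("run PPO training loop")

    # 4. Persist results (ordering here is illustrative).
    step("save model checkpoint")
    step("fix value-head checkpoint")
    step("save trainer state")
    if plot_loss:
        step("plot loss")
    return trace


if __name__ == "__main__":
    for index, name in enumerate(run_ppo_sketch(plot_loss=True), start=1):
        print(f"{index:2d}. {name}")
```

Keeping the steps in a flat sequence like this makes it easy to see that the reference and reward models are created only after the policy model exists.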

Usage

Use run_ppo for RLHF-style training with proximal policy optimization. It is typically invoked by the framework's training dispatcher when the training stage is set to "ppo", rather than called directly. It expects all five argument dataclasses (model, data, training, finetuning, generating) to be fully populated before the call.
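Stage-based dispatch of this kind can be sketched with a lookup table. The names below (STAGE_TO_WORKFLOW, dispatch, and the two stub workflows) are hypothetical and exist only for illustration; the framework's actual dispatcher may be structured differently.

```python
from typing import Callable


def run_sft_stub(**kwargs: object) -> str:
    # Stand-in for a supervised fine-tuning workflow.
    return "sft"


def run_ppo_stub(**kwargs: object) -> str:
    # Stand-in for run_ppo; a real workflow would train and return None.
    return "ppo"


# Hypothetical mapping from stage name to workflow function.
STAGE_TO_WORKFLOW: dict[str, Callable[..., str]] = {
    "sft": run_sft_stub,
    "ppo": run_ppo_stub,
}


def dispatch(stage: str, **workflow_kwargs: object) -> str:
    """Look up the workflow for `stage` and call it with the given arguments."""
    try:
        workflow = STAGE_TO_WORKFLOW[stage]
    except KeyError:
        raise ValueError(f"Unknown training stage: {stage!r}") from None
    return workflow(**workflow_kwargs)


if __name__ == "__main__":
    print(dispatch("ppo", callbacks=None))
```

A table keeps the stage names in one place, so an unsupported stage fails early with a clear error instead of silently falling through.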

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/train/ppo/workflow.py
Lines: 1-79

Signature

def run_ppo(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    generating_args: "GeneratingArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None

Import

from llamafactory.train.ppo.workflow import run_ppo

I/O Contract

Inputs

Name	Type	Required	Description
model_args	ModelArguments	Yes	Model configuration including path, quantization, and adapter settings
data_args	DataArguments	Yes	Dataset configuration including dataset name, split ratios, and preprocessing options
training_args	Seq2SeqTrainingArguments	Yes	HuggingFace training arguments controlling training hyperparameters and output directory
finetuning_args	FinetuningArguments	Yes	Fine-tuning configuration including LoRA, reward model, and plot_loss settings
generating_args	GeneratingArguments	Yes	Generation parameters (temperature, top_p, max_length) used during PPO rollouts
callbacks	Optional[list[TrainerCallback]]	No	Additional trainer callbacks to register with the PPO trainer
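Rollouts generate from batches of prompts of different lengths, which is why the tokenizer is set to left-padding: with padding on the left, every prompt ends at the last position and new tokens continue directly from it. The sketch below shows the difference on plain token-id lists; pad_batch is a hypothetical helper, not part of the framework or of any tokenizer API.

```python
def pad_batch(sequences: list[list[int]], pad_id: int, side: str = "left") -> list[list[int]]:
    """Pad every sequence to the length of the longest one on the given side."""
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    width = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        # Left-padding keeps the final prompt token in the last column.
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


if __name__ == "__main__":
    prompts = [[11, 12, 13], [21]]
    print(pad_batch(prompts, pad_id=0, side="left"))   # [[11, 12, 13], [0, 0, 21]]
    print(pad_batch(prompts, pad_id=0, side="right"))  # [[11, 12, 13], [21, 0, 0]]
```

With right-padding, the second prompt would be followed by pad tokens before any generated text, so a decoder-only model would be continuing from padding rather than from the prompt.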

Outputs

Name	Type	Description
(none)	None	Side effects: saves model checkpoint, trainer state, value-head checkpoint, and optional loss plot to training_args.output_dir
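This page does not spell out what the value-head fix involves. One plausible arrangement, assumed here rather than taken from the source, is that the saved weights of a model with a value head are split into decoder weights and value-head weights so that each can be stored and reloaded separately. The sketch uses plain dictionaries; split_value_head and both key prefixes are hypothetical.

```python
def split_value_head(
    state_dict: dict[str, object],
    head_prefix: str = "v_head.",              # assumed prefix for value-head weights
    wrapper_prefix: str = "pretrained_model.",  # assumed prefix added by a wrapper model
) -> tuple[dict[str, object], dict[str, object]]:
    """Split a combined state dict into (decoder weights, value-head weights)."""
    decoder: dict[str, object] = {}
    value_head: dict[str, object] = {}
    for name, tensor in state_dict.items():
        if name.startswith(head_prefix):
            value_head[name] = tensor
        else:
            # Strip the wrapper prefix so the decoder loads as a plain model.
            if name.startswith(wrapper_prefix):
                name = name[len(wrapper_prefix):]
            decoder[name] = tensor
    return decoder, value_head


if __name__ == "__main__":
    combined = {
        "pretrained_model.layers.0.weight": "W0",
        "layers.1.weight": "W1",
        "v_head.summary.weight": "V",
    }
    decoder, value_head = split_value_head(combined)
    print(sorted(decoder))     # ['layers.0.weight', 'layers.1.weight']
    print(sorted(value_head))  # ['v_head.summary.weight']
```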

Usage Examples

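# Typical invocation from the training dispatcher.
# Assumes the five argument objects below were already parsed from the
# training configuration before this call.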
from llamafactory.train.ppo.workflow import run_ppo

run_ppo(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    generating_args=generating_args,
    callbacks=None,
)

Related Pages

Hiyouga_LLaMA_Factory_PPO_Trainer - The CustomPPOTrainer class used internally
Hiyouga_LLaMA_Factory_RM_Workflow - Reward model training workflow that produces models used by PPO
Hiyouga_LLaMA_Factory_SFT_Workflow - Supervised fine-tuning workflow, often used as a prerequisite step before PPO
Hiyouga_LLaMA_Factory_Trainer_Utils - Utility functions for creating reference and reward models
