Implementation:Hiyouga LLaMA Factory PPO Workflow
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, RLHF, Training Workflow |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
run_ppo is the end-to-end orchestrator function that assembles and executes the complete PPO/RLHF training pipeline.
Description
The run_ppo function loads the tokenizer, the template, the dataset at the "ppo" stage, and the policy model with a value head. The tokenizer is set to left-padding because responses are generated during PPO rollouts. The function then creates the reference model and the reward model, initializes a CustomPPOTrainer, and runs the PPO training loop. Training is followed by checkpoint saving, value-head fixing via fix_valuehead_checkpoint, and optional loss plotting. run_ppo is the single entry point that wires together all components required for proximal policy optimization training.
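The left-padding choice mentioned above can be illustrated with a small, framework-free sketch. The helper pad_batch, the token ids, and the pad id 0 are all hypothetical and are not part of LLaMA Factory; the sketch only shows why left padding suits batched generation with a decoder-only model.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad variable-length token-id lists to a common length on the given side.

    Hypothetical helper for illustration only; real tokenizers handle this
    through their own padding configuration.
    """
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + list(seq) if side == "left" else list(seq) + filler)
    return padded


# Two prompts of different lengths (made-up token ids).
prompts = [[11, 12, 13], [21]]
PAD = 0  # assumed pad token id

left = pad_batch(prompts, PAD, side="left")
right = pad_batch(prompts, PAD, side="right")

# Generation appends new tokens after the last column of the batch.
# With left padding, that column holds the final real token of every prompt.
last_left = [row[-1] for row in left]
# With right padding, shorter prompts end in pad tokens instead.
last_right = [row[-1] for row in right]
```

With left padding, each generated token directly follows the end of its prompt. With right padding, the shorter prompt would be continued after a run of pad tokens.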
Usage
Use run_ppo for RLHF-style training with proximal policy optimization. The framework's training dispatcher typically invokes it when the training stage is set to "ppo". All argument dataclasses (model, data, training, finetuning, generating) must be configured before the call.
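A stage-based dispatch of this kind might look roughly like the sketch below. It is not the framework's actual dispatcher: WORKFLOWS, dispatch, and run_ppo_stub are hypothetical names, and the stub only records its arguments so the example runs without LLaMA Factory installed.

```python
from typing import Any, Callable, Dict, List

# Records each stub invocation so the dispatch can be observed.
calls: List[Dict[str, Any]] = []


def run_ppo_stub(**kwargs: Any) -> None:
    """Hypothetical stand-in for run_ppo; like the real function, it returns None."""
    calls.append(kwargs)


# Assumed registry mapping a stage name to its workflow function.
WORKFLOWS: Dict[str, Callable[..., None]] = {"ppo": run_ppo_stub}


def dispatch(stage: str, **workflow_kwargs: Any) -> None:
    """Look up the workflow for the stage and call it with the prepared arguments."""
    try:
        workflow = WORKFLOWS[stage]
    except KeyError:
        raise ValueError(f"Unknown training stage: {stage!r}") from None
    workflow(**workflow_kwargs)


# Placeholder strings stand in for the pre-configured argument dataclasses.
dispatch(
    "ppo",
    model_args="model_args",
    data_args="data_args",
    training_args="training_args",
    finetuning_args="finetuning_args",
    generating_args="generating_args",
    callbacks=None,
)
```

Keeping the stage-to-function mapping in one place means each workflow only needs to accept the already-built argument objects, which matches the expectation that they are configured before the call.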
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/train/ppo/workflow.py
- Lines: 1-79
Signature
def run_ppo(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    generating_args: "GeneratingArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None
Import
from llamafactory.train.ppo.workflow import run_ppo
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model configuration including path, quantization, and adapter settings |
| data_args | DataArguments | Yes | Dataset configuration including dataset name, split ratios, and preprocessing options |
| training_args | Seq2SeqTrainingArguments | Yes | HuggingFace training arguments controlling training hyperparameters and output directory |
| finetuning_args | FinetuningArguments | Yes | Fine-tuning configuration including LoRA, reward model, and plot_loss settings |
| generating_args | GeneratingArguments | Yes | Generation parameters (temperature, top_p, max_length) used during PPO rollouts |
| callbacks | Optional[list[TrainerCallback]] | No | Additional trainer callbacks to register with the PPO trainer |
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | Side effects: saves model checkpoint, trainer state, value-head checkpoint, and optional loss plot to training_args.output_dir |
Usage Examples
# Typical invocation from the training dispatcher
from llamafactory.train.ppo.workflow import run_ppo

run_ppo(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    generating_args=generating_args,
    callbacks=None,
)
Related Pages
- Hiyouga_LLaMA_Factory_PPO_Trainer - The CustomPPOTrainer class used internally
- Hiyouga_LLaMA_Factory_RM_Workflow - Reward model training workflow that produces models used by PPO
- Hiyouga_LLaMA_Factory_SFT_Workflow - Supervised fine-tuning workflow, often used as a prerequisite step before PPO
- Hiyouga_LLaMA_Factory_Trainer_Utils - Utility functions for creating reference and reward models