Implementation:Hiyouga LLaMA Factory PPO Workflow

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, RLHF, Training Workflow
Last Updated: 2026-02-06 19:00 GMT

Overview

run_ppo is the end-to-end orchestrator function that assembles and executes the complete PPO/RLHF training pipeline.

Description

The run_ppo function loads the tokenizer (configured with left-padding for generation), the template, the dataset prepared for the "ppo" stage, and the policy model with a value head. It then creates the reference model and the reward model, initializes a CustomPPOTrainer, and runs the PPO training loop. The training run includes checkpoint saving, value-head fixing via fix_valuehead_checkpoint, and optional loss plotting. The function is the single entry point that wires together all components required for proximal policy optimization training.
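
The left-padding mentioned above matters because a decoder-only model appends new tokens to the right end of each row during generation, so the real prompt tokens should sit flush against that end. The following is a minimal, library-free sketch of the idea. The helper name pad_batch and the token ids are hypothetical illustrations and are not part of LLaMA Factory or its tokenizer.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad variable-length token-id lists to a common width.

    Hypothetical helper for illustration only; real tokenizers
    handle this internally via their padding-side setting.
    """
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    width = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        if side == "left":
            # Padding goes first, so the prompt ends at the last column.
            padded.append(fill + list(seq))
            mask.append([0] * len(fill) + [1] * len(seq))
        else:
            # Padding goes last, so short prompts are followed by pad tokens.
            padded.append(list(seq) + fill)
            mask.append([1] * len(seq) + [0] * len(fill))
    return padded, mask


prompts = [[11, 12, 13], [21]]  # made-up token ids
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, right_mask = pad_batch(prompts, pad_id=0, side="right")

# With left-padding every row ends in a real prompt token, so generated
# tokens continue the prompt directly instead of following pad tokens.
print(left_ids)   # [[11, 12, 13], [0, 0, 21]]
print(right_ids)  # [[11, 12, 13], [21, 0, 0]]
```

With right-padding, the second row would end in pad tokens and any generated continuation would be separated from its prompt, which is why generation-time batches are usually padded on the left.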

Usage

Use run_ppo when performing RLHF-style training with proximal policy optimization. It is typically invoked by the framework's training dispatcher when the training stage is set to "ppo". All five argument dataclasses (model, data, training, finetuning, generating) must be configured before the call.
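
The stage-based dispatch described above can be pictured as a lookup from a stage name to a workflow function. The sketch below uses stand-in stubs only. The names WORKFLOWS, dispatch, run_ppo_stub, and run_sft_stub are hypothetical, and the stubs return a string purely so the routing is observable (the real run_ppo returns None). It is not the framework's actual dispatcher.

```python
from typing import Any, Callable, Dict


def run_sft_stub(**kwargs: Any) -> str:
    """Stand-in for another stage's workflow (hypothetical)."""
    return "sft"


def run_ppo_stub(**kwargs: Any) -> str:
    """Stand-in for run_ppo; returns a marker instead of training."""
    return "ppo"


# Hypothetical registry mapping a training stage to its workflow.
WORKFLOWS: Dict[str, Callable[..., str]] = {
    "sft": run_sft_stub,
    "ppo": run_ppo_stub,
}


def dispatch(stage: str, **kwargs: Any) -> str:
    """Route the pre-configured argument objects to the stage's workflow."""
    try:
        workflow = WORKFLOWS[stage]
    except KeyError:
        raise ValueError(f"Unknown training stage: {stage!r}") from None
    return workflow(**kwargs)


# Placeholder argument objects; a real call passes configured dataclasses.
result = dispatch(
    "ppo",
    model_args=None,
    data_args=None,
    training_args=None,
    finetuning_args=None,
    generating_args=None,
)
print(result)  # ppo
```

Keeping the routing in one table means each workflow only has to accept the same set of argument objects, which matches the way run_ppo takes its configuration as keyword arguments.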

Code Reference

Source Location

Signature

def run_ppo(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    generating_args: "GeneratingArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None

Import

from llamafactory.train.ppo.workflow import run_ppo

I/O Contract

Inputs

Name | Type | Required | Description
model_args | ModelArguments | Yes | Model configuration including path, quantization, and adapter settings
data_args | DataArguments | Yes | Dataset configuration including dataset name, split ratios, and preprocessing options
training_args | Seq2SeqTrainingArguments | Yes | HuggingFace training arguments controlling training hyperparameters and output directory
finetuning_args | FinetuningArguments | Yes | Fine-tuning configuration including LoRA, reward model, and plot_loss settings
generating_args | GeneratingArguments | Yes | Generation parameters (temperature, top_p, max_length) used during PPO rollouts
callbacks | Optional[list[TrainerCallback]] | No | Additional trainer callbacks to register with the PPO trainer

Outputs

Name | Type | Description
(none) | None | Side effects: saves model checkpoint, trainer state, value-head checkpoint, and optional loss plot to training_args.output_dir

Usage Examples

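# Typical invocation from the training dispatcher.
# The five *_args objects are assumed to be already-configured
# argument dataclasses; they are not constructed in this snippet.
from llamafactory.train.ppo.workflow import run_ppo

run_ppo(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    generating_args=generating_args,
    callbacks=None,  # optional; None is the default
)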
