Implementation:Hiyouga LLaMA Factory DPO Workflow

From Leeroopedia


Knowledge Sources
Domains: Preference Optimization, Training Pipeline
Last Updated: 2026-02-06 19:00 GMT

Overview

Orchestrates the end-to-end Direct Preference Optimization (DPO) training and evaluation workflow.

Description

The run_dpo function implements the complete DPO training pipeline. It first loads the tokenizer, the template, and the pairwise preference dataset (using stage="rm"), then loads the model and, if the finetuning arguments require one, creates a reference model. The trainer is selected by the use_kt flag: KDPOTrainer for KTransformers-accelerated training, CustomDPOTrainer for standard training. The training phase logs and saves metrics, saves the model and trainer state, calculates effective tokens per second when requested, and optionally plots loss curves for keys including "loss" and "rewards/accuracies". The evaluation phase handles one special case: when the reference model is the same object as the model itself, reward-related metrics are removed from the results, because rewards cannot be meaningfully computed against the model itself. Finally, the function creates a model card and optionally pushes it to the HuggingFace Hub.
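The reference-model handling and the evaluation special case described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names pick_reference_model and drop_reward_metrics are hypothetical, and the rule that an evaluation-only run with no ref_model path reuses the policy model is an assumption about when the "same object" case arises.

```python
from typing import Any, Callable, Dict, Optional


def pick_reference_model(
    model: Any,
    use_ref_model: bool,
    ref_model_path: Optional[str],
    do_train: bool,
    create_ref_model: Callable[[], Any],
) -> Optional[Any]:
    """Decide which object serves as the DPO reference model (hypothetical helper)."""
    if not use_ref_model:
        return None  # reference-free setup: no reference model at all
    if ref_model_path is None and not do_train:
        return model  # assumption: eval-only runs reuse the policy model itself
    return create_ref_model()  # otherwise build a separate reference model


def drop_reward_metrics(
    metrics: Dict[str, float], model: Any, ref_model: Optional[Any]
) -> Dict[str, float]:
    """Return a copy of metrics without reward entries when ref_model is model."""
    if ref_model is not model:
        return dict(metrics)  # distinct (or absent) reference: keep everything
    return {key: value for key, value in metrics.items() if "rewards" not in key}


# Usage: an evaluation-only run with no explicit reference model path.
policy = object()
ref = pick_reference_model(policy, True, None, False, create_ref_model=object)
eval_metrics = {"eval_loss": 0.69, "eval_rewards/accuracies": 0.5}
print(drop_reward_metrics(eval_metrics, policy, ref))  # {'eval_loss': 0.69}
```

The identity check (is) rather than equality mirrors the "same object" wording: two separately loaded copies of the same weights would still count as distinct models here.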

Usage

Use this workflow as the main entry point for DPO preference optimization training. It is called by the training tuner when the training stage is set to "dpo".
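The stage-based dispatch mentioned above might look roughly like the following. This is a self-contained sketch with stand-in functions: WORKFLOWS, dispatch, and the stub workflows are hypothetical names, and the real tuner's structure is not shown on this page.

```python
from typing import Callable, Dict


def run_sft_stub(**kwargs) -> str:
    """Stand-in for a supervised fine-tuning workflow."""
    return "sft"


def run_dpo_stub(**kwargs) -> str:
    """Stand-in for run_dpo; the real function returns None and has side effects."""
    return "dpo"


# Hypothetical registry mapping a training stage name to its workflow function.
WORKFLOWS: Dict[str, Callable[..., str]] = {
    "sft": run_sft_stub,
    "dpo": run_dpo_stub,
}


def dispatch(stage: str, **kwargs) -> str:
    """Look up the workflow for the given stage and call it with the parsed arguments."""
    try:
        workflow = WORKFLOWS[stage]
    except KeyError:
        raise ValueError(f"Unknown training stage: {stage!r}") from None
    return workflow(**kwargs)


print(dispatch("dpo", model_args=None, data_args=None))  # dpo
```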

Code Reference

Source Location

Signature

def run_dpo(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None

Import

from llamafactory.train.dpo.workflow import run_dpo

I/O Contract

Inputs

Name | Type | Required | Description
model_args | ModelArguments | Yes | Model loading configuration including path, quantization, and compute settings
data_args | DataArguments | Yes | Dataset configuration including data path, template name, and padding options
training_args | Seq2SeqTrainingArguments | Yes | HuggingFace training arguments including output_dir, do_train, do_eval, resume_from_checkpoint
finetuning_args | FinetuningArguments | Yes | DPO-specific args including use_ref_model, ref_model, pref_beta, plot_loss, include_effective_tokens_per_second, use_kt
callbacks | Optional[list[TrainerCallback]] | No | Additional trainer callbacks to register
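To make the include_effective_tokens_per_second option more concrete, here is one plausible way such a throughput figure could be computed for pairwise data. The formula, the function name, and the field names chosen_input_ids and rejected_input_ids are assumptions made for illustration; the library's actual calculation may differ.

```python
from typing import Iterable, Mapping, Sequence


def effective_tokens_per_second(
    pairs: Iterable[Mapping[str, Sequence[int]]],
    num_epochs: float,
    train_runtime_s: float,
    world_size: int = 1,
) -> float:
    """Estimate per-device throughput over the tokens of each chosen/rejected pair."""
    if train_runtime_s <= 0:
        raise ValueError("train_runtime_s must be positive")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Assumption: both responses of a preference pair count toward the token total.
    tokens_per_epoch = sum(
        len(pair["chosen_input_ids"]) + len(pair["rejected_input_ids"]) for pair in pairs
    )
    return tokens_per_epoch * num_epochs / train_runtime_s / world_size


# Usage: two toy pairs (10 tokens in total), 2 epochs, 4 seconds, one device.
toy_pairs = [
    {"chosen_input_ids": [1, 2, 3], "rejected_input_ids": [4, 5]},
    {"chosen_input_ids": [1], "rejected_input_ids": [2, 3, 4, 5]},
]
print(effective_tokens_per_second(toy_pairs, num_epochs=2, train_runtime_s=4.0))  # 5.0
```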

Outputs

Name | Type | Description
None | None | Side effects: saves model, metrics, and trainer state to output_dir; optionally pushes to Hub

Usage Examples

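# Running DPO training from the tuner.
# model_args, data_args, training_args, and finetuning_args are assumed to be
# already-parsed argument objects of the types listed under Inputs.
from llamafactory.train.dpo.workflow import run_dpo

run_dpo(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    callbacks=None,  # optional: a list of extra TrainerCallback instances
)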

Related Pages
