Implementation:Hiyouga LLaMA Factory RM Workflow

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Reward Modeling, RLHF, Training Workflow
Last Updated	2026-02-06 19:00 GMT

Overview

run_rm is the end-to-end orchestrator function for reward model training, evaluation, and prediction.

Description

The run_rm function loads the tokenizer and template, the pairwise dataset for the "rm" stage, and the model with a value head attached. It then creates a PairwiseDataCollatorWithPadding and initializes a PairwiseTrainer with the ComputeAccuracy metric. Three optional phases follow, each controlled by a flag on training_args: training (including value-head checkpoint fixing and, if requested, loss/accuracy plotting), evaluation, and prediction (which saves chosen/rejected scores to JSONL). The function finishes by creating a model card and optionally pushing it.
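The phase gating described above can be sketched in isolation. This is a minimal illustration, not the code in workflow.py: PhaseFlags and plan_phases are hypothetical names, and only the do_train, do_eval, and do_predict flags come from this page.

```python
from dataclasses import dataclass


@dataclass
class PhaseFlags:
    """Hypothetical stand-in for the do_* flags carried by training_args."""

    do_train: bool = False
    do_eval: bool = False
    do_predict: bool = False


def plan_phases(flags: PhaseFlags) -> list[str]:
    """Return the phases that would run, in order, for the given flags."""
    phases = []
    if flags.do_train:
        phases.append("train")
    if flags.do_eval:
        phases.append("eval")
    if flags.do_predict:
        phases.append("predict")
    # The model card step is described as running at the end of every call.
    phases.append("model_card")
    return phases


plan = plan_phases(PhaseFlags(do_train=True, do_predict=True))
```

Each phase is independent, so a run can, for example, skip training and only score an existing reward model on a prediction set.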

Usage

Use run_rm to train a reward model from pairwise preference data. The framework's training dispatcher invokes it when the training stage is set to "rm". The resulting reward model can then be used downstream by PPO or other RLHF methods.
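Stage-based dispatch of this kind might look roughly like the sketch below. The WORKFLOWS table, the stub functions, and the dispatch helper are all hypothetical and stand in for the real dispatcher, whose exact shape is not shown on this page.

```python
from typing import Callable


def run_rm_stub(config: dict) -> str:
    """Hypothetical placeholder for run_rm; the real function takes parsed argument objects."""
    return "rm workflow"


def run_ppo_stub(config: dict) -> str:
    """Hypothetical placeholder for a PPO workflow entry point."""
    return "ppo workflow"


# Assumed mapping from stage name to workflow entry point.
WORKFLOWS: dict[str, Callable[[dict], str]] = {
    "rm": run_rm_stub,
    "ppo": run_ppo_stub,
}


def dispatch(stage: str, config: dict) -> str:
    """Look up the workflow for a stage and run it, rejecting unknown stages."""
    try:
        workflow = WORKFLOWS[stage]
    except KeyError:
        raise ValueError(f"Unknown training stage: {stage!r}") from None
    return workflow(config)


result = dispatch("rm", {})
```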

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/train/rm/workflow.py
Lines: 1-98

Signature

def run_rm(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None

Import

from llamafactory.train.rm.workflow import run_rm

I/O Contract

Inputs

Name	Type	Required	Description
model_args	ModelArguments	Yes	Model configuration; model is loaded with add_valuehead=True
data_args	DataArguments	Yes	Dataset configuration for pairwise preference data
training_args	Seq2SeqTrainingArguments	Yes	Training hyperparameters; do_train, do_eval, do_predict flags control workflow phases
finetuning_args	FinetuningArguments	Yes	Fine-tuning settings including plot_loss flag
callbacks	Optional[list[TrainerCallback]]	No	Additional trainer callbacks

Outputs

Name	Type	Description
(none)	None	Side effects: saves model with value head, metrics (loss and accuracy), predictions (chosen/rejected scores as JSONL), loss plots, and model card to output_dir

Usage Examples

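# Typical invocation for reward model training.
# The four argument objects are assumed to have been parsed already
# (for example, by the framework's argument parser).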
from llamafactory.train.rm.workflow import run_rm

run_rm(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    callbacks=None,
)

# When do_predict is set, predictions are saved as JSONL,
# one object per preference pair (values below are illustrative):
# {"chosen": 1.23, "rejected": -0.45}
# {"chosen": 0.89, "rejected": -0.12}
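A prediction file in this format can be summarized with a few lines of standard-library Python. The sketch below assumes each line is a JSON object with numeric "chosen" and "rejected" fields, as in the example above; summarize_predictions is a hypothetical helper, not part of LLaMA Factory, and counting a pair as correct when the chosen score exceeds the rejected score is an assumption about how pairwise accuracy is defined.

```python
import json


def summarize_predictions(jsonl_text: str) -> dict:
    """Compute pairwise accuracy and mean score margin from JSONL predictions."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not records:
        raise ValueError("no prediction records found")
    # Margin > 0 means the reward model ranked the chosen response higher.
    margins = [r["chosen"] - r["rejected"] for r in records]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return {
        "count": len(records),
        "accuracy": accuracy,
        "mean_margin": sum(margins) / len(margins),
    }


# Sample text mirroring the illustrative records shown above.
sample = '{"chosen": 1.23, "rejected": -0.45}\n{"chosen": 0.89, "rejected": -0.12}\n'
summary = summarize_predictions(sample)
```

To apply it to a real run, read the prediction file from output_dir into a string and pass it to the helper.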

Related Pages

Hiyouga_LLaMA_Factory_RM_Trainer - The PairwiseTrainer class used internally
Hiyouga_LLaMA_Factory_RM_Metric - ComputeAccuracy metric class
Hiyouga_LLaMA_Factory_PPO_Workflow - PPO training that uses reward models produced by this workflow
Hiyouga_LLaMA_Factory_Callbacks - fix_valuehead_checkpoint utility for value-head checkpointing
