Implementation:Hiyouga LLaMA Factory RM Workflow

From Leeroopedia


Knowledge Sources
Domains: Reward Modeling, RLHF, Training Workflow
Last Updated: 2026-02-06 19:00 GMT

Overview

run_rm is the end-to-end orchestrator function for reward model training, evaluation, and prediction.

Description

The run_rm function loads the tokenizer, template, pairwise dataset at the "rm" stage, and the model with a value head. It creates a PairwiseDataCollatorWithPadding and initializes a PairwiseTrainer with the ComputeAccuracy metric. The function drives three optional phases: training (with value-head checkpoint fixing and loss/accuracy plotting), evaluation, and prediction (saving chosen/rejected scores to JSONL). It concludes by creating and optionally pushing a model card.
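The gating of the three optional phases can be sketched as follows. This is a minimal illustration, not the library's code: PhaseFlags, StubTrainer, and run_phases are hypothetical stand-ins, and the real function also handles checkpoint fixing, plotting, and saving of metrics.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseFlags:
    """Hypothetical stand-in for the do_* flags on the training arguments."""
    do_train: bool = False
    do_eval: bool = False
    do_predict: bool = False


@dataclass
class StubTrainer:
    """Hypothetical trainer that only records which phases were invoked."""
    calls: list = field(default_factory=list)

    def train(self) -> None:
        self.calls.append("train")

    def evaluate(self) -> None:
        self.calls.append("evaluate")

    def predict(self) -> None:
        self.calls.append("predict")


def run_phases(flags: PhaseFlags, trainer: StubTrainer) -> list:
    """Run each optional phase only when its flag is set, then finish with the model card step."""
    if flags.do_train:
        trainer.train()
    if flags.do_eval:
        trainer.evaluate()
    if flags.do_predict:
        trainer.predict()
    # The model card step runs regardless of which phases were enabled.
    trainer.calls.append("model_card")
    return trainer.calls
```

Each phase is independent in this sketch, so a run can, for example, train and predict without evaluating.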

Usage

Use run_rm when training a reward model from pairwise preference data. This is invoked by the framework's training dispatcher when the training stage is set to "rm". The resulting reward model can be used downstream by PPO or other RLHF methods.

Code Reference

Source Location

Signature

def run_rm(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None

Import

from llamafactory.train.rm.workflow import run_rm

I/O Contract

Inputs

Name | Type | Required | Description
model_args | ModelArguments | Yes | Model configuration; the model is loaded with add_valuehead=True
data_args | DataArguments | Yes | Dataset configuration for pairwise preference data
training_args | Seq2SeqTrainingArguments | Yes | Training hyperparameters; the do_train, do_eval, and do_predict flags control which workflow phases run
finetuning_args | FinetuningArguments | Yes | Fine-tuning settings, including the plot_loss flag
callbacks | Optional[list[TrainerCallback]] | No | Additional trainer callbacks

Outputs

Name | Type | Description
(none) | None | Side effects: saves the model with value head, metrics (loss and accuracy), predictions (chosen/rejected scores as JSONL), loss plots, and the model card to output_dir
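To make the loss and accuracy metrics concrete, here is a minimal sketch of how they are commonly defined for pairwise reward models. It assumes a Bradley-Terry style objective, -log(sigmoid(chosen - rejected)), and an accuracy equal to the fraction of pairs where the chosen score is higher. The page does not state the exact formulas used by PairwiseTrainer or ComputeAccuracy, so the function names and definitions below are illustrative assumptions.

```python
import math
from typing import Sequence


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs (assumed objective)."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("chosen and rejected must be non-empty and equal in length")
    # -log(sigmoid(d)) is the same as softplus(-d), with d = chosen - rejected.
    return sum(_softplus(r - c) for c, r in zip(chosen, rejected)) / len(chosen)


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs whose chosen score is strictly greater than the rejected score."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("chosen and rejected must be non-empty and equal in length")
    return sum(c > r for c, r in zip(chosen, rejected)) / len(chosen)
```

Under this definition the loss shrinks as the margin between chosen and rejected scores grows, while accuracy only checks the sign of the margin.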

Usage Examples

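# Typical invocation for reward model training.
# model_args, data_args, training_args, and finetuning_args are assumed to be
# already-parsed argument objects of the types listed under Inputs.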
from llamafactory.train.rm.workflow import run_rm

run_rm(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    callbacks=None,
)

# When do_predict is set, predictions are saved as JSONL:
# {"chosen": 1.23, "rejected": -0.45}
# {"chosen": 0.89, "rejected": -0.12}
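A prediction file in the format shown above can be inspected with a few lines of standard-library Python. The sketch below parses the two sample records from this page held in memory; load_scores and summarize are hypothetical helper names, and the actual output file name is not given here, so no path is assumed.

```python
import json
from typing import Iterable


def load_scores(lines: Iterable[str]) -> list:
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records: list) -> dict:
    """Compute the pair count, pairwise accuracy, and mean score margin."""
    margins = [r["chosen"] - r["rejected"] for r in records]
    return {
        "count": len(records),
        "accuracy": sum(m > 0 for m in margins) / len(margins),
        "mean_margin": sum(margins) / len(margins),
    }


# The two sample records shown in the usage example.
sample = [
    '{"chosen": 1.23, "rejected": -0.45}',
    '{"chosen": 0.89, "rejected": -0.12}',
]
summary = summarize(load_scores(sample))
```

For a real run, the same functions could be applied to an open file handle, since iterating over a text file yields one line at a time.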
