Implementation:Hiyouga LLaMA Factory RM Workflow
| Knowledge Sources | |
|---|---|
| Domains | Reward Modeling, RLHF, Training Workflow |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
run_rm is the end-to-end orchestrator function for reward model training, evaluation, and prediction.
Description
The run_rm function loads the tokenizer, template, pairwise dataset at the "rm" stage, and the model with a value head. It creates a PairwiseDataCollatorWithPadding and initializes a PairwiseTrainer with the ComputeAccuracy metric. The function drives three optional phases: training (with value-head checkpoint fixing and loss/accuracy plotting), evaluation, and prediction (saving chosen/rejected scores to JSONL). It concludes by creating and optionally pushing a model card.
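The accuracy reported for a reward model is usually the fraction of preference pairs in which the chosen response scores higher than the rejected one. The sketch below illustrates that idea in plain Python. It is not the library's ComputeAccuracy implementation, and the function name pairwise_accuracy is hypothetical.

```python
from typing import Sequence


def pairwise_accuracy(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Fraction of pairs where the chosen score is strictly higher.

    Hypothetical helper for illustration; the real metric class may
    differ in input types and in how ties are handled.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("chosen and rejected score lists must have equal length")
    if len(chosen_scores) == 0:
        raise ValueError("at least one pair is required")
    # A pair counts as correct when the reward model prefers the chosen response.
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


if __name__ == "__main__":
    print(pairwise_accuracy([1.23, 0.89, -0.50], [-0.45, -0.12, 0.30]))
```

Ties are counted as incorrect here, which is an assumption of this sketch rather than documented behavior.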
Usage
Use run_rm to train a reward model from pairwise preference data. The framework's training dispatcher invokes it when the training stage is set to "rm". The resulting reward model can be used downstream by PPO or other RLHF methods.
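Stage-based dispatch of this kind can be pictured as a lookup from stage name to workflow function. The following is a minimal, self-contained sketch of that pattern only. The names WORKFLOWS, dispatch, and run_rm_stub are hypothetical stand-ins and do not reflect the framework's actual dispatcher code.

```python
from typing import Callable, Dict


def run_rm_stub(config: dict) -> str:
    """Stand-in for the real run_rm; it only reports what it would do."""
    return f"reward-model workflow on dataset {config['dataset']!r}"


# Hypothetical registry mapping a stage name to its workflow function.
WORKFLOWS: Dict[str, Callable[[dict], str]] = {
    "rm": run_rm_stub,
}


def dispatch(config: dict) -> str:
    """Select and run the workflow registered for config['stage']."""
    stage = config.get("stage")
    if stage not in WORKFLOWS:
        raise ValueError(f"Unknown training stage: {stage!r}")
    return WORKFLOWS[stage](config)


if __name__ == "__main__":
    print(dispatch({"stage": "rm", "dataset": "pairwise_prefs"}))
```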
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/train/rm/workflow.py
- Lines: 1-98
Signature
def run_rm(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
) -> None
Import
from llamafactory.train.rm.workflow import run_rm
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model configuration; model is loaded with add_valuehead=True |
| data_args | DataArguments | Yes | Dataset configuration for pairwise preference data |
| training_args | Seq2SeqTrainingArguments | Yes | Training hyperparameters; do_train, do_eval, do_predict flags control workflow phases |
| finetuning_args | FinetuningArguments | Yes | Fine-tuning settings including plot_loss flag |
| callbacks | Optional[list[TrainerCallback]] | No | Additional trainer callbacks |
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | Side effects: saves model with value head, metrics (loss and accuracy), predictions (chosen/rejected scores as JSONL), loss plots, and model card to output_dir |
Usage Examples
# Typical invocation for reward model training
from llamafactory.train.rm.workflow import run_rm
run_rm(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetuning_args=finetuning_args,
    callbacks=None,
)
# When do_predict is set, predictions are saved as JSONL:
# {"chosen": 1.23, "rejected": -0.45}
# {"chosen": 0.89, "rejected": -0.12}
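Because each prediction line is a JSON object with "chosen" and "rejected" keys, the output can be inspected with the standard library alone. The sketch below assumes that record shape, taken from the example above. The helper names and the file path passed in are hypothetical, so point it at whichever JSONL file the prediction phase wrote to output_dir.

```python
import json
from typing import List, Tuple


def load_reward_predictions(path: str) -> List[Tuple[float, float]]:
    """Read (chosen, rejected) score pairs from a JSONL predictions file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            pairs.append((float(record["chosen"]), float(record["rejected"])))
    return pairs


def mean_margin(pairs: List[Tuple[float, float]]) -> float:
    """Average of (chosen - rejected); larger values mean clearer separation."""
    if not pairs:
        raise ValueError("no prediction pairs to summarize")
    return sum(chosen - rejected for chosen, rejected in pairs) / len(pairs)
```

A positive mean margin indicates that, on average, the reward model scores chosen responses above rejected ones.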
Related Pages
- Hiyouga_LLaMA_Factory_RM_Trainer - The PairwiseTrainer class used internally
- Hiyouga_LLaMA_Factory_RM_Metric - ComputeAccuracy metric class
- Hiyouga_LLaMA_Factory_PPO_Workflow - PPO training that uses reward models produced by this workflow
- Hiyouga_LLaMA_Factory_Callbacks - fix_valuehead_checkpoint utility for value-head checkpointing