Implementation:Hiyouga LLaMA Factory RM Trainer

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Reward Modeling, RLHF, Trainer
Last Updated	2026-02-06 19:00 GMT

Overview

PairwiseTrainer is a custom HuggingFace Trainer subclass that implements the Bradley-Terry pairwise loss for reward model training.

Description

PairwiseTrainer extends the HuggingFace Trainer class and overrides compute_loss. The override splits each concatenated batch into chosen and rejected halves, uses gather to extract the scalar reward score at the last non-padding token of each sequence, and returns the mean negative log-sigmoid of the score difference as the training loss. The class also uses FixValueHeadModelCallback for correct value-head checkpointing, supports custom optimizers and schedulers, deduplicates shared tensors in _save for safetensors compatibility, and provides a save_predictions method that writes chosen/rejected reward scores as JSONL.
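The loss computation described above can be sketched without any framework, so that each step is visible. This is a minimal illustration in plain Python, not the trainer's actual code: the real implementation operates on torch tensors, and the helper names (log_sigmoid, pairwise_loss) as well as the assumption that sequences are right-padded are introduced here for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)); exp never overflows here.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_loss(values, attention_mask):
    """Illustrative stand-in for compute_loss on nested Python lists.

    values[i][t]         : per-token scalar produced by the value head
    attention_mask[i][t] : 1 for real tokens, 0 for padding (right padding assumed)
    Rows 0..half-1 are chosen examples, rows half..end are rejected examples.
    """
    half = len(values) // 2

    # Step 1: split the concatenated batch into chosen and rejected halves.
    chosen_vals, rejected_vals = values[:half], values[half:]
    chosen_mask, rejected_mask = attention_mask[:half], attention_mask[half:]

    # Step 2: pick the value at the last non-padding position of each row.
    def last_token_score(row, mask):
        return row[sum(mask) - 1]

    chosen_scores = [last_token_score(v, m) for v, m in zip(chosen_vals, chosen_mask)]
    rejected_scores = [last_token_score(v, m) for v, m in zip(rejected_vals, rejected_mask)]

    # Step 3: mean of -log(sigmoid(chosen - rejected)) over all pairs.
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses), chosen_scores, rejected_scores
```

In a tensor-based version, the three steps would typically map onto torch.split, Tensor.gather and torch.nn.functional.logsigmoid.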

Usage

Use PairwiseTrainer when training a reward model from human preference data in a pairwise (chosen vs. rejected) format. It is instantiated by the run_rm workflow function and expects input batches where the first half contains chosen examples and the second half contains rejected examples.
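As a rough illustration of the expected batch layout, the following sketch builds a concatenated batch from preference pairs. The function name collate_pairs, the keys chosen_input_ids and rejected_input_ids, and the default pad_token_id are assumptions made for this example; the data collator actually used by run_rm may differ.

```python
def collate_pairs(pairs, pad_token_id=0):
    """Hypothetical collator: all chosen sequences first, then all rejected ones.

    Each element of `pairs` is a dict with two token-id lists. The result is
    right-padded so that row i (chosen) and row i + len(pairs) (rejected)
    belong to the same preference pair.
    """
    sequences = [p["chosen_input_ids"] for p in pairs]
    sequences += [p["rejected_input_ids"] for p in pairs]

    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

With two pairs, the resulting batch has four rows: rows 0 and 1 hold the chosen responses, rows 2 and 3 the matching rejected responses.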

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/train/rm/trainer.py
Lines: 1-150

Signature

class PairwiseTrainer(Trainer):
    def __init__(
        self,
        finetuning_args: "FinetuningArguments",
        processor: Optional["ProcessorMixin"],
        **kwargs,
    ) -> None

    def create_optimizer(self) -> "torch.optim.Optimizer"

    def create_scheduler(
        self,
        num_training_steps: int,
        optimizer: Optional["torch.optim.Optimizer"] = None,
    ) -> "torch.optim.lr_scheduler.LRScheduler"

    def compute_loss(
        self,
        model: "PreTrainedModel",
        inputs: dict[str, "torch.Tensor"],
        return_outputs: bool = False,
        **kwargs,
    ) -> Union["torch.Tensor", tuple["torch.Tensor", list["torch.Tensor"]]]

    def save_predictions(self, predict_results: "PredictionOutput") -> None

Import

from llamafactory.train.rm.trainer import PairwiseTrainer

I/O Contract

Inputs

Name	Type	Required	Description
finetuning_args	FinetuningArguments	Yes	Fine-tuning configuration including use_badam and disable_shuffling flags
processor	Optional[ProcessorMixin]	Yes (may be None)	Multimodal processor; if not None, a SaveProcessorCallback is added
**kwargs	dict	Yes	Passed to parent Trainer; must include model, args, data_collator, train_dataset, etc.

Outputs

Name	Type	Description
loss (from compute_loss)	torch.Tensor	Bradley-Terry loss: -logsigmoid(chosen_score - rejected_score).mean()
outputs (from compute_loss, optional)	tuple[torch.Tensor, list[torch.Tensor]]	When return_outputs=True, returns (loss, [loss, chosen_scores, rejected_scores])
generated_predictions.jsonl (from save_predictions)	File	JSONL file with chosen and rejected reward scores per example
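
Written out, the loss in the table corresponds to the standard Bradley-Terry formulation. The symbols below (r_θ for the reward model, x for the prompt, y_c and y_r for the chosen and rejected responses, σ for the logistic sigmoid, N for the number of pairs in the batch) are notation chosen here, not taken from the source file.

```latex
P(y_c \succ y_r \mid x) = \sigma\bigl(r_\theta(x, y_c) - r_\theta(x, y_r)\bigr),
\qquad
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \log \sigma\bigl(r_\theta(x_i, y_{c,i}) - r_\theta(x_i, y_{r,i})\bigr)
```

Minimizing this loss pushes the chosen score above the rejected score for each pair; only the score difference matters, so the absolute scale of the rewards is not constrained by the loss itself.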

Usage Examples

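# Typically instantiated by run_rm rather than directly.
# model, training_args, finetuning_args, data_collator, callbacks, processor,
# tokenizer, train_dataset and eval_dataset are assumed to be prepared beforehand.
from llamafactory.train.rm.metric import ComputeAccuracy
from llamafactory.train.rm.trainer import PairwiseTrainer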

trainer = PairwiseTrainer(
    model=model,
    args=training_args,
    finetuning_args=finetuning_args,
    data_collator=data_collator,
    callbacks=callbacks,
    compute_metrics=ComputeAccuracy(),
    processor=processor,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Training
train_result = trainer.train()

# Prediction with score saving
predict_results = trainer.predict(eval_dataset)
trainer.save_predictions(predict_results)
# Each line of generated_predictions.jsonl has this form (values are illustrative):
# {"chosen": 1.23, "rejected": -0.45}
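
To make the output format concrete, here is a small sketch that writes scores in the same one-object-per-line shape and reads them back to compute how often the chosen response scores higher. The helpers write_scores and pairwise_accuracy are hypothetical and not part of LLaMA Factory; rounding to two decimals is an assumption based on the sample output above.

```python
import json
import os
import tempfile


def write_scores(path, chosen_scores, rejected_scores):
    """Write one JSON object per line, mirroring the documented JSONL shape."""
    with open(path, "w", encoding="utf-8") as f:
        for c, r in zip(chosen_scores, rejected_scores):
            # Rounding to two decimals is an assumption for this sketch.
            record = {"chosen": round(float(c), 2), "rejected": round(float(r), 2)}
            f.write(json.dumps(record) + "\n")


def pairwise_accuracy(path):
    """Fraction of pairs in which the chosen score exceeds the rejected score."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    correct = sum(rec["chosen"] > rec["rejected"] for rec in records)
    return correct / len(records)


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    out_path = os.path.join(out_dir, "generated_predictions.jsonl")
    write_scores(out_path, [1.234, 0.1, -0.5], [-0.451, 0.3, -0.9])
    print(pairwise_accuracy(out_path))
```

Reading the file back this way is a quick sanity check on a trained reward model: an accuracy near 0.5 on held-out pairs suggests the model has not learned to separate chosen from rejected responses.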

Related Pages

Hiyouga_LLaMA_Factory_RM_Workflow - The workflow orchestrator that creates and drives PairwiseTrainer
Hiyouga_LLaMA_Factory_RM_Metric - ComputeAccuracy metric used with PairwiseTrainer
Hiyouga_LLaMA_Factory_Callbacks - FixValueHeadModelCallback and SaveProcessorCallback used internally
Hiyouga_LLaMA_Factory_PPO_Workflow - PPO training that consumes the reward models produced here
