Overview
Description
The RewardTrainer class is the primary entry point for training outcome-supervised reward models (ORMs). Its __init__ method orchestrates model loading, tokenizer setup, PEFT wrapping, dropout disabling, data collation, and dataset preparation. The compute_loss method implements the Bradley-Terry pairwise preference loss, with an optional per-example margin and optional reward-centering regularization (center_rewards_coefficient).
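As a rough sketch of that objective, the loss can be written as follows, using the formula from the compute_loss docstring. Here r_c and r_r are the chosen and rejected rewards, m is the optional margin (zero when absent), and \lambda stands for center_rewards_coefficient. The exact form of the centering term (mean squared sum of the two rewards) is an assumption based on the commonly used formulation; this page does not state it.

```latex
\mathcal{L}
  = -\,\mathbb{E}\left[\log \sigma\left(r_c - r_r - m\right)\right]
  + \lambda\,\mathbb{E}\left[\left(r_c + r_r\right)^2\right]
```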
Usage
Instantiate with a model (string path or pre-loaded), training configuration, and preference dataset. Call train() to execute the training loop.
Code Reference
Source Location
- __init__: trl/trainer/reward_trainer.py lines 272-475
- compute_loss: trl/trainer/reward_trainer.py lines 570-609
Signature
class RewardTrainer(BaseTrainer):
    _tag_names = ["trl", "reward-trainer"]
    _name = "Reward"

    def __init__(
        self,
        model: "str | PreTrainedModel | PeftModel",
        args: RewardConfig | None = None,
        data_collator: DataCollator | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | None = None,
        compute_metrics: Callable[[EvalPrediction], dict] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        optimizer_cls_and_kwargs: tuple[type[torch.optim.Optimizer], dict[str, Any]] | None = None,
        preprocess_logits_for_metrics: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
        peft_config: "PeftConfig | None" = None,
    ):
        ...

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute the Bradley-Terry preference loss.

        Steps:
            1. Forward pass through the model with use_cache=False.
            2. Split logits into chosen and rejected rewards via torch.chunk(..., chunks=2).
            3. Compute loss: -logsigmoid(r_chosen - r_rejected - margin).mean()
            4. Optionally add center_rewards regularization.
            5. Track metrics: accuracy, margin, min/mean/max reward.

        Returns:
            loss (or (loss, outputs) if return_outputs=True)
        """
Import
from trl import RewardTrainer, RewardConfig
I/O Contract
__init__ Inputs
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str or PreTrainedModel or PeftModel | (required) | Model to train; a string triggers automatic loading via AutoModelForSequenceClassification |
| args | RewardConfig or None | None | Training configuration; if None, a default configuration is created based on the model name |
| data_collator | DataCollator or None | None | Batch collator; defaults to DataCollatorForPreference |
| train_dataset | Dataset or IterableDataset or None | None | Preference dataset with chosen/rejected pairs |
| eval_dataset | Dataset or IterableDataset or dict or None | None | Evaluation dataset(s) |
| processing_class | PreTrainedTokenizerBase or None | None | Tokenizer; auto-loaded from the model config if None |
| peft_config | PeftConfig or None | None | PEFT configuration for parameter-efficient training |
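To make "preference dataset with chosen/rejected pairs" concrete, here is a minimal sketch of what such rows might look like. The rows and the helper missing_preference_columns are hypothetical, written for this page. The optional "prompt" column is an assumption about one common layout, not something this page specifies.

```python
# Hypothetical preference rows: each pairs a preferred ("chosen") completion
# with a less preferred ("rejected") one. The "prompt" column is an assumed,
# optional extra used in explicit-prompt layouts.
preference_rows = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Name a primary color.", "chosen": "Red", "rejected": "Table"},
]

# Columns this page says a preference dataset carries.
REQUIRED_COLUMNS = ("chosen", "rejected")


def missing_preference_columns(row: dict) -> list[str]:
    """Return the required preference columns that are absent from a row."""
    return [name for name in REQUIRED_COLUMNS if name not in row]
```

A quick check like this before constructing the trainer can catch a mislabeled column early, instead of failing during dataset preparation.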
compute_loss Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| model | nn.Module | The reward model (or wrapped PEFT model) |
| inputs | dict | Batch dict with "input_ids", "attention_mask", and optionally "margin" |
| return_outputs | bool | Whether to return model outputs alongside the loss |
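Because compute_loss splits the logits in half with torch.chunk(..., chunks=2), the batch is presumably laid out with all chosen sequences first and all rejected sequences second. The plain-Python sketch below illustrates that layout only. It is not the DataCollatorForPreference implementation, and the function name collate_preference, the field names chosen_input_ids / rejected_input_ids, and right-padding with pad_token_id are all assumptions.

```python
def collate_preference(examples, pad_token_id=0):
    """Stack chosen sequences first, then rejected ones, padded to equal length."""
    # Chosen rows occupy the first half of the batch, rejected rows the second,
    # so splitting the model output in two recovers the pairs in order.
    sequences = [ex["chosen_input_ids"] for ex in examples]
    sequences += [ex["rejected_input_ids"] for ex in examples]

    max_len = max(len(seq) for seq in sequences)
    batch = {
        # Right-pad every sequence to the longest one (assumed padding side).
        "input_ids": [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences],
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences],
    }
    # The margin is optional and has one value per pair, not per sequence.
    if all("margin" in ex for ex in examples):
        batch["margin"] = [ex["margin"] for ex in examples]
    return batch
```

With this layout, row i of the first half and row i of the second half belong to the same preference pair.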
compute_loss Outputs
| Output | Type | Description |
| --- | --- | --- |
| loss | torch.Tensor | Bradley-Terry preference loss scalar |
| outputs | ModelOutput | (Optional) Full model forward pass outputs |
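The loss value described above can be illustrated with a small plain-Python sketch of the docstring's steps 2 to 4. This is not the trainer's code, which operates on torch tensors. The function names are hypothetical, and the centering term (mean of the squared sum of chosen and rejected rewards) is an assumption about how center_rewards regularization is defined.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Identity: log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(rewards, margins=None, center_coefficient=None):
    """Pairwise preference loss over a flat list of rewards.

    `rewards` must have even length: the first half holds the chosen rewards
    and the second half the rejected ones, mirroring a two-way chunk split.
    """
    half = len(rewards) // 2
    chosen, rejected = rewards[:half], rewards[half:]
    if margins is None:
        margins = [0.0] * half  # no margin: plain Bradley-Terry

    # Step 3: mean of -logsigmoid(r_chosen - r_rejected - margin)
    loss = sum(-log_sigmoid(c - r - m) for c, r, m in zip(chosen, rejected, margins)) / half

    # Step 4 (assumed form): penalize rewards that drift away from zero.
    if center_coefficient is not None:
        loss += center_coefficient * sum((c + r) ** 2 for c, r in zip(chosen, rejected)) / half
    return loss
```

When the chosen and rejected rewards are equal and no margin is set, the loss is log(2); it falls as the chosen reward pulls ahead of the rejected one.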
Tracked Metrics
| Metric | Computation |
| --- | --- |
| accuracy | mean(r_chosen > r_rejected) |
| margin | mean(r_chosen - r_rejected) |
| min_reward | min(all_rewards) |
| mean_reward | mean(all_rewards) |
| max_reward | max(all_rewards) |
| num_tokens | Cumulative token count across training |
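The per-batch metrics in the table can be reproduced with a few lines of plain Python. The sketch below follows the table's formulas only. The function name preference_metrics is hypothetical, and num_tokens is left out because it accumulates across training steps rather than being computed per batch.

```python
def preference_metrics(chosen, rejected):
    """Compute the per-batch reward metrics from paired reward lists."""
    pairs = list(zip(chosen, rejected))
    all_rewards = list(chosen) + list(rejected)
    return {
        # Fraction of pairs where the chosen response scores higher.
        "accuracy": sum(c > r for c, r in pairs) / len(pairs),
        # Average gap between chosen and rejected rewards.
        "margin": sum(c - r for c, r in pairs) / len(pairs),
        "min_reward": min(all_rewards),
        "mean_reward": sum(all_rewards) / len(all_rewards),
        "max_reward": max(all_rewards),
    }
```

Note that the "margin" metric here is the observed reward gap, which is distinct from the optional "margin" input that shifts the loss.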
Usage Examples
Minimal Training Example
from datasets import load_dataset
from trl import RewardTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
Full Configuration Example
from datasets import load_dataset
from trl import RewardTrainer, RewardConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
eval_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")

config = RewardConfig(
    output_dir="reward-model",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    max_length=512,
    center_rewards_coefficient=0.01,
    eval_strategy="steps",
    eval_steps=500,
)

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=config,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model("reward-model-final")
Related Pages