Implementation:Huggingface Trl RewardTrainer Init Train
| Property | Value |
|---|---|
| Implementation Name | RewardTrainer Init Train |
| Technology | Huggingface TRL |
| Type | API Doc |
| Workflow | Reward Model Training |
| Paper | InstructGPT (https://arxiv.org/abs/2203.02155) |
| Principle | Principle:Huggingface_Trl_Reward_Model_Training |
Overview
Description
The RewardTrainer class is the primary entry point for training outcome-supervised reward models (ORM). Its __init__ method orchestrates model loading, tokenizer setup, PEFT wrapping, dropout disabling, data collation, and dataset preparation. The compute_loss method implements the Bradley-Terry pairwise preference loss with optional margin support and center rewards regularization.
Usage
Instantiate with a model (string path or pre-loaded), training configuration, and preference dataset. Call train() to execute the training loop.
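The trainer expects the dataset in TRL's preference format, i.e. records containing paired chosen and rejected completions. As a hedged sketch (the concrete messages below are illustrative, not taken from the actual dataset), one conversational-format record looks like:
example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}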
Code Reference
Source Location
- __init__: trl/trainer/reward_trainer.py, lines 272-475
- compute_loss: trl/trainer/reward_trainer.py, lines 570-609
Signature
class RewardTrainer(BaseTrainer):
    _tag_names = ["trl", "reward-trainer"]
    _name = "Reward"

    def __init__(
        self,
        model: "str | PreTrainedModel | PeftModel",
        args: RewardConfig | None = None,
        data_collator: DataCollator | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | None = None,
        compute_metrics: Callable[[EvalPrediction], dict] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        optimizer_cls_and_kwargs: tuple[type[torch.optim.Optimizer], dict[str, Any]] | None = None,
        preprocess_logits_for_metrics: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
        peft_config: "PeftConfig | None" = None,
    ):
        ...

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute the Bradley-Terry preference loss.

        Steps:
        1. Forward pass through the model with use_cache=False.
        2. Split logits into chosen and rejected rewards via torch.chunk(..., chunks=2).
        3. Compute loss: -logsigmoid(r_chosen - r_rejected - margin).mean()
        4. Optionally add center_rewards regularization.
        5. Track metrics: accuracy, margin, min/mean/max reward.

        Returns:
            loss (or (loss, outputs) if return_outputs=True)
        """
Import
from trl import RewardTrainer, RewardConfig
I/O Contract
__init__ Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str or PreTrainedModel or PeftModel | (required) | Model to train; string triggers automatic loading via AutoModelForSequenceClassification |
| args | RewardConfig or None | None | Training configuration; defaults are created from model name |
| data_collator | DataCollator or None | None | Batch collator; defaults to DataCollatorForPreference |
| train_dataset | Dataset or IterableDataset | None | Preference dataset with chosen/rejected pairs |
| eval_dataset | Dataset or IterableDataset or dict | None | Evaluation dataset(s) |
| processing_class | PreTrainedTokenizerBase or None | None | Tokenizer; auto-loaded from model config if None |
| peft_config | PeftConfig or None | None | PEFT configuration for parameter-efficient training |
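As a sketch of passing a pre-loaded model, tokenizer, and PEFT configuration instead of a string path (the LoRA hyperparameters below are illustrative assumptions, not recommended values):
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# num_labels=1 so the classification head produces a single scalar reward
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model-lora"),
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    processing_class=tokenizer,
    peft_config=LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32),  # illustrative values
)
trainer.train()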
compute_loss Inputs
| Parameter | Type | Description |
|---|---|---|
| model | nn.Module | The reward model (or wrapped PEFT model) |
| inputs | dict | Batch dict with "input_ids", "attention_mask", and optionally "margin" |
| return_outputs | bool | Whether to return model outputs alongside the loss |
compute_loss Outputs
| Output | Type | Description |
|---|---|---|
| loss | torch.Tensor | Bradley-Terry preference loss scalar |
| outputs | ModelOutput | (Optional) Full model forward pass outputs |
Tracked Metrics
| Metric | Computation |
|---|---|
| accuracy | mean(r_chosen > r_rejected) |
| margin | mean(r_chosen - r_rejected) |
| min_reward | min(all_rewards) |
| mean_reward | mean(all_rewards) |
| max_reward | max(all_rewards) |
| num_tokens | Cumulative token count across training |
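A hedged re-implementation of the per-batch metrics above from chosen/rejected reward tensors (a standalone sketch, not the trainer's internal logging code):
import torch

def reward_metrics(chosen_rewards, rejected_rewards):
    all_rewards = torch.cat([chosen_rewards, rejected_rewards])
    return {
        "accuracy": (chosen_rewards > rejected_rewards).float().mean().item(),
        "margin": (chosen_rewards - rejected_rewards).mean().item(),
        "min_reward": all_rewards.min().item(),
        "mean_reward": all_rewards.mean().item(),
        "max_reward": all_rewards.max().item(),
    }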
Usage Examples
Minimal Training Example
from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
Full Configuration Example
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
eval_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")

config = RewardConfig(
    output_dir="reward-model",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    max_length=512,
    center_rewards_coefficient=0.01,
    eval_strategy="steps",
    eval_steps=500,
)

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=config,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model("reward-model-final")
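After training, the saved checkpoint can be reloaded as a sequence-classification model and used to score responses. A sketch, assuming the tokenizer was saved alongside the model and that inputs are formatted the same way as during training:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model-final")
tokenizer = AutoTokenizer.from_pretrained("reward-model-final")

text = "User: What is the capital of France?\nAssistant: The capital of France is Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # single-label head -> logits has shape (1, 1); the scalar is the reward score
    reward = reward_model(**inputs).logits[0, 0].item()
print(reward)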