
Implementation:Huggingface Trl RewardTrainer Init Train



Property            | Value
------------------- | -----
Implementation Name | RewardTrainer Init Train
Technology          | Huggingface TRL
Type                | API Doc
Workflow            | Reward Model Training
Paper               | InstructGPT (https://arxiv.org/abs/2203.02155)
Principle           | Principle:Huggingface_Trl_Reward_Model_Training

Overview

Description

The RewardTrainer class is the primary entry point for training outcome-supervised reward models (ORMs). Its __init__ method orchestrates model loading, tokenizer setup, PEFT wrapping, dropout disabling, data collation, and dataset preparation. The compute_loss method implements the Bradley-Terry pairwise preference loss, with optional margin support and center-rewards regularization.
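
In symbols: writing r_θ(x, y) for the scalar reward, y_c and y_r for the chosen and rejected completions, m for the optional per-example margin, and λ for center_rewards_coefficient, the objective is (a standard statement of the Bradley-Terry loss, consistent with the steps listed under compute_loss below):

\mathcal{L}(\theta) = -\,\mathbb{E}\big[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r) - m\big)\big] + \lambda\,\mathbb{E}\big[\big(r_\theta(x, y_c) + r_\theta(x, y_r)\big)^2\big]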

Usage

Instantiate with a model (a checkpoint path/ID string or a pre-loaded model), a training configuration, and a preference dataset. Call train() to execute the training loop.

Code Reference

Source Location

  • __init__: trl/trainer/reward_trainer.py lines 272-475
  • compute_loss: trl/trainer/reward_trainer.py lines 570-609

Signature

class RewardTrainer(BaseTrainer):
    _tag_names = ["trl", "reward-trainer"]
    _name = "Reward"

    def __init__(
        self,
        model: "str | PreTrainedModel | PeftModel",
        args: RewardConfig | None = None,
        data_collator: DataCollator | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | None = None,
        compute_metrics: Callable[[EvalPrediction], dict] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        optimizer_cls_and_kwargs: tuple[type[torch.optim.Optimizer], dict[str, Any]] | None = None,
        preprocess_logits_for_metrics: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] | None = None,
        peft_config: "PeftConfig | None" = None,
    ):
        ...

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute the Bradley-Terry preference loss.

        Steps:
        1. Forward pass through the model with use_cache=False.
        2. Split logits into chosen and rejected rewards via torch.chunk(..., chunks=2).
        3. Compute loss: -logsigmoid(r_chosen - r_rejected - margin).mean()
        4. Optionally add center_rewards regularization.
        5. Track metrics: accuracy, margin, min/mean/max reward.

        Returns:
            loss (or (loss, outputs) if return_outputs=True)
        """

Import

from trl import RewardTrainer, RewardConfig

I/O Contract

__init__ Inputs

Parameter        | Type                                  | Default    | Description
---------------- | ------------------------------------- | ---------- | -----------
model            | str, PreTrainedModel, or PeftModel    | (required) | Model to train; a string triggers automatic loading via AutoModelForSequenceClassification
args             | RewardConfig or None                  | None       | Training configuration; defaults are derived from the model name
data_collator    | DataCollator or None                  | None       | Batch collator; defaults to DataCollatorForPreference
train_dataset    | Dataset or IterableDataset            | None       | Preference dataset with chosen/rejected pairs
eval_dataset     | Dataset, IterableDataset, or dict     | None       | Evaluation dataset(s)
processing_class | PreTrainedTokenizerBase or None       | None       | Tokenizer; auto-loaded from the model config if None
peft_config      | PeftConfig or None                    | None       | PEFT configuration for parameter-efficient training
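
The peft_config parameter enables LoRA-style parameter-efficient training. A hedged sketch using peft's LoraConfig; the rank and alpha values here are illustrative choices, not documented defaults:

from datasets import load_dataset
from peft import LoraConfig
from trl import RewardTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Reward models are sequence classifiers, hence the SEQ_CLS task type.
peft_config = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32)

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
    peft_config=peft_config,  # RewardTrainer wraps the base model with PEFT
)
trainer.train()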

compute_loss Inputs

Parameter          | Type        | Description
------------------ | ----------- | -----------
model              | nn.Module   | The reward model (or wrapped PEFT model)
inputs             | dict        | Batch dict with "input_ids", "attention_mask", and optionally "margin"
return_outputs     | bool        | Whether to return model outputs alongside the loss
num_items_in_batch | int or None | Number of items in the batch, forwarded by the Trainer's training loop

compute_loss Outputs

Output  | Type         | Description
------- | ------------ | -----------
loss    | torch.Tensor | Scalar Bradley-Terry preference loss
outputs | ModelOutput  | (Optional) Full outputs of the model's forward pass

Tracked Metrics

Metric      | Computation
----------- | -----------
accuracy    | mean(r_chosen > r_rejected)
margin      | mean(r_chosen - r_rejected)
min_reward  | min(all_rewards)
mean_reward | mean(all_rewards)
max_reward  | max(all_rewards)
num_tokens  | Cumulative token count across training
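
For reference, these metrics reduce to a few tensor reductions over the chunked rewards. A sketch, not the trainer's actual logging code:

import torch

def reward_metrics(rewards_chosen: torch.Tensor, rewards_rejected: torch.Tensor) -> dict:
    all_rewards = torch.cat([rewards_chosen, rewards_rejected])
    return {
        "accuracy": (rewards_chosen > rewards_rejected).float().mean().item(),
        "margin": (rewards_chosen - rewards_rejected).mean().item(),
        "min_reward": all_rewards.min().item(),
        "mean_reward": all_rewards.mean().item(),
        "max_reward": all_rewards.max().item(),
    }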

Usage Examples

Minimal Training Example

from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()

Full Configuration Example

from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
eval_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")

config = RewardConfig(
    output_dir="reward-model",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    max_length=512,
    center_rewards_coefficient=0.01,
    eval_strategy="steps",
    eval_steps=500,
)

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=config,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model("reward-model-final")
