Principle:Huggingface Trl Reward Sequence Classifier Loading
| Property | Value |
|---|---|
| Principle Name | Reward Sequence Classifier Loading |
| Technology | Huggingface TRL |
| Category | Model Architecture |
| Workflow | Reward Model Training |
| Implementation | Implementation:Huggingface_Trl_AutoModelForSequenceClassification_From_Pretrained |
Overview
Description
Reward models in RLHF are constructed by loading a pretrained language model and attaching a linear classification head that produces a single scalar reward value. In Huggingface TRL, this is accomplished by loading the model through AutoModelForSequenceClassification with num_labels=1, which configures the model with a single-output regression head rather than a multi-class classification head.
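The canonical call is AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1). The sketch below illustrates the same architecture by building a tiny GPT-2 classifier from a config instead, so it runs without downloading a checkpoint; the sizes are arbitrary:

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

# Tiny config so nothing is downloaded; in practice you would call
# AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1).
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, num_labels=1)
config.pad_token_id = config.eos_token_id  # GPT-2 defines no pad token

model = GPT2ForSequenceClassification(config)
print(model.score.out_features)  # 1: a scalar "score" head on top of the backbone

input_ids = torch.tensor([[10, 11, 12, 13]])
reward = model(input_ids=input_ids).logits  # (1, 1): one scalar reward
print(tuple(reward.shape))
```

The backbone keeps its pretrained-style architecture; only the `score` projection is new, which is exactly the layer reward model training exists to fit.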
The key insight is that a pretrained causal language model (e.g., GPT-2, LLaMA) is repurposed as a sequence classifier. The model's language understanding capabilities are preserved in the backbone, while a new randomly initialized linear layer (the "score" head) is added on top to project the final hidden state into a scalar reward value.
Usage
When a model path string is provided instead of a pre-instantiated model, RewardTrainer handles the loading internally: the create_model_from_path utility function orchestrates the loading with the AutoModelForSequenceClassification architecture class.
Theoretical Basis
Sequence Classification Head
The sequence classification architecture adds a linear projection layer on top of the language model backbone:
reward = W * h_last + b
where h_last is the hidden state of the last non-padding token and W is a learned weight matrix of shape (1, hidden_size) that projects the final hidden state to a single scalar. Setting num_labels=1 ensures the output is a single scalar rather than a vector of class logits.
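The projection and the last-non-padding-token selection can be sketched in plain PyTorch. The hidden states and token ids below are made up for illustration, and the sketch assumes right-padded sequences:

```python
import torch

torch.manual_seed(0)
hidden_size = 8
pad_token_id = 0

# Hypothetical final hidden states for 2 right-padded sequences of length 5.
hidden_states = torch.randn(2, 5, hidden_size)
input_ids = torch.tensor([[5, 6, 7, 0, 0],   # two trailing pad tokens
                          [5, 6, 7, 8, 9]])  # no padding

# Linear head projecting hidden_size -> 1, as with num_labels=1.
score = torch.nn.Linear(hidden_size, 1)

# Index of the last non-padding token in each sequence (right padding assumed).
seq_lengths = (input_ids != pad_token_id).sum(dim=1) - 1
h_last = hidden_states[torch.arange(2), seq_lengths]  # (2, hidden_size)
reward = score(h_last)                                # (2, 1): one scalar per sequence
print(seq_lengths.tolist())  # [2, 4]
print(tuple(reward.shape))   # (2, 1)
```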
Single-Label Regression
With num_labels=1, the model is configured for regression rather than classification. This means:
- No softmax is applied to the output.
- The raw logit value serves directly as the reward score.
- The model can output any real-valued scalar, which is essential for the Bradley-Terry preference model where reward differences determine preference probabilities.
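To see why an unbounded real-valued scalar matters, the Bradley-Terry preference probability can be written out in plain Python (a sketch; the function name is ours):

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

print(preference_probability(1.0, 1.0))         # 0.5 when rewards are equal
print(preference_probability(3.0, 1.0) > 0.5)   # True: higher reward is preferred
```

Only the difference of rewards enters the probability, so the raw logits need no normalization and may take any sign or magnitude.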
Weight Initialization
When loading a causal LM checkpoint as a sequence classifier, the classification head weights are randomly initialized because the pretrained checkpoint does not contain them. TRL suppresses the Transformers warning about these uninitialized weights since training the classification head is the explicit purpose of reward model training.
To ensure reproducibility of the random head initialization, set_seed is called before model loading. This guarantees that the same seed produces identical initial reward heads across runs.
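That guarantee can be sketched with transformers.set_seed and a plain linear layer standing in for the score head (the helper name here is ours, not TRL's):

```python
import torch
from transformers import set_seed

def init_head(hidden_size: int) -> torch.Tensor:
    # Stand-in for the randomly initialized score head weights.
    return torch.nn.Linear(hidden_size, 1).weight.detach().clone()

set_seed(42)
w1 = init_head(16)
set_seed(42)
w2 = init_head(16)
print(torch.equal(w1, w2))  # True: identical initialization under the same seed
```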
Pad Token Configuration
Sequence classification models require a pad_token_id to identify which token's hidden state to use for classification. The reward is computed from the last non-padding token in the sequence. TRL sets the pad token from the configuration, falling back to the EOS token if no pad token is defined.
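The fallback can be sketched with GPT-2's config, which defines an EOS token but no pad token. This is a sketch of the pattern, not TRL's exact code:

```python
from transformers import GPT2Config

config = GPT2Config()
# GPT-2 defines no pad token, so fall back to EOS, as TRL does.
if config.pad_token_id is None:
    config.pad_token_id = config.eos_token_id
print(config.pad_token_id)  # 50256, GPT-2's EOS token id
```

Without a pad_token_id, the model cannot tell padding from content and would score the wrong token's hidden state for padded batches.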