Principle:Huggingface Trl PEFT LoRA Configuration Reward
| Property | Value |
|---|---|
| Principle Name | PEFT LoRA Configuration for Reward |
| Technology | Huggingface TRL, PEFT |
| Category | Parameter-Efficient Fine-Tuning |
| Workflow | Reward Model Training |
| Implementation | Implementation:Huggingface_Trl_Get_Peft_Config_Reward |
Overview
Description
When training reward models with limited compute or when preserving the base model's capabilities is important, LoRA (Low-Rank Adaptation) provides a parameter-efficient alternative to full fine-tuning. However, applying LoRA to reward model training requires careful task-specific configuration that differs from standard causal language modeling LoRA setups.
The critical distinction is the task_type parameter: reward models must use SEQ_CLS (sequence classification) rather than CAUSAL_LM (causal language modeling). Additionally, the classification head (typically named "score") must be included in modules_to_save to ensure it remains fully trainable, since LoRA adapters only modify existing weight matrices and cannot train randomly initialized heads.
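A minimal sketch of such a configuration follows; the rank, alpha, and dropout values are illustrative choices, not prescribed ones.

```python
# Sketch of a reward-model LoRA configuration; r, lora_alpha, and
# lora_dropout are illustrative values, not recommendations.
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence classification, NOT CAUSAL_LM
    r=16,                         # low-rank dimension
    lora_alpha=32,                # scaling factor for the adapter update
    lora_dropout=0.05,
    modules_to_save=["score"],    # keep the reward head fully trainable
)
```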
Usage
The get_peft_config utility function creates a LoraConfig from the ModelConfig dataclass fields. The resulting PeftConfig is passed to the RewardTrainer via the peft_config parameter, where it is applied using get_peft_model during trainer initialization.
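A sketch of that wiring is below; the checkpoint name and dataset are placeholders, and the field names follow TRL's ModelConfig dataclass at the time of writing.

```python
# Sketch of passing get_peft_config output to RewardTrainer; the model
# name and preference_dataset are placeholders (assumed defined).
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import ModelConfig, RewardConfig, RewardTrainer, get_peft_config

model_args = ModelConfig(
    model_name_or_path="base-model",     # placeholder checkpoint
    use_peft=True,
    lora_task_type="SEQ_CLS",            # override the causal-LM default
    lora_modules_to_save=["score"],      # keep the reward head trainable
)

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_args.model_name_or_path, num_labels=1  # single scalar reward
)

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model"),
    train_dataset=preference_dataset,         # chosen/rejected pairs (assumed)
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),  # LoraConfig built from the fields above
)
trainer.train()
```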
Theoretical Basis
Task-Specific LoRA
LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
where B has shape (d, r) and A has shape (r, k) with rank r much smaller than d and k. The task type determines how PEFT handles the model architecture:
- SEQ_CLS: Configures LoRA for sequence classification models. PEFT expects the model to have a classification head and wraps the sequence-classification forward pass, in which the reward is pooled from the last non-padding token position.
- CAUSAL_LM: Configures LoRA for autoregressive text generation. Using this task type for reward models would result in incorrect behavior because the model's output structure would be treated as next-token prediction logits rather than scalar rewards.
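The parameter savings implied by the decomposition W' = W + BA can be checked with quick arithmetic; the layer dimensions below are illustrative.

```python
# Worked example of the low-rank update's size for one 4096x4096 layer.
d, k, r = 4096, 4096, 16

full_update = d * k          # parameters to fine-tune W directly
lora_update = d * r + r * k  # parameters in B (d x r) plus A (r x k)

print(full_update)                # 16777216
print(lora_update)                # 131072
print(lora_update / full_update)  # 0.0078125 -> under 1% of the full update
```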
Modules to Save
The modules_to_save parameter specifies model components that should be fully fine-tuned (not adapted via LoRA). For reward models, the "score" head must be listed because:
- The score head is randomly initialized and needs full gradient updates to learn meaningful reward predictions.
- LoRA adapters modify existing pretrained weights via low-rank updates, but they cannot serve as a replacement for training a new head from scratch.
- Without including "score" in modules_to_save, the reward head would remain at its random initialization and produce meaningless outputs.
SEQ_CLS vs CAUSAL_LM
| Property | SEQ_CLS | CAUSAL_LM |
|---|---|---|
| Output | Single scalar per sequence | Logits per token |
| Loss | Pairwise preference (Bradley-Terry) | Next-token cross-entropy |
| Head | Linear projection to 1 label | LM head (vocabulary projection) |
| Use case | Reward models, classifiers | Text generation, SFT |
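The output row of the table can be made concrete as tensor shapes; the batch, sequence, and vocabulary sizes below are arbitrary.

```python
# Output shapes for the two task types; sizes are arbitrary placeholders.
batch, seq_len, vocab_size = 4, 512, 32000

seq_cls_shape = (batch, 1)                      # one scalar reward per sequence
causal_lm_shape = (batch, seq_len, vocab_size)  # next-token logits at every position

print(seq_cls_shape)    # (4, 1)
print(causal_lm_shape)  # (4, 512, 32000)
```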
Gradient Checkpointing with PEFT
When combining LoRA with gradient checkpointing (which is enabled by default in RewardConfig), TRL explicitly calls model.enable_input_require_grads() to ensure gradients flow correctly through the PEFT adapter layers. This addresses a known interaction issue between Transformers gradient checkpointing and PEFT adapter training.
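The underlying issue can be illustrated with a minimal PyTorch sketch: a frozen Linear layer stands in for a frozen transformer block, and setting requires_grad on the input stands in for what enable_input_require_grads accomplishes in the real model via an embedding hook.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A frozen layer, as base weights are under LoRA.
layer = torch.nn.Linear(8, 8)
for p in layer.parameters():
    p.requires_grad_(False)

# With neither the input nor the weights requiring grad, the checkpointed
# segment is disconnected from autograd: y.requires_grad is False here,
# so gradients could never reach adapter layers stacked after it.
x = torch.randn(2, 8)
y = checkpoint(layer, x, use_reentrant=True)

# Marking the input as requiring grad (the effect of
# enable_input_require_grads) reconnects the recomputed graph.
x.requires_grad_(True)
y = checkpoint(layer, x, use_reentrant=True)
# y.requires_grad is now True, so backpropagation flows through the segment.
```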