
Principle:Huggingface Trl PEFT LoRA Configuration Reward

From Leeroopedia


Property         Value
Principle Name   PEFT LoRA Configuration for Reward
Technology       Huggingface TRL, PEFT
Category         Parameter-Efficient Fine-Tuning
Workflow         Reward Model Training
Implementation   Implementation:Huggingface_Trl_Get_Peft_Config_Reward

Overview

Description

When training reward models with limited compute or when preserving the base model's capabilities is important, LoRA (Low-Rank Adaptation) provides a parameter-efficient alternative to full fine-tuning. However, applying LoRA to reward model training requires careful task-specific configuration that differs from standard causal language modeling LoRA setups.

The critical distinction is the task_type parameter: reward models must use SEQ_CLS (sequence classification) rather than CAUSAL_LM (causal language modeling). Additionally, the classification head (typically named "score") must be included in modules_to_save to ensure it remains fully trainable, since LoRA adapters only modify existing weight matrices and cannot train randomly initialized heads.

Usage

The get_peft_config utility function creates a LoraConfig from the ModelConfig dataclass fields. The resulting PeftConfig is passed to the RewardTrainer via the peft_config parameter, where it is applied using get_peft_model during trainer initialization.
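A minimal, dependency-free sketch of how this plumbing fits together. The dataclass and helper below are illustrative stand-ins that only mirror the LoRA-related fields of TRL's ModelConfig and the behavior of get_peft_config; they are not the library API itself:

```python
from dataclasses import dataclass, field

# Illustrative stand-in mirroring the LoRA-related fields of TRL's ModelConfig;
# the real trl.get_peft_config builds a peft.LoraConfig from these fields.
@dataclass
class ModelConfigSketch:
    use_peft: bool = True
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    lora_task_type: str = "SEQ_CLS"  # reward models need SEQ_CLS, not CAUSAL_LM
    lora_modules_to_save: list = field(default_factory=lambda: ["score"])

def get_peft_config_sketch(cfg):
    """Mimics trl.get_peft_config: returns None when PEFT is disabled,
    otherwise the keyword arguments that would populate a peft.LoraConfig."""
    if not cfg.use_peft:
        return None
    return {
        "task_type": cfg.lora_task_type,
        "r": cfg.lora_r,
        "lora_alpha": cfg.lora_alpha,
        "lora_dropout": cfg.lora_dropout,
        "modules_to_save": cfg.lora_modules_to_save,
    }

print(get_peft_config_sketch(ModelConfigSketch())["task_type"])  # SEQ_CLS
```

In real code the resulting LoraConfig would be passed as the peft_config argument to RewardTrainer, which applies it with get_peft_model during initialization.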

Theoretical Basis

Task-Specific LoRA

LoRA decomposes weight updates into low-rank matrices:

W' = W + BA

where W has shape (d, k), B has shape (d, r), and A has shape (r, k), with rank r much smaller than both d and k. The task type determines how PEFT handles the model architecture:

  • SEQ_CLS: Configures LoRA for sequence classification models. PEFT expects the model to have a classification head and handles the forward pass accordingly, extracting the reward from the last token position.
  • CAUSAL_LM: Configures LoRA for autoregressive text generation. Using this task type for reward models would result in incorrect behavior because the model's output structure would be treated as next-token prediction logits rather than scalar rewards.
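A quick arithmetic check of why the low-rank decomposition is parameter-efficient (the layer dimensions below are illustrative, not tied to any particular model):

```python
# Parameter count for a LoRA update on a single weight matrix,
# following W' = W + BA with B: (d, r) and A: (r, k).
d, k, r = 4096, 4096, 16

full_update_params = d * k       # full fine-tuning of W
lora_params = d * r + r * k      # B plus A

print(full_update_params)                 # 16777216
print(lora_params)                        # 131072
print(lora_params / full_update_params)   # 0.0078125 -> under 1% of a full update
```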

Modules to Save

The modules_to_save parameter specifies model components that should be fully fine-tuned (not adapted via LoRA). For reward models, the "score" head must be listed because:

  • The score head is randomly initialized and needs full gradient updates to learn meaningful reward predictions.
  • LoRA adapters modify existing pretrained weights via low-rank updates, but they cannot serve as a replacement for training a new head from scratch.
  • Without including "score" in modules_to_save, the reward head would remain at its random initialization and produce meaningless outputs.
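The resulting trainability rule can be sketched as a predicate over parameter names. This is illustrative logic, not PEFT internals, but it captures which parameter groups receive gradients under LoRA:

```python
def is_trainable(param_name, modules_to_save=("score",)):
    """Sketch of the trainability rule under LoRA: only the injected
    lora_A/lora_B matrices and any module listed in modules_to_save
    receive gradient updates; all other base weights stay frozen."""
    if "lora_A" in param_name or "lora_B" in param_name:
        return True
    return any(m in param_name for m in modules_to_save)

print(is_trainable("model.layers.0.self_attn.q_proj.weight"))         # False (frozen base weight)
print(is_trainable("model.layers.0.self_attn.q_proj.lora_A.weight"))  # True  (LoRA adapter)
print(is_trainable("score.weight"))                                   # True  (fully trained head)
```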

SEQ_CLS vs CAUSAL_LM

Property   SEQ_CLS                              CAUSAL_LM
Output     Single scalar per sequence           Logits per token
Loss       Pairwise preference (Bradley-Terry)  Next-token cross-entropy
Head       Linear projection to 1 label         LM head (vocabulary projection)
Use case   Reward models, classifiers           Text generation, SFT
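The SEQ_CLS loss row can be made concrete: for a chosen/rejected pair, the Bradley-Terry objective is -log sigmoid(r_chosen - r_rejected). A stdlib-only sketch of this loss on scalar rewards:

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss on a preference pair:
    -log(sigmoid(r_chosen - r_rejected)). Small when the chosen
    response scores well above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_preference_loss(2.0, 0.0), 4))  # 0.1269 (clear preference, low loss)
print(round(pairwise_preference_loss(0.0, 0.0), 4))  # 0.6931 (no margin, -log 0.5)
```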

Gradient Checkpointing with PEFT

When combining LoRA with gradient checkpointing (which is enabled by default in RewardConfig), TRL explicitly calls model.enable_input_require_grads() to ensure gradients flow correctly through the PEFT adapter layers. This addresses a known interaction issue between Transformers gradient checkpointing and PEFT adapter training.
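The ordering matters: input gradients must be enabled before training begins, or checkpointed segments see inputs with requires_grad=False and the adapters receive no gradient. A toy sketch of that sequence, where DummyModel and prepare_for_peft are hypothetical stand-ins and only the method name enable_input_require_grads matches the real transformers API:

```python
class DummyModel:
    """Hypothetical stand-in for a transformers model."""
    def __init__(self):
        self.input_grads_enabled = False
        self.peft_config = None

    def enable_input_require_grads(self):
        # Same method name as transformers.PreTrainedModel; in the real
        # library this forces embedding outputs to require gradients.
        self.input_grads_enabled = True

def prepare_for_peft(model, peft_config, gradient_checkpointing=True):
    """Sketch of the order of operations when a peft_config is passed to
    RewardTrainer: input grads are enabled before the adapter is applied,
    so checkpointed segments still propagate gradients to LoRA layers."""
    if gradient_checkpointing:
        model.enable_input_require_grads()
    model.peft_config = peft_config  # stands in for wrapping with get_peft_model
    return model

model = prepare_for_peft(DummyModel(), {"task_type": "SEQ_CLS"})
print(model.input_grads_enabled)  # True
```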
