Principle:Huggingface Transformers Adapter Training
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Fine_Tuning, NLP, Model_Training |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Adapter training is the process of optimizing only the injected adapter parameters while keeping the base model weights frozen, using standard gradient-based training with the Transformers Trainer.
Description
Once adapter layers have been injected into a base model, the model is ready for training. Adapter training follows the same optimization loop as standard fine-tuning, with one critical difference: only the adapter parameters (and optionally a small set of additional trainable modules like embed_tokens or lm_head) receive gradient updates.
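The selective update described above can be sketched in plain PyTorch, without PEFT or the Trainer. The `LowRankAdapterLinear` wrapper, its `adapter_A`/`adapter_B` names, and the layer sizes are hypothetical and chosen only for illustration; this is not how PEFT implements its layers, just a minimal model of the same idea: freeze the base, build the optimizer over trainable parameters only, and confirm that one step leaves the base weights untouched.

```python
import torch
from torch import nn


class LowRankAdapterLinear(nn.Module):
    """Hypothetical LoRA-style wrapper: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the base weights
        # A starts small and random, B starts at zero, so the wrapped layer
        # initially computes exactly what the base layer computes.
        self.adapter_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.adapter_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.adapter_A.T @ self.adapter_B.T


torch.manual_seed(0)
model = LowRankAdapterLinear(nn.Linear(64, 64), rank=4)

# Only parameters with requires_grad=True are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {n_trainable} / {n_total}")

optimizer = torch.optim.AdamW(trainable, lr=2e-4)

base_before = model.base.weight.clone()
B_before = model.adapter_B.detach().clone()

# One optimization step on random data (illustrative only).
x, y = torch.randn(8, 64), torch.randn(8, 64)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()

base_unchanged = torch.equal(model.base.weight, base_before)
adapter_changed = not torch.equal(model.adapter_B, B_before)
```

With a real PEFT-wrapped model the same filter on `requires_grad` applies; the Trainer builds its optimizer over the trainable parameters in the same spirit.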
The Transformers Trainer class handles adapter training transparently. When it detects a PEFT-wrapped model, it automatically:
- Counts trainable parameters: Logs the number of trainable parameters, which for adapter models is a small fraction of the total
- Handles FSDP integration: Applies special FSDP plugin configuration for PEFT models via update_fsdp_plugin_peft
- Manages checkpointing: Saves only adapter weights during intermediate checkpoints (not the full model)
- Applies label smoothing: Correctly unwraps the PEFT model to identify the base model name for loss computation
Key training considerations for adapters:
- Learning rate: Adapter parameters typically benefit from higher learning rates (1e-4 to 3e-4) compared to full fine-tuning (1e-5 to 5e-5), since there are fewer parameters to optimize
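- Memory efficiency: Only adapter parameters require gradients and optimizer states (e.g., the first and second moment estimates in AdamW), reducing that part of the memory footprint by 10-100x or more compared to full fine-tuning; the frozen base weights and activations must still fit in memory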
- Gradient checkpointing: Since the frozen base weights need no gradients or optimizer states, combining gradient checkpointing with adapter training makes much longer sequences feasible within the same memory budget
- Mixed precision: The base model can be in fp16/bf16 or even quantized (4-bit), while adapter computations may use a higher precision compute dtype
Usage
Use adapter training when you want to:
- Fine-tune a large model on a downstream task with minimal GPU memory
- Leverage the Transformers Trainer for PEFT model training with standard callbacks, logging, and evaluation
- Train QLoRA models where the base is quantized and only adapters are optimized
- Resume training from a checkpoint that includes adapter weights
Theoretical Basis
Adapter training optimizes the following objective:
min_{A, B} L(f(x; W, A, B), y)
where L is the task loss, f is the model's forward pass, y is the target, W represents frozen base parameters, and A, B are the trainable adapter matrices. Writing y_hat = f(x; W, A, B) for the model output, the gradients follow from the chain rule and are computed via backpropagation:
dL/dA = dL/dy_hat * dy_hat/dA
dL/dB = dL/dy_hat * dy_hat/dB
Gradients with respect to the activations must still flow backward through the frozen layers to reach adapter parameters in earlier layers, so the frozen weights W take part in the backward pass. However, no gradient with respect to W itself is computed or stored, and the optimizer never updates W, because those parameters have requires_grad=False.
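A small PyTorch sketch can make this gradient flow concrete. The two-layer setup, the tensor shapes, and the names `layer1`, `layer2`, `A`, and `B` are assumptions made for illustration. The adapter sits on the first layer, and the second layer is frozen and adapter-free, so any gradient reaching `A` and `B` has to pass through it.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two frozen "base" layers; only the first one gets a low-rank adapter.
layer1 = nn.Linear(8, 8, bias=False)
layer2 = nn.Linear(8, 1, bias=False)
for p in list(layer1.parameters()) + list(layer2.parameters()):
    p.requires_grad_(False)

# Trainable adapter matrices (B is nonzero here so both receive nonzero gradients).
A = nn.Parameter(torch.randn(2, 8) * 0.1)
B = nn.Parameter(torch.randn(8, 2) * 0.1)

x = torch.randn(4, 8)
h = layer1(x) + x @ A.T @ B.T   # frozen path plus low-rank update
h.retain_grad()                 # keep the activation gradient for inspection
out = layer2(h)                 # frozen layer downstream of the adapter
loss = out.pow(2).mean()
loss.backward()

# The activation gradient is formed using the frozen weight of layer2 ...
expected_h_grad = (2 * out.detach() / out.numel()) @ layer2.weight
h_grad_matches = torch.allclose(h.grad, expected_h_grad)

# ... but no gradient is stored for the frozen weights themselves.
frozen_have_no_grad = layer1.weight.grad is None and layer2.weight.grad is None
adapters_have_grad = bool(A.grad.abs().sum() > 0 and B.grad.abs().sum() > 0)
```

The frozen weight of `layer2` appears in `expected_h_grad`, which is the sense in which frozen layers participate in backpropagation even though they never accumulate a `.grad` of their own.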
The memory savings come from the per-parameter training state. With AdamW in fp32, each trainable parameter requires storing:
- The parameter itself (4 bytes in fp32)
- First moment estimate (4 bytes)
- Second moment estimate (4 bytes)
- Gradient (4 bytes)
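This totals 16 bytes per trainable parameter. For a 7B model with full fine-tuning, this requires approximately 112 GB for weights, gradients, and optimizer states combined. With LoRA (rank 16, applied to attention layers), the trainable parameters may be only ~20M, requiring approximately 320 MB of such state, a 350x reduction. The frozen base weights must still be held in memory, but they need no gradient or optimizer state and can be stored in lower precision.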
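The arithmetic above can be checked with a few lines of Python. The helper names are hypothetical, and the model shape used for the LoRA estimate (32 layers, four square 4096x4096 attention projections per layer) is an assumption resembling common 7B architectures, not something stated in this document.

```python
# fp32 weight + gradient + AdamW first moment + AdamW second moment
BYTES_PER_TRAINABLE_PARAM = 4 + 4 + 4 + 4


def training_state_bytes(n_trainable: int) -> int:
    """Bytes of per-parameter training state for n_trainable fp32 parameters."""
    return n_trainable * BYTES_PER_TRAINABLE_PARAM


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one low-rank pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


full_ft_bytes = training_state_bytes(7_000_000_000)  # full fine-tuning of a 7B model
lora_bytes = training_state_bytes(20_000_000)        # ~20M adapter parameters
reduction = full_ft_bytes / lora_bytes

# Assumed shape: 32 layers, 4 attention projections per layer, each 4096 x 4096.
estimated_lora_params = 32 * 4 * lora_param_count(4096, 4096, rank=16)

print(f"full fine-tuning: {full_ft_bytes / 1e9:.0f} GB")
print(f"LoRA:             {lora_bytes / 1e6:.0f} MB")
print(f"reduction:        {reduction:.0f}x")
print(f"estimated rank-16 adapter parameters: {estimated_lora_params:,}")
```

Under these assumptions the rank-16 estimate comes to about 16.8M parameters, the same order of magnitude as the ~20M figure quoted above.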
Training convergence for adapter models is typically fast because:
- The pretrained weights provide a strong initialization (the model already performs well)
- The low-rank constraint acts as an implicit regularizer
- The adapter parameters directly target the residual adaptation needed for the downstream task