Principle:Huggingface Transformers Adapter Training

Knowledge Sources	LoRA, QLoRA, PEFT Docs, Transformers Docs
Domains	Parameter_Efficient_Fine_Tuning, NLP, Model_Training
Last Updated	2026-02-13 00:00 GMT

Overview

Adapter training is the process of optimizing only the injected adapter parameters while keeping the base model weights frozen, using standard gradient-based training with the Transformers Trainer.

Description

Once adapter layers have been injected into a base model, the model is ready for training. Adapter training follows the same optimization loop as standard fine-tuning, with one critical difference: only the adapter parameters (and optionally a small set of additional trainable modules like embed_tokens or lm_head) receive gradient updates.

The Transformers Trainer class handles adapter training transparently. When it detects a PEFT-wrapped model, it automatically:

Counts trainable parameters: Logs the number of trainable parameters, which for adapter models is a small fraction of the total
Handles FSDP integration: Applies special FSDP plugin configuration for PEFT models via update_fsdp_plugin_peft
Manages checkpointing: Saves only adapter weights during intermediate checkpoints (not the full model)
Applies label smoothing: Correctly unwraps the PEFT model to identify the base model name for loss computation
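To give a sense of scale for the trainable-parameter count mentioned above, the following rough sketch counts LoRA parameters by hand. It assumes a hypothetical 7B-class decoder (32 layers, hidden size 4096) with rank-16 LoRA on four square attention projections per layer. These figures and the helper name lora_pair_params are illustrative assumptions, not values read from a specific checkpoint or from the Trainer.

```python
def lora_pair_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


# Assumed, illustrative model shape (not taken from a real checkpoint).
hidden_size = 4096
num_layers = 32
rank = 16
targets_per_layer = 4  # e.g. the q, k, v, o attention projections

base_params = 7_000_000_000  # frozen base model, rounded
trainable = num_layers * targets_per_layer * lora_pair_params(hidden_size, hidden_size, rank)
fraction = trainable / (base_params + trainable)

print(f"trainable parameters: {trainable:,}")
print(f"share of all parameters: {100 * fraction:.2f}%")
```

Under these assumptions the adapters add about 16.8M parameters, well under 1% of the total, which is the kind of ratio the logged parameter count makes visible.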

Key training considerations for adapters:

Learning rate: Adapter parameters typically benefit from higher learning rates (1e-4 to 3e-4) compared to full fine-tuning (1e-5 to 5e-5), since there are fewer parameters to optimize
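Memory efficiency: Only adapter parameters require gradients and optimizer states (e.g., momentum, variance in AdamW), so that part of the memory footprint shrinks roughly in proportion to the trainable-parameter fraction, often by 100x or more compared to full fine-tuning. The frozen base weights and the activations must still fit in memory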
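Gradient checkpointing: Because the frozen base weights need no gradients or optimizer state, activations account for much of the remaining memory; combining gradient checkpointing with adapter training reduces that activation memory and enables training on much longer sequences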
Mixed precision: The base model can be in fp16/bf16 or even quantized (4-bit), while adapter computations may use a higher precision compute dtype

Usage

Use adapter training when you want to:

Fine-tune a large model on a downstream task with minimal GPU memory
Leverage the Transformers Trainer for PEFT model training with standard callbacks, logging, and evaluation
Train QLoRA models where the base is quantized and only adapters are optimized
Resume training from a checkpoint that includes adapter weights

Theoretical Basis

Adapter training optimizes the following objective:

min_{A, B} L(f(x; W, A, B), y)

where L is the task loss, f is the model's forward pass, W represents frozen base parameters, and A, B are the trainable adapter matrices. The gradient computation is:

dL/dA = dL/dŷ * dŷ/dA
dL/dB = dL/dŷ * dŷ/dB

where ŷ = f(x; W, A, B) is the model output. Both gradients are computed via backpropagation.

Backpropagation still passes through the frozen layers: gradients with respect to their activations must be computed so that the error signal reaches adapter parameters in earlier layers. Gradients with respect to W itself are neither computed nor stored, because those parameters have requires_grad=False, and the optimizer never updates W.
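The update rule can be illustrated with a small, dependency-free sketch. It assumes a single LoRA-style layer h = W x + B (A x) with a squared-error loss, plain gradient descent instead of AdamW, and hand-picked toy values. All names and numbers here are illustrative assumptions, and the hand-written gradients stand in for what an autograd engine would compute.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def transpose(M):
    return [list(col) for col in zip(*M)]


def loss_and_grads(W, A, B, x, y):
    """Loss L = 0.5 * ||h - y||^2 for h = W x + B (A x), with grads for A and B only."""
    Ax = matvec(A, x)
    h = [w + b for w, b in zip(matvec(W, x), matvec(B, Ax))]
    g = [hi - yi for hi, yi in zip(h, y)]       # dL/dh
    loss = 0.5 * sum(gi * gi for gi in g)
    grad_B = outer(g, Ax)                       # dL/dB = g (A x)^T
    grad_A = outer(matvec(transpose(B), g), x)  # dL/dA = (B^T g) x^T
    return loss, grad_A, grad_B


# Toy setup: frozen 2x2 base weight, rank-1 adapter, B initialised to zero.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.1]]
B = [[0.0], [0.0]]
x = [1.0, 2.0]
y = [2.0, 1.0]

W_before = [row[:] for row in W]
lr = 0.05
losses = []
for _ in range(200):
    loss, grad_A, grad_B = loss_and_grads(W, A, B, x, y)
    losses.append(loss)
    # Only the adapter matrices are updated; W is never touched.
    A = [[a - lr * ga for a, ga in zip(ra, rga)] for ra, rga in zip(A, grad_A)]
    B = [[b - lr * gb for b, gb in zip(rb, rgb)] for rb, rgb in zip(B, grad_B)]

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.2e}")
```

The sketch only ever forms gradients for A and B. The error signal g is propagated through the layer output, and W ends the loop exactly as it started while the loss still falls.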

The memory savings come from the per-parameter training state. With AdamW in fp32, each trainable parameter requires storing:

The parameter itself (4 bytes in fp32)
First moment estimate (4 bytes)
Second moment estimate (4 bytes)
Gradient (4 bytes)

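This totals 16 bytes per trainable parameter. For a 7B model with full fine-tuning, that is approximately 112 GB for weights, gradients, and optimizer states, before counting activations. With LoRA (rank 16, applied to attention layers), the trainable parameters may be only ~20M, requiring approximately 320 MB of training state, a 350x reduction. The frozen base weights must still be held in memory, but they need only their own storage: about 14 GB for 7B parameters in bf16, and less when quantized to 4-bit.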
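The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch that assumes fp32 training state, decimal units (1 GB = 10^9 bytes), and the rounded parameter counts used in the text; it ignores activations and framework overhead, and the function name is made up for illustration.

```python
# fp32 weight + gradient + AdamW first moment + AdamW second moment
BYTES_PER_TRAINABLE_PARAM = 4 + 4 + 4 + 4


def training_state_bytes(num_trainable_params):
    """Bytes of weight, gradient, and AdamW state for the trainable parameters."""
    return num_trainable_params * BYTES_PER_TRAINABLE_PARAM


full_ft = training_state_bytes(7_000_000_000)  # every parameter is trainable
lora = training_state_bytes(20_000_000)        # only the adapters are trainable
frozen_base_bf16 = 7_000_000_000 * 2           # frozen weights stored in bf16

print(f"full fine-tuning state: {full_ft / 1e9:.0f} GB")
print(f"LoRA adapter state:     {lora / 1e6:.0f} MB")
print(f"reduction:              {full_ft / lora:.0f}x")
print(f"frozen base (bf16):     {frozen_base_bf16 / 1e9:.0f} GB")
```

With these inputs the script prints 112 GB, 320 MB, 350x, and 14 GB, matching the figures in the text.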

Training convergence for adapter models is typically fast because:

The pretrained weights provide a strong initialization (the model already performs well)
The low-rank constraint acts as an implicit regularizer
The adapter parameters directly target the residual adaptation needed for the downstream task

Related Pages

Implemented By

Implementation:Huggingface_Transformers_Trainer_Train_For_PEFT
