
Principle:Huggingface Transformers Adapter Training

From Leeroopedia
Knowledge Sources
Domains Parameter_Efficient_Fine_Tuning, NLP, Model_Training
Last Updated 2026-02-13 00:00 GMT

Overview

Adapter training is the process of optimizing only the injected adapter parameters while keeping the base model weights frozen, using standard gradient-based training with the Transformers Trainer.

Description

Once adapter layers have been injected into a base model, the model is ready for training. Adapter training follows the same optimization loop as standard fine-tuning, with one critical difference: only the adapter parameters (and optionally a small set of additional trainable modules like embed_tokens or lm_head) receive gradient updates.

The Transformers Trainer class handles adapter training transparently. When it detects a PEFT-wrapped model, it automatically:

  • Counts trainable parameters: Logs the number of trainable parameters, which for adapter models is a small fraction of the total
  • Handles FSDP integration: Applies special FSDP plugin configuration for PEFT models via update_fsdp_plugin_peft
  • Manages checkpointing: Saves only adapter weights during intermediate checkpoints (not the full model)
  • Applies label smoothing: Correctly unwraps the PEFT model to identify the base model name for loss computation

Key training considerations for adapters:

  • Learning rate: Adapter parameters typically benefit from higher learning rates (1e-4 to 3e-4) compared to full fine-tuning (1e-5 to 5e-5), since there are fewer parameters to optimize
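  • Memory efficiency: Only adapter parameters require gradients and optimizer states (e.g., the first- and second-moment estimates in AdamW), which cuts that part of the memory footprint by orders of magnitude compared to full fine-tuning; the frozen base weights and the activations still occupy memory
  • Gradient checkpointing: Because the frozen base weights need neither gradients nor optimizer states, activations account for much of the remaining memory; combining gradient checkpointing with adapter training makes longer sequences feasible at the cost of extra recomputation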
  • Mixed precision: The base model can be in fp16/bf16 or even quantized (4-bit), while adapter computations may use a higher precision compute dtype
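
The considerations above might translate into a configuration along the following lines. This is a sketch with illustrative values, not recommended settings: the keys are standard TrainingArguments fields, while the output directory name and every number are assumptions chosen for the example.

```python
# Illustrative keyword arguments for transformers.TrainingArguments.
# All values are assumptions for this sketch; tune them for your model and data.
adapter_training_kwargs = {
    "output_dir": "adapter-checkpoints",  # hypothetical directory name
    "learning_rate": 2e-4,                # inside the 1e-4 to 3e-4 range cited above
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,     # larger effective batch without more memory
    "gradient_checkpointing": True,       # recompute activations to save memory
    "bf16": True,                         # mixed precision; needs hardware support
    "num_train_epochs": 3,
    "logging_steps": 10,
    "save_strategy": "epoch",
}

# Sequences contributing to each optimizer step on one device.
effective_batch_size = (
    adapter_training_kwargs["per_device_train_batch_size"]
    * adapter_training_kwargs["gradient_accumulation_steps"]
)

# With transformers installed, the dictionary would be used as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**adapter_training_kwargs)
```

Keeping the settings in a plain dictionary makes them easy to log or override before they are passed to the Trainer.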

Usage

Use adapter training when you want to:

  • Fine-tune a large model on a downstream task with minimal GPU memory
  • Leverage the Transformers Trainer for PEFT model training with standard callbacks, logging, and evaluation
  • Train QLoRA models where the base is quantized and only adapters are optimized
  • Resume training from a checkpoint that includes adapter weights

Theoretical Basis

Adapter training optimizes the following objective:

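\min_{A, B} \; L\big(f(x; W, A, B),\, y\big)

where L is the task loss, f is the model's forward pass, W represents the frozen base parameters, A and B are the trainable adapter matrices, and y is the target. Writing \hat{y} = f(x; W, A, B) for the model output, the adapter gradients follow from the chain rule and are computed via backpropagation:

\frac{\partial L}{\partial A} = \frac{\partial L}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial A}, \qquad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial B}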

During backpropagation, gradients with respect to the activations still flow through the frozen layers, because they must reach the adapter parameters in earlier layers. Gradients with respect to W itself are neither computed nor stored, because those parameters have requires_grad=False, so W is never updated.
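
To make the update rule concrete, here is a minimal framework-free sketch for a single linear layer with a LoRA-style update, where the prediction is (W + B A) x and the loss is squared error. The helper names, shapes, and numbers are all assumptions made for illustration; this is not how Trainer or PEFT implement the step internally.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forward(W, A, B, x):
    """Return the prediction (W + B A) x and the adapter hidden vector h = A x."""
    h = matvec(A, x)
    pred = [base + delta for base, delta in zip(matvec(W, x), matvec(B, h))]
    return pred, h

def train_adapter(W, A, B, x, y, lr=0.05, steps=20):
    """Gradient descent on A and B only; W is read but never written."""
    losses = []
    for _ in range(steps):
        pred, h = forward(W, A, B, x)
        err = [p - t for p, t in zip(pred, y)]        # dL/d(pred) for squared error
        losses.append(0.5 * sum(e * e for e in err))
        grad_B = outer(err, h)                        # dL/dB = err h^T
        grad_A = outer(matvec(transpose(B), err), x)  # dL/dA = (B^T err) x^T
        # No gradient for W is formed or applied: it stays frozen.
        B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
        A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    return A, B, losses

W = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # frozen base weights (2 x 3)
A = [[0.1, 0.2, -0.1]]                  # trainable, rank 1 (1 x 3)
B = [[0.0], [0.0]]                      # trainable, zero-initialized (2 x 1)
x, y = [1.0, -2.0, 0.5], [1.0, 0.0]

W_before = [row[:] for row in W]
A_new, B_new, losses = train_adapter(W, A, B, x, y)
```

Because B starts at zero, the first step changes only B (the gradient for A is zero until B is nonzero), which mirrors the common LoRA initialization in which the adapted model initially equals the base model.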

The memory savings come from the per-parameter training state. With AdamW in fp32, each trainable parameter requires storing:

  • The parameter itself (4 bytes in fp32)
  • First moment estimate (4 bytes)
  • Second moment estimate (4 bytes)
  • Gradient (4 bytes)

This totals 16 bytes per trainable parameter, of which 8 bytes are optimizer state proper. For a 7B model with full fine-tuning, this amounts to approximately 112 GB of weights, gradients, and optimizer states, before counting activations. With LoRA (rank 16, applied to attention layers), the trainable parameters may be only ~20M, requiring approximately 320 MB for the same quantities, a 350x reduction. The frozen base weights must still be held in memory, but they need no gradients or optimizer states and can be stored in lower precision.
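
The arithmetic above can be checked with a short script. The 16-bytes-per-parameter figure and the parameter counts come from the text; the layer shapes used for the LoRA parameter count (hidden size 4096, 32 layers, four square attention projections per layer) are assumptions describing a hypothetical Llama-like 7B model.

```python
BYTES_PER_TRAINABLE_PARAM = 4 + 4 + 4 + 4  # fp32 weight, gradient, two AdamW moments

def training_state_bytes(num_trainable_params):
    """Bytes for weights, gradients, and AdamW moments of the trainable parameters."""
    return num_trainable_params * BYTES_PER_TRAINABLE_PARAM

def lora_param_count(hidden_size, rank, matrices_per_layer, num_layers):
    """Parameters in A (rank x hidden) plus B (hidden x rank) for square projections."""
    return rank * (hidden_size + hidden_size) * matrices_per_layer * num_layers

full_bytes = training_state_bytes(7_000_000_000)  # full fine-tuning of a 7B model
lora_bytes = training_state_bytes(20_000_000)     # ~20M trainable LoRA parameters

print(full_bytes / 1e9, "GB")        # 112.0 GB
print(lora_bytes / 1e6, "MB")        # 320.0 MB
print(full_bytes // lora_bytes, "x") # 350 x

# Hypothetical shapes: hidden size 4096, rank 16, 4 projections, 32 layers.
example_lora_params = lora_param_count(4096, 16, 4, 32)
```

Under these assumed shapes, example_lora_params is 16,777,216, which is the same order of magnitude as the ~20M figure used in the text.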

Training convergence for adapter models is typically fast because:

  1. The pretrained weights provide a strong initialization (the model already performs well)
  2. The low-rank constraint acts as an implicit regularizer
  3. The adapter parameters directly target the residual adaptation needed for the downstream task

Related Pages

Implemented By
