Principle: Hugging Face PEFT Adapter Injection
Overview
Adapter Injection is the principle of modifying a pretrained model in-place by injecting small, trainable adapter layers into targeted modules while keeping the vast majority of original model weights frozen. This enables parameter-efficient fine-tuning, where only a tiny fraction of the total parameters (typically less than 1%) are updated during training, dramatically reducing memory requirements and training cost.
Description
Adapter injection addresses a fundamental problem in transfer learning: full fine-tuning of large language models is prohibitively expensive and requires storing a complete copy of model weights for each downstream task. By injecting lightweight adapter modules into a pretrained model, practitioners can adapt models to new tasks while:
- Preserving base model integrity -- The original pretrained weights remain frozen and unmodified, ensuring that the general knowledge captured during pretraining is retained.
- Minimizing trainable parameters -- Adapter layers introduce a small number of new parameters (often less than 1% of the base model), reducing GPU memory usage and accelerating training.
- Enabling multi-task deployment -- Multiple adapter sets can be saved independently and swapped onto the same base model at inference time, avoiding the need to store full model copies for each task.
The injection process works by identifying specific target modules within the pretrained model architecture (e.g., attention projection layers such as q_proj, v_proj) and replacing them with adapter-augmented versions that wrap or extend the original module behavior. For example, in LoRA, the original weight matrix W is augmented with a low-rank decomposition BA, where B and A are small trainable matrices.
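The LoRA update described above can be sketched in a few lines. This is a dependency-free toy (plain Python lists instead of tensors, and a hypothetical `lora_forward` helper), not the PEFT implementation, but it shows the exact arithmetic: the frozen path W(x) plus the scaled low-rank path B(A(x)).

```python
# Toy sketch of the LoRA forward pass: output = W(x) + scaling * B(A(x)).
# Matrices are plain lists of rows; real code uses torch tensors.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scaling=1.0):
    base = matvec(W, x)               # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + scaling * d for b, d in zip(base, delta)]

# Toy shapes: d_out = d_in = 2, rank r = 1 (A is r x d_in, B is d_out x r),
# so the update BA adds only d_out*r + r*d_in = 4 trainable parameters.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity for clarity)
A = [[1.0, 1.0]]               # 1 x 2, trainable
B = [[0.5], [0.5]]             # 2 x 1, trainable
print(lora_forward(W, A, B, [2.0, 3.0]))  # → [4.5, 5.5]
```

Because B is conventionally initialized to zero, the adapted model starts out computing exactly the same function as the base model.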
Usage
Adapter injection should be used when:
- Fine-tuning a large pretrained model on a downstream task with limited computational resources.
- Adapting a single base model to multiple tasks, where each task requires its own set of adapter weights.
- Experimenting rapidly with different adapter configurations (rank, target modules, adapter type).
- Deploying models in environments where storing multiple full model copies is impractical.
Theoretical Basis
The adapter injection process follows three core steps:
Target Module Discovery
The first step is identifying which modules within the pretrained model should receive adapter layers. This is typically specified via the target_modules parameter in the adapter configuration (e.g., LoraConfig). Target modules are usually the linear projection layers in transformer attention blocks (such as q_proj, k_proj, v_proj, o_proj), but can also include feedforward layers or any other linear module. The PEFT library traverses the model's module tree to locate all modules matching the specified names or patterns.
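The name-matching half of this discovery step can be sketched as follows. This is a simplified stand-in for PEFT's traversal (which also supports regex patterns): a module qualifies if its dotted name ends with one of the configured target names. The module names below are hypothetical examples of what `named_modules()` might yield for a LLaMA-style model.

```python
# Sketch of target-module matching: a module is selected when its full
# dotted name ends with one of the configured target_modules entries.
TARGET_MODULES = ["q_proj", "v_proj"]

def is_target(name, targets=TARGET_MODULES):
    return any(name == t or name.endswith("." + t) for t in targets)

# Hypothetical names, as a model's named_modules() traversal might produce.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.up_proj",
]
matched = [n for n in names if is_target(n)]
print(matched)  # only the q_proj and v_proj entries
```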
Layer Replacement
Once target modules are identified, each one is replaced in-place with an adapter-augmented version. The exact replacement depends on the adapter method:
- LoRA -- Wraps the original linear layer with a parallel low-rank branch. The forward pass computes output = W(x) + B(A(x)) * scaling, where W is the frozen original weight and A, B are the trainable low-rank matrices.
- IA3 -- Injects learned rescaling vectors that modulate activations.
- AdaLoRA -- Uses adaptive rank allocation across layers.
- Prompt Tuning / Prefix Tuning -- Prepends trainable virtual tokens to the input rather than modifying internal layers.
The replacement is performed by the tuner class (e.g., LoraModel), which is selected via PEFT_TYPE_TO_TUNER_MAPPING based on the configuration's peft_type.
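The in-place swap itself amounts to finding the parent of a matched module and reassigning the child attribute. The sketch below uses bare placeholder classes (`Module`, `LoraWrapper` are illustrative stand-ins, not PEFT classes) to show the mechanism without any framework dependency.

```python
# Sketch of in-place layer replacement: walk the dotted name to the parent
# module, then swap the child for an adapter-wrapped version that keeps the
# frozen original inside it.
class Module:
    pass  # stand-in for a neural-network module

class LoraWrapper:
    def __init__(self, base):
        self.base = base  # the frozen original layer is retained, not discarded

def replace_module(root, dotted_name, factory):
    *parent_names, child = dotted_name.split(".")
    parent = root
    for name in parent_names:
        parent = getattr(parent, name)
    setattr(parent, child, factory(getattr(parent, child)))

# Toy model tree: model.attn.q_proj
model = Module()
model.attn = Module()
model.attn.q_proj = Module()

replace_module(model, "attn.q_proj", LoraWrapper)
print(type(model.attn.q_proj).__name__)  # → LoraWrapper
```

Keeping the original layer inside the wrapper is what lets the frozen path W(x) still run in the forward pass, and lets the adapter be detached or merged later.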
Parameter Freezing
After injection, all original model parameters have their requires_grad attribute set to False, ensuring that only the newly injected adapter parameters receive gradient updates during training. This is the key mechanism that makes the approach parameter-efficient.
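The freezing step reduces to one loop over named parameters. The sketch below models parameters as tiny objects and uses a hypothetical "adapter parameters contain `lora_` in their name" convention (which matches how LoRA parameters are typically named) as the trainability test.

```python
# Sketch of parameter freezing: only parameters belonging to the injected
# adapter keep requires_grad=True; everything else is frozen.
class Param:
    def __init__(self):
        self.requires_grad = True

# Hypothetical named parameters after LoRA injection into q_proj.
params = {
    "model.layers.0.self_attn.q_proj.weight": Param(),        # original, freeze
    "model.layers.0.self_attn.q_proj.lora_A.weight": Param(), # adapter, train
    "model.layers.0.self_attn.q_proj.lora_B.weight": Param(), # adapter, train
}
for name, p in params.items():
    p.requires_grad = "lora_" in name  # adapter params stay trainable

trainable = [n for n, p in params.items() if p.requires_grad]
print(trainable)  # only the lora_A / lora_B parameters remain trainable
```

In real PEFT models, `print_trainable_parameters()` reports exactly this split, typically showing well under 1% of parameters as trainable.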
Task Type Routing
A factory pattern is used to select the appropriate PeftModel subclass based on the task_type specified in the adapter configuration. The MODEL_TYPE_TO_PEFT_MODEL_MAPPING dictionary maps task types to specialized wrapper classes:
| Task Type | PeftModel Subclass |
|---|---|
| CAUSAL_LM | PeftModelForCausalLM |
| SEQ_2_SEQ_LM | PeftModelForSeq2SeqLM |
| SEQ_CLS | PeftModelForSequenceClassification |
| TOKEN_CLS | PeftModelForTokenClassification |
| QUESTION_ANS | PeftModelForQuestionAnswering |
| FEATURE_EXTRACTION | PeftModelForFeatureExtraction |
Each subclass provides a task-specific forward() method that handles the appropriate inputs and outputs (e.g., labels, loss computation) for that task type.
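The factory pattern behind this routing can be sketched with placeholder classes (the class bodies here are empty stand-ins, not the real PEFT wrappers, and `get_peft_model_class` is a hypothetical helper):

```python
# Sketch of task-type routing: a dictionary maps the configured task_type
# string to a task-specific wrapper class, mirroring the role of
# MODEL_TYPE_TO_PEFT_MODEL_MAPPING in the PEFT library.
class PeftModelForCausalLM: ...
class PeftModelForSeq2SeqLM: ...
class PeftModelForSequenceClassification: ...

TASK_TYPE_MAPPING = {
    "CAUSAL_LM": PeftModelForCausalLM,
    "SEQ_2_SEQ_LM": PeftModelForSeq2SeqLM,
    "SEQ_CLS": PeftModelForSequenceClassification,
}

def get_peft_model_class(task_type):
    """Select the wrapper class for a given task type."""
    return TASK_TYPE_MAPPING[task_type]

print(get_peft_model_class("CAUSAL_LM").__name__)  # → PeftModelForCausalLM
```

The dispatch keeps task-specific concerns (label handling, loss computation) out of the injection machinery itself, so the same tuner logic serves every task type.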