Workflow:Huggingface Transformers PEFT Adapter Integration
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Training, PEFT |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
End-to-end process for adding, training, and managing Parameter-Efficient Fine-Tuning (PEFT) adapters on pretrained Transformer models using LoRA.
Description
This workflow covers the integration of PEFT adapters (primarily LoRA) with pretrained Transformers models. Instead of fine-tuning all model parameters, PEFT injects small trainable adapter matrices into frozen model layers, which often reduces the number of trainable parameters to less than 1% of the total, depending on the rank and the modules targeted. The process includes loading a base model, configuring and injecting LoRA adapters, training only the adapter weights, saving the lightweight adapter artifacts, and loading adapters for inference. Adapters can also be combined with a quantized base model to reduce memory use further.
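As a rough illustration of why the trainable fraction is so small, the sketch below counts the parameters LoRA adds to a single linear layer. The 4096 x 4096 layer size and the rank of 8 are assumptions chosen for the example, not values prescribed by this workflow.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns two small matrices, B (d_out x r) and A (r x d_in),
    while the original weight matrix stays frozen.
    """
    return r * (d_out + d_in)


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
full = 4096 * 4096
added = lora_param_count(4096, 4096, r=8)
fraction = added / full

print(f"{added:,} adapter params vs {full:,} frozen ({fraction:.2%})")
# prints: 65,536 adapter params vs 16,777,216 frozen (0.39%)
```

The fraction for a whole model is usually lower still, because only some layers receive adapters.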
Usage
Execute this workflow when you need to adapt a large pretrained model to a specific task or domain but have limited compute resources or storage. PEFT is ideal when you want to fine-tune multiple task-specific variants of the same base model without storing full copies of each. It is also the standard approach for QLoRA workflows where the base model is quantized.
Execution Steps
Step 1: Load Base Model
Load the pretrained base model that will serve as the foundation for adapter training. The model can be loaded at full precision or with quantization (see the Model Quantization workflow). The base model weights remain frozen during adapter training.
Key considerations:
- Use AutoModelForCausalLM.from_pretrained() or a task-specific Auto class
- For QLoRA, pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained()
- Model architecture determines which layers can receive adapters
- Gradient checkpointing can be enabled for additional memory savings
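A minimal sketch of this step follows. The helper name load_base_model and its flags are hypothetical, and the NF4 quantization type with bfloat16 compute is an assumed, commonly used QLoRA setting rather than something this workflow mandates. The 4-bit path additionally assumes bitsandbytes and a supported GPU are available.

```python
def load_base_model(model_id: str, use_4bit: bool = False, checkpointing: bool = True):
    """Load a causal LM, optionally 4-bit quantized for QLoRA (illustrative helper)."""
    # Deferred imports: the heavy dependencies load only when the function is called.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = {}
    if use_4bit:
        # Assumed QLoRA-style settings; adjust to the hardware in use.
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)

    if checkpointing:
        # Trades extra compute for lower activation memory during training.
        model.gradient_checkpointing_enable()
    return model
```

A call would look like load_base_model("some-org/some-model", use_4bit=True), where the model identifier is a placeholder.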
Step 2: Configure LoRA Adapter
Create a LoraConfig specifying the adapter hyperparameters: rank (r), alpha scaling factor, target modules, and dropout. The rank controls the size of the low-rank matrices; lower rank means fewer parameters but potentially lower quality adaptation.
Key considerations:
- r (rank) is typically 8-64; higher rank captures more complex adaptations
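- lora_alpha is a scaling factor, commonly set to 2x the rank; in standard LoRA the adapter update is scaled by lora_alpha / r
- target_modules can be left unset so that PEFT falls back to its defaults for known architectures, or specified explicitly (e.g., attention query/value projections)
- modules_to_save specifies additional layers (e.g., lm_head) that are trained in full and saved alongside the adapter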
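The configuration described above might be sketched as follows. The helper names are hypothetical, the module names q_proj and v_proj are an assumption that holds for Llama-style architectures but not for every model, and the dropout value is only an example.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def build_lora_config(r: int = 16, target_modules=("q_proj", "v_proj")):
    """Build a LoraConfig using the 'alpha = 2 x rank' rule of thumb."""
    # Deferred import so the helper above can be used without PEFT installed.
    from peft import LoraConfig

    return LoraConfig(
        r=r,
        lora_alpha=2 * r,                      # rule of thumb, not a requirement
        target_modules=list(target_modules),   # assumed names; model-specific
        lora_dropout=0.05,                     # example value
    )


print(lora_scaling(r=16, lora_alpha=32))  # prints: 2.0
```

Because the scaling depends on the ratio, keeping lora_alpha at 2x the rank holds the update scale constant when the rank is changed.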
Step 3: Inject Adapter into Model
Apply the LoRA configuration to the base model using model.add_adapter(). This injects trainable low-rank matrices alongside the frozen base weight matrices in the specified target modules. The model is now ready for adapter training.
Key considerations:
- add_adapter() modifies the model in place, wrapping target layers with adapter logic
- After injection, only adapter parameters have requires_grad=True
- Multiple named adapters can be added for multi-task scenarios
- Verify injection by checking for BaseTunerLayer instances in the model
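One way to perform and check the injection is sketched below. Both helper names are hypothetical; the model argument is assumed to be a Transformers model from Step 1 and lora_config a LoraConfig from Step 2.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a model."""
    trainable = total = 0
    for p in model.parameters():
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    return trainable, total


def inject_lora(model, lora_config, adapter_name: str = "default"):
    """Inject a LoRA adapter in place and return the number of wrapped layers."""
    # Deferred import: PEFT is only needed when the function is called.
    from peft.tuners.tuners_utils import BaseTunerLayer

    model.add_adapter(lora_config, adapter_name=adapter_name)

    # Each targeted layer is replaced by a module that subclasses BaseTunerLayer.
    return sum(isinstance(m, BaseTunerLayer) for m in model.modules())
```

Calling count_parameters(model) before and after inject_lora(...) gives a quick check that the trainable count dropped to the adapter parameters only.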
Step 4: Train Adapter
Train the model using the standard Trainer workflow (see Model Training with Trainer) or a custom training loop. Only adapter weights are updated during training; the base model remains frozen.
Key considerations:
- Training uses the standard Trainer API with no special configuration needed
- The learning rate may need adjustment (adapter training often uses a higher learning rate than full fine-tuning)
- Memory usage during training is significantly lower than full fine-tuning
- The Trainer automatically handles which parameters to optimize
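A hedged sketch of the training step is shown below. The helper name train_adapter is hypothetical, and the learning rate, epoch count, and batch size are placeholder values to tune, not recommendations from this workflow. The dataset is assumed to be already tokenized and to include labels.

```python
def train_adapter(model, train_dataset, output_dir: str,
                  learning_rate: float = 2e-4, data_collator=None):
    """Train the injected adapter with the standard Trainer (illustrative helper)."""
    # Deferred import: Transformers is only needed when the function is called.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=learning_rate,        # placeholder; tune per task
        num_train_epochs=1,                 # placeholder
        per_device_train_batch_size=4,      # placeholder
        logging_steps=10,
    )

    # Only parameters with requires_grad=True receive updates,
    # so the frozen base weights are left untouched.
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=data_collator,
    )
    trainer.train()
    return trainer
```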
Step 5: Save Adapter Weights
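Save only the adapter weights and configuration to disk or to the HuggingFace Hub. The saved artifacts contain only the LoRA matrices and configuration, not the full model weights, so they are typically measured in megabytes rather than gigabytes. Layers listed in modules_to_save are stored in full and increase the size accordingly.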
Key considerations:
- model.save_pretrained() saves adapter_model.safetensors and adapter_config.json
- Base model weights are NOT included in the saved artifacts
- Multiple adapters can be saved independently
- Adapter files are usually small enough to keep under version control
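The save step can be sketched as follows, together with a small helper for inspecting what was written. Both helper names are hypothetical, and the model is assumed to carry an adapter injected as in Step 3.

```python
import os


def adapter_artifacts(directory: str) -> dict:
    """Map each file in a directory to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    }


def save_adapter(model, directory: str) -> dict:
    """Save the adapter and report the files produced."""
    # With an adapter attached, save_pretrained() writes the adapter
    # weights and config instead of the full base model.
    model.save_pretrained(directory)
    return adapter_artifacts(directory)
```

The returned mapping is expected to list adapter_config.json and adapter_model.safetensors, which makes it easy to confirm that no base model weights were written.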
Step 6: Load Adapter for Inference
Load the base model and apply the saved adapter weights for inference. The adapter can be loaded from a local directory or directly from HuggingFace Hub. Multiple adapters can be swapped at inference time without reloading the base model.
Key considerations:
- Load base model first, then apply adapter with model.load_adapter()
- Alternatively, pointing AutoModelForCausalLM.from_pretrained() at a directory or Hub repository that contains adapter_config.json loads the base model and the adapter together
- model.enable_adapters() and model.disable_adapters() toggle adapter use
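- Adapter merging with merge_and_unload() permanently fuses the adapter into the base weights; this method belongs to PEFT's PeftModel wrapper (e.g., a model loaded with PeftModel.from_pretrained())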
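The loading options might look like the sketch below. The helper names, model identifiers, and adapter paths are all hypothetical placeholders. Merging is shown through PEFT's PeftModel wrapper, which is an assumption about how the merge step would be carried out in practice.

```python
def load_with_adapter(base_model_id: str, adapter_path: str, adapter_name: str = "default"):
    """Load the base model, attach a saved adapter, and make it active."""
    # Deferred import: Transformers is only needed when the function is called.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(base_model_id)
    model.load_adapter(adapter_path, adapter_name=adapter_name)
    model.set_adapter(adapter_name)  # select which loaded adapter is used
    return model


def merge_adapter(base_model_id: str, adapter_path: str):
    """Return a plain model with the adapter fused into the base weights."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    # merge_and_unload() folds the LoRA update into the base weights and
    # removes the adapter layers; the result no longer needs PEFT at inference.
    return PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
```

After load_with_adapter(...), calling model.disable_adapters() and model.enable_adapters() switches between base-model and adapted behavior without reloading any weights.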