Workflow:Huggingface Transformers PEFT Adapter Integration

Knowledge Sources	Huggingface Transformers, PEFT Library, PEFT Documentation
Domains	LLMs, Fine_Tuning, Training, PEFT
Last Updated	2026-02-13 20:00 GMT

Overview

End-to-end process for adding, training, and managing Parameter-Efficient Fine-Tuning (PEFT) adapters on pretrained Transformer models using LoRA.

Description

This workflow covers the integration of PEFT adapters (primarily LoRA) with pretrained Transformers models. Instead of fine-tuning all model parameters, PEFT injects small trainable adapter matrices into frozen model layers, which often reduces the number of trainable parameters to less than 1% of the total, depending on the rank and the modules targeted. The process includes loading a base model, configuring and injecting LoRA adapters, training only the adapter weights, saving the lightweight adapter artifacts, and loading adapters for inference. Adapters can be combined with quantized base models for maximum memory efficiency.
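As a rough illustration of where the small trainable fraction comes from, the sketch below counts the parameters LoRA adds to one linear layer: a rank-r adapter on a weight with d_in inputs and d_out outputs contributes r * (d_in + d_out) values. The model dimensions (hidden size 4096, 32 layers, two adapted projections per layer, about 7 billion base parameters) are hypothetical round numbers chosen for the example, not figures from this workflow.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical model shape, chosen only for illustration.
hidden_size = 4096
num_layers = 32
adapted_modules_per_layer = 2   # e.g. the query and value projections
rank = 8
base_params = 7_000_000_000     # assumed size of the frozen base model

adapter_params = (
    num_layers
    * adapted_modules_per_layer
    * lora_param_count(hidden_size, hidden_size, rank)
)
trainable_fraction = adapter_params / base_params

print(f"adapter parameters: {adapter_params:,}")
print(f"trainable fraction: {trainable_fraction:.4%}")
```

Under these assumptions the adapter adds about 4.2 million parameters, roughly 0.06% of the base model. Raising the rank or targeting more modules scales the count linearly.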

Usage

Execute this workflow when you need to adapt a large pretrained model to a specific task or domain but have limited compute resources or storage. PEFT is ideal when you want to fine-tune multiple task-specific variants of the same base model without storing full copies of each. It is also the standard approach for QLoRA workflows where the base model is quantized.

Execution Steps

Step 1: Load Base Model

Load the pretrained base model that will serve as the foundation for adapter training. The model can be loaded at full precision or with quantization (see Model Quantization workflow). The base model weights will remain frozen during adapter training.

Key considerations:

Use AutoModelForCausalLM.from_pretrained() or task-specific Auto class
For QLoRA, load with BitsAndBytesConfig(load_in_4bit=True)
Model architecture determines which layers can receive adapters
Gradient checkpointing can be enabled for additional memory savings
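The considerations above can be sketched as a small loader. This is a minimal example, not a definitive implementation: load_base_model is a hypothetical helper name, and to keep the sketch runnable offline it loads a tiny randomly initialized Llama checkpoint written to a temporary directory instead of a real pretrained model. The 4-bit branch is assumed to require the bitsandbytes package and a supported GPU, so it is not exercised here.

```python
import tempfile

from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    LlamaConfig,
    LlamaForCausalLM,
)


def load_base_model(model_id_or_path: str, use_4bit: bool = False):
    """Hypothetical helper: load a causal LM, optionally 4-bit quantized for QLoRA."""
    kwargs = {}
    if use_4bit:
        # Assumption: bitsandbytes is installed and a supported GPU is available.
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(model_id_or_path, **kwargs)
    # Optional: trade extra compute for lower activation memory during training.
    model.gradient_checkpointing_enable()
    return model


# Stand-in checkpoint: a tiny random Llama model, used only so the sketch
# runs without downloading anything. Replace with a real model ID in practice.
tiny_config = LlamaConfig(
    vocab_size=128,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
    max_position_embeddings=64,
)
with tempfile.TemporaryDirectory() as checkpoint_dir:
    LlamaForCausalLM(tiny_config).save_pretrained(checkpoint_dir)
    base_model = load_base_model(checkpoint_dir)

print(type(base_model).__name__, base_model.is_gradient_checkpointing)
```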

Step 2: Configure LoRA Adapter

Create a LoraConfig specifying the adapter hyperparameters: rank (r), alpha scaling factor (lora_alpha), target modules, and dropout. The rank controls the size of the low-rank matrices; a lower rank means fewer trainable parameters but may limit how well the adapter can fit the task.

Key considerations:

r (rank) is typically 8-64; higher rank captures more complex adaptations
lora_alpha scaling factor is usually set to 2x the rank
target_modules can be auto-detected or specified (e.g., attention query/value projections)
modules_to_save specifies layers whose full weights should also be saved (e.g., lm_head)

Step 3: Inject Adapter into Model

Apply the LoRA configuration to the base model using model.add_adapter(). This injects trainable low-rank matrices alongside the frozen base weight matrices in the specified target modules. The model is now ready for adapter training.

Key considerations:

add_adapter() modifies the model in-place, wrapping target layers with adapter logic
After injection, only adapter parameters (and any layers listed in modules_to_save) have requires_grad=True
Multiple named adapters can be added for multi-task scenarios
Verify injection by checking for BaseTunerLayer instances in the model

Step 4: Train Adapter

Train the model using the standard Trainer workflow (see Model Training with Trainer) or a custom training loop. Only adapter weights are updated during training; the base model remains frozen.

Key considerations:

Training uses the standard Trainer API with no special configuration needed
Learning rate may need adjustment (adapter training often uses higher LR than full fine-tuning)
Memory usage during training is significantly lower than full fine-tuning
The Trainer automatically handles which parameters to optimize

Step 5: Save Adapter Weights

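Save only the adapter weights and configuration to disk or the Hugging Face Hub. The saved artifacts are small (typically a few MB, though larger when the rank is high or when modules_to_save includes large layers such as lm_head) and contain only the LoRA matrices and configuration, not the full model weights.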

Key considerations:

model.save_pretrained() saves adapter_model.safetensors and adapter_config.json
Base model weights are NOT included in the saved artifacts
Multiple adapters can be saved independently
Adapter files are small enough to version control

Step 6: Load Adapter for Inference

Load the base model and apply the saved adapter weights for inference. The adapter can be loaded from a local directory or directly from the Hugging Face Hub. Multiple adapters can be swapped at inference time without reloading the base model.

Key considerations:

Load base model first, then apply adapter with model.load_adapter()
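Alternatively, AutoModelForCausalLM.from_pretrained() detects and loads an adapter automatically when pointed at a directory or Hub repo containing adapter_config.json
model.enable_adapters() and model.disable_adapters() toggle adapter use; model.set_adapter() selects which loaded adapter is active
Adapter merging (merge_and_unload()) permanently fuses the adapter into the base weights; this method belongs to the PEFT library's PeftModel wrapper, so load the adapter with PeftModel.from_pretrained() before calling it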
