Principle: Hugging Face PEFT Adapter Persistence
Overview
Adapter Persistence is the principle of saving and loading only the adapter weights separately from the base model weights. Because adapter parameters represent a tiny fraction of the total model (typically a few megabytes versus gigabytes for the full model), this approach enables efficient storage, sharing, and deployment of fine-tuned models without duplicating the base model for each task.
Description
When a model is fine-tuned using parameter-efficient methods such as LoRA, only the adapter layers receive gradient updates. Adapter persistence exploits this separation by saving only the modified adapter parameters and their associated configuration, rather than the entire model checkpoint. This provides several key advantages:
- Dramatic storage reduction -- A LoRA adapter for a 7B-parameter model may be only 10--50 MB, compared to 14+ GB for the full model weights in float16. This represents a reduction of 100x or more.
- Efficient sharing and distribution -- Adapter checkpoints can be uploaded to and downloaded from the HuggingFace Hub quickly, enabling easy collaboration and model sharing.
- Multi-adapter deployment -- Multiple adapters can be stored independently and loaded onto the same base model at runtime. This allows serving many task-specific variants from a single base model instance.
- Reproducibility -- The adapter configuration (saved as JSON) captures all hyperparameters, target modules, and the base model identity, ensuring that the adapter can be correctly reconstructed.
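The storage claim above can be sanity-checked with back-of-envelope arithmetic. The hidden size, layer count, and number of targeted projections below are illustrative assumptions for a 7B-class model, not measurements of any specific checkpoint:

```python
# Rough, illustrative numbers: a rank-8 LoRA adapter targeting four
# attention projections per layer of a 7B-class model.
hidden_size = 4096                     # assumed hidden dimension
num_layers = 32                        # assumed transformer depth
rank = 8                               # LoRA rank r
targets_per_layer = 4                  # e.g. q/k/v/o projections (assumption)

# Each adapted linear layer adds A (rank x hidden) and B (hidden x rank).
params_per_target = 2 * rank * hidden_size
adapter_params = params_per_target * targets_per_layer * num_layers

adapter_mb = adapter_params * 2 / 1e6  # float16: 2 bytes per parameter
full_model_gb = 7e9 * 2 / 1e9          # 7B parameters in float16
print(f"adapter: ~{adapter_mb:.0f} MB, full model: ~{full_model_gb:.0f} GB")
```

With these assumptions the adapter lands around 17 MB against roughly 14 GB for the full float16 weights, consistent with the 10--50 MB range cited above.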
Usage
Adapter persistence is used in the following scenarios:
- After training -- Save the trained adapter weights and configuration to disk or to the HuggingFace Hub for later use.
- For deployment -- Load a lightweight adapter onto a base model in a production environment, avoiding the need to deploy full fine-tuned model copies.
- For sharing -- Publish adapter weights to the HuggingFace Hub so others can apply them to the same base model.
- For multi-task serving -- Dynamically load and swap different adapters at inference time to serve multiple tasks from a single base model.
- For checkpoint management -- Save intermediate adapter checkpoints during training for evaluation or rollback purposes.
Theoretical Basis
Adapter persistence relies on two core mechanisms: state dict extraction for saving and state dict injection for loading.
State Dict Extraction (Saving)
When saving, the PEFT library extracts only the trainable adapter parameters from the model's full state dictionary. This is done via the get_peft_model_state_dict utility, which filters the model's state_dict() to include only keys that correspond to adapter layers. The extraction process:
- Iterates over the model's named parameters.
- Identifies parameters belonging to adapter modules (matched by naming conventions such as lora_A, lora_B, etc.).
- Optionally includes modified embedding layers if save_embedding_layers is enabled.
- Returns a filtered state dictionary containing only the adapter weights.
The filtered state dict is then serialized to disk in either safetensors format (the default, preferred for safety and speed) or standard PyTorch .bin format. Alongside the weights, the adapter configuration is saved as a JSON file (adapter_config.json), capturing all hyperparameters needed to reconstruct the adapter architecture.
Config Serialization
The adapter configuration (PeftConfig) is serialized to JSON and saved alongside the weights. This configuration includes:
- The PEFT method type (e.g., LORA, IA3, PREFIX_TUNING)
- Adapter hyperparameters (rank, alpha, dropout, etc.)
- Target module names
- The base model name or path
- The task type
- Inference mode flag
This ensures that loading code can reconstruct the exact same adapter architecture without requiring the user to manually specify configuration details.
State Dict Injection (Loading)
When loading a saved adapter, the process is reversed:
- The adapter configuration JSON is loaded and parsed into a PeftConfig object.
- A new PeftModel is created around the base model using this config, which injects empty adapter layers.
- The saved adapter weights are loaded from the safetensors or PyTorch file.
- The weights are injected into the appropriate adapter modules via set_peft_model_state_dict.
This two-phase approach (architecture creation followed by weight loading) ensures correct module placement even when adapter names or model architectures have minor variations.
Hub Integration
The PEFT library integrates with the HuggingFace Hub through the PushToHubMixin, allowing adapters to be:
- Pushed to the Hub via model.push_to_hub()
- Loaded from the Hub by specifying a Hub model ID as the model_id parameter in from_pretrained
This enables a workflow where base models are hosted centrally and task-specific adapters are distributed as lightweight artifacts.