Principle:Huggingface Transformers Base Model Loading For PEFT
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Fine_Tuning, NLP, Model_Loading |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Loading a pretrained base model is the foundational step in any PEFT workflow, establishing the frozen backbone onto which lightweight adapter modules will be injected.
Description
Before any parameter-efficient fine-tuning can take place, a full pretrained model must be loaded into memory. This base model provides the dense pretrained weights that encode the knowledge learned during large-scale pretraining. In the PEFT paradigm, these base weights are typically frozen (not updated during training), and only the small adapter parameters are trained.
The loading step in a PEFT context has several unique considerations compared to standard model loading:
- Quantization: PEFT methods like QLoRA combine adapter training with weight quantization (4-bit or 8-bit) to dramatically reduce memory requirements. The base model must be loaded with a `quantization_config` (e.g., `BitsAndBytesConfig`) to enable this.
- Device placement: For large models, `device_map="auto"` distributes layers across the available GPUs and, where needed, CPU memory and disk, which is essential when the full model does not fit in a single GPU's memory.
- Dtype selection: Using `torch_dtype=torch.float16` or `torch_dtype=torch.bfloat16` halves the memory footprint relative to 32-bit floats while preserving adequate precision for adapter training.
- Adapter auto-detection: When the path provided to `from_pretrained` points to an adapter checkpoint rather than a base model, Transformers (with the PEFT library installed) detects the `adapter_config.json`, resolves the `base_model_name_or_path`, loads the base model first, and then attaches the adapter.
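The quantization, device placement, and dtype considerations above can be combined in a single loading call. The following is a minimal sketch, assuming `torch`, `transformers`, and `bitsandbytes` are installed and a CUDA GPU is available. The helper name `load_base_model`, the `model_id` argument, and the specific 4-bit settings (NF4, double quantization, bfloat16 compute) are illustrative choices rather than anything prescribed here. Some newer Transformers releases may prefer the argument name `dtype` over `torch_dtype`.

```python
def load_base_model(model_id: str, four_bit: bool = True):
    """Load a causal LM as a base for adapter training (illustrative sketch).

    `model_id` is a placeholder for any Hub ID or local path the caller supplies.
    """
    # Heavy imports live inside the function so that merely defining it
    # does not require a GPU stack to be installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    load_kwargs = {
        "device_map": "auto",           # spread layers over GPUs, then CPU/disk
        "torch_dtype": torch.bfloat16,  # 16-bit dtype for non-quantized modules
    }
    if four_bit:
        # Assumed QLoRA-style settings; adjust to the hardware and model at hand.
        load_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
    return AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
```

Passing `four_bit=False` skips quantization and loads the weights in bfloat16 only, which may be preferable when memory is not the limiting factor.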
The model loaded at this stage serves as the immutable foundation for all subsequent adapter operations: injection, training, saving, and inference.
Usage
Use base model loading for PEFT whenever you intend to:
- Fine-tune a large language model with LoRA, IA3, or other injectable adapter methods
- Perform QLoRA training by combining quantization with adapter injection
- Load a previously saved adapter by pointing `from_pretrained` at an adapter directory
- Deploy a model across multiple GPUs or with CPU offloading for inference with adapters
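The adapter-directory case can be pictured with a small, library-free sketch. It only mimics the idea of detecting `adapter_config.json` and reading `base_model_name_or_path` for a local directory. It is not the code Transformers itself runs, and the function name `resolve_base_model_id` and the placeholder ID `org/base-model` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def resolve_base_model_id(path: str) -> Optional[str]:
    """Return the base model recorded in a local adapter directory, or None."""
    config_file = Path(path) / "adapter_config.json"
    if not config_file.is_file():
        # No adapter config found: treat the path as a plain base model.
        return None
    with config_file.open(encoding="utf-8") as fh:
        adapter_config = json.load(fh)
    return adapter_config.get("base_model_name_or_path")


# Demo with throwaway directories: one adapter-like, one empty.
with tempfile.TemporaryDirectory() as adapter_dir, tempfile.TemporaryDirectory() as plain_dir:
    fake_config = {"peft_type": "LORA", "base_model_name_or_path": "org/base-model"}
    (Path(adapter_dir) / "adapter_config.json").write_text(
        json.dumps(fake_config), encoding="utf-8"
    )
    detected = resolve_base_model_id(adapter_dir)  # base model ID from the config
    not_adapter = resolve_base_model_id(plain_dir)  # None: no adapter config

print(detected, not_adapter)
```

In practice this resolution is handled by `from_pretrained` itself when it is given the adapter path, so user code rarely needs to read the config by hand.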
Theoretical Basis
The PEFT paradigm is grounded in the observation that pretrained model weights capture rich representations that do not need full retraining for downstream tasks. Instead of updating all N parameters of a model (often billions), PEFT methods add a small number of trainable parameters (typically less than 1% of N) while keeping the base weights frozen.
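This principle is loosely related to ideas such as the lottery ticket hypothesis, and rests more directly on empirical findings (for example, studies of the intrinsic dimensionality of fine-tuning and the low-rank assumption behind LoRA) that task-specific adaptations lie in a low-dimensional subspace of the full parameter space. By freezing the base model, we: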
- Preserve the general knowledge encoded during pretraining
- Reduce the number of trainable parameters and thus the GPU memory needed for gradients and optimizer states
- Enable multi-task serving by swapping lightweight adapters while sharing a single base model
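To make the "less than 1%" figure concrete, the sketch below counts the trainable parameters a LoRA adapter adds to one weight matrix. LoRA is used here only as a familiar example of an injectable adapter: for a frozen matrix of shape `d_out x d_in`, it trains two factors of shapes `d_out x r` and `r x d_in`. The dimensions (4096 and rank 8) are assumed for illustration.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Two low-rank factors: (d_out x rank) and (rank x d_in).
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters as a fraction of the frozen matrix's parameters."""
    return lora_param_count(d_out, d_in, rank) / (d_out * d_in)


frozen = 4096 * 4096                          # 16,777,216 frozen weights
added = lora_param_count(4096, 4096, rank=8)  # 65,536 trainable weights
print(f"{added} trainable vs {frozen} frozen "
      f"({trainable_fraction(4096, 4096, 8):.2%})")
```

This ratio is per adapted matrix. Because adapters are usually attached to only some of a model's matrices, the fraction over the whole model is typically lower still.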
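The loading step establishes this frozen backbone. When combined with quantization (e.g., NF4 quantization in QLoRA), the memory footprint of the base model weights can be reduced by roughly 4x relative to 16-bit storage (about 8x relative to 32-bit), which is what makes fine-tuning very large models practical on limited hardware. The QLoRA paper, for instance, reports fine-tuning a 65B-parameter model on a single 48 GB GPU.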
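The reduction factors can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, assuming 1 GB = 10^9 bytes and ignoring quantization constants, activations, adapter weights, and optimizer state, so real memory use will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_65B = 65e9  # a 65B-parameter model

fp32_gb = weight_memory_gb(PARAMS_65B, 32)  # 32-bit floats
fp16_gb = weight_memory_gb(PARAMS_65B, 16)  # float16 / bfloat16
nf4_gb = weight_memory_gb(PARAMS_65B, 4)    # 4-bit quantized weights

for name, gb in [("fp32", fp32_gb), ("fp16/bf16", fp16_gb), ("4-bit", nf4_gb)]:
    print(f"{name:>10}: {gb:6.1f} GB")
print(f"4-bit vs fp16: {fp16_gb / nf4_gb:.0f}x, vs fp32: {fp32_gb / nf4_gb:.0f}x")
```

Under these assumptions the 4-bit weights of a 65B model come to about 32.5 GB, compared with 130 GB at 16 bits and 260 GB at 32 bits.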