Principle: Microsoft LoRA Parameter Freezing
| Knowledge Sources | |
|---|---|
| Domains | Training, Parameter_Efficient_Fine_Tuning |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Principle of selectively freezing pretrained model parameters so that only the low-rank LoRA matrices (and optionally biases) receive gradient updates during fine-tuning.
Description
After replacing target layers with their LoRA-augmented versions, the next critical step is to freeze all pretrained parameters and mark only the LoRA matrices as trainable. This ensures that the optimizer maintains state (momentum and variance in Adam) only for the small set of LoRA parameters, dramatically reducing GPU memory consumption. The frozen pretrained weights act as a fixed feature extractor while the LoRA matrices learn the task-specific adaptations.
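The freezing step can be sketched in plain Python, using a stand-in class for framework parameters so the logic is visible without a dependency on torch. In loralib this is done by mark_only_lora_as_trainable(model); the simplified rule below keys off the "lora_" substring used in loralib's lora_A / lora_B attribute names.

```python
class Param:
    """Stand-in for a framework parameter with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

def mark_only_lora_as_trainable(named_params):
    # Freeze every pretrained weight; leave only LoRA matrices trainable.
    # (This is bias mode "none"; the other modes additionally unfreeze biases.)
    for name, p in named_params.items():
        p.requires_grad = "lora_" in name

# Illustrative parameter names for one LoRA-augmented attention layer.
params = {
    "h.0.attn.weight": Param(),
    "h.0.attn.bias":   Param(),
    "h.0.attn.lora_A": Param(),
    "h.0.attn.lora_B": Param(),
}
mark_only_lora_as_trainable(params)
trainable = sorted(n for n, p in params.items() if p.requires_grad)
print(trainable)  # ['h.0.attn.lora_A', 'h.0.attn.lora_B']
```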
Usage
Apply parameter freezing immediately after constructing the model with LoRA layers and before creating the optimizer. This step is mandatory for achieving the memory savings that make LoRA practical. The bias handling mode should be chosen to match the checkpoint saving strategy.
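The required ordering can be illustrated with plain-Python stand-ins (Param and the parameter names are hypothetical). In real PyTorch code the last step is typically the idiom AdamW(p for p in model.parameters() if p.requires_grad), so the optimizer never allocates state for frozen weights.

```python
class Param:
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # everything trainable by default

# 1. Model constructed with LoRA-augmented layers (names illustrative).
params = [Param("attn.weight"), Param("attn.bias"),
          Param("attn.lora_A"), Param("attn.lora_B")]

# 2. Freeze all pretrained weights (bias mode "none").
for p in params:
    p.requires_grad = "lora_" in p.name

# 3. Only now hand the optimizer the surviving trainable subset, so Adam
#    moments are allocated for the LoRA matrices alone.
optimizer_params = [p.name for p in params if p.requires_grad]
print(optimizer_params)  # ['attn.lora_A', 'attn.lora_B']
```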
Theoretical Basis
Memory Reduction
The primary motivation for parameter freezing is memory efficiency. During training with Adam or AdamW, each trainable parameter requires storage for:
- The parameter itself (4 bytes for fp32)
- The gradient (4 bytes)
- The first moment estimate (4 bytes)
- The second moment estimate (4 bytes)
This means each trainable parameter consumes approximately 16 bytes of GPU memory. By freezing pretrained weights and only training LoRA parameters, the optimizer state memory is reduced proportionally to the ratio of LoRA parameters to total parameters.
For example, with GPT-2 (124M parameters) and LoRA rank r=4 applied to the attention projections:
- Full fine-tuning: ~124M trainable parameters, ~1.98 GB of per-parameter training state (weight, gradient, and Adam moments at 16 bytes each)
- LoRA fine-tuning: ~0.35M trainable parameters, ~5.6 MB of per-parameter training state
- Reduction: roughly 350x less trainable-parameter state
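These figures can be checked with back-of-envelope arithmetic under the 16-bytes-per-trainable-parameter accounting described above (fp32 weight, gradient, and the two Adam moments):

```python
# 4 bytes each for the fp32 weight, its gradient, and Adam's m and v moments.
BYTES_PER_TRAINABLE_PARAM = 4 + 4 + 4 + 4  # = 16

full_gb = 124e6 * BYTES_PER_TRAINABLE_PARAM / 1e9   # full fine-tuning
lora_mb = 0.35e6 * BYTES_PER_TRAINABLE_PARAM / 1e6  # LoRA fine-tuning
ratio = 124e6 / 0.35e6                              # trainable-parameter ratio

print(round(full_gb, 2), round(lora_mb, 1), round(ratio))  # 1.98 5.6 354
```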
Preventing Catastrophic Forgetting
Freezing pretrained weights also helps prevent catastrophic forgetting. Since the base model weights remain unchanged, the general knowledge captured during pretraining is preserved. The LoRA matrices only need to learn the delta required for the target task, resulting in more stable fine-tuning.
Bias Handling Modes
The parameter freezing mechanism supports three bias handling modes:
| Mode | Behavior | Use Case |
|---|---|---|
| none | Only LoRA matrices (lora_A, lora_B) are trainable; all biases are frozen | Maximum parameter reduction; default mode |
| all | LoRA matrices plus all bias parameters in the model are trainable | When bias tuning improves task performance |
| lora_only | LoRA matrices plus biases only in LoRA-augmented layers are trainable | Compromise between parameter count and flexibility |
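All three modes can be expressed with the same name-based rule. This is a simplified plain-Python sketch, not loralib's actual implementation: Param is a stand-in for a framework parameter, and the lora_only branch assumes biases live in the same dotted module path as the LoRA matrices.

```python
class Param:
    def __init__(self):
        self.requires_grad = True

def mark_only_lora_as_trainable(named_params, bias="none"):
    # Baseline for every mode: only LoRA matrices stay trainable.
    for name, p in named_params.items():
        p.requires_grad = "lora_" in name
    if bias == "all":
        # Additionally unfreeze every bias in the model.
        for name, p in named_params.items():
            if name.endswith(".bias"):
                p.requires_grad = True
    elif bias == "lora_only":
        # Additionally unfreeze biases only in modules that own a LoRA matrix.
        lora_modules = {n.rsplit(".", 1)[0] for n in named_params if "lora_" in n}
        for name, p in named_params.items():
            if name.endswith(".bias") and name.rsplit(".", 1)[0] in lora_modules:
                p.requires_grad = True

params = {
    "attn.weight": Param(), "attn.bias": Param(),
    "attn.lora_A": Param(), "attn.lora_B": Param(),
    "mlp.weight":  Param(), "mlp.bias":  Param(),
}

def trainable(bias):
    mark_only_lora_as_trainable(params, bias=bias)
    return sorted(n for n, p in params.items() if p.requires_grad)

print(trainable("none"))       # ['attn.lora_A', 'attn.lora_B']
print(trainable("all"))        # ['attn.bias', 'attn.lora_A', 'attn.lora_B', 'mlp.bias']
print(trainable("lora_only"))  # ['attn.bias', 'attn.lora_A', 'attn.lora_B']
```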
The bias mode must be consistent between parameter freezing (mark_only_lora_as_trainable) and checkpoint saving (lora_state_dict) to ensure correct save/load behavior.
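The saving side must apply the same rule, so that exactly the parameters that were trainable end up in the checkpoint. The sketch below mirrors the lora_state_dict name from loralib, but the body is a simplified reimplementation over a plain dict, under the same name-layout assumptions as above.

```python
def lora_state_dict(state, bias="none"):
    # Keep exactly the entries that mark_only_lora_as_trainable would have
    # left trainable under the same bias mode.
    if bias == "none":
        return {k: v for k, v in state.items() if "lora_" in k}
    if bias == "all":
        return {k: v for k, v in state.items()
                if "lora_" in k or k.endswith(".bias")}
    if bias == "lora_only":
        lora_modules = {k.rsplit(".", 1)[0] for k in state if "lora_" in k}
        return {k: v for k, v in state.items()
                if "lora_" in k
                or (k.endswith(".bias") and k.rsplit(".", 1)[0] in lora_modules)}
    raise ValueError(f"unknown bias mode: {bias}")

# Illustrative state dict (values stand in for tensors).
state = {"attn.weight": 0, "attn.bias": 0, "attn.lora_A": 0,
         "attn.lora_B": 0, "mlp.bias": 0}
saved = sorted(lora_state_dict(state, bias="lora_only"))
print(saved)  # ['attn.bias', 'attn.lora_A', 'attn.lora_B']
```

Loading then restores only these entries on top of the frozen pretrained checkpoint, which is why a mismatch between the freezing mode and the saving mode silently drops (or fails to train) bias parameters.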