Principle: Microsoft LoRA Parameter Freezing
| Knowledge Sources | |
|---|---|
| Domains | Training, Parameter_Efficient_Fine_Tuning |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Principle of selectively freezing pretrained model parameters so that only the low-rank LoRA matrices (and optionally biases) receive gradient updates during fine-tuning.
Description
After replacing target layers with their LoRA-augmented versions, the next critical step is to freeze all pretrained parameters and mark only the LoRA matrices as trainable. This ensures that the optimizer maintains state (momentum and variance in Adam) only for the small set of LoRA parameters, dramatically reducing GPU memory consumption. The frozen pretrained weights act as a fixed feature extractor while the LoRA matrices learn the task-specific adaptations.
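The freezing step can be sketched in plain Python, using a stand-in class for framework parameters so the logic is visible without a dependency on torch. In loralib this is done by mark_only_lora_as_trainable(model); the simplified rule below keys off the "lora_" substring used in loralib's lora_A / lora_B attribute names.

```python
class Param:
    """Stand-in for a framework parameter with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

def mark_only_lora_as_trainable(named_params):
    # Freeze every pretrained weight; leave only LoRA matrices trainable.
    # (This is bias mode "none"; the other modes additionally unfreeze biases.)
    for name, p in named_params.items():
        p.requires_grad = "lora_" in name

# Illustrative parameter names for one LoRA-augmented attention layer.
params = {
    "h.0.attn.weight": Param(),
    "h.0.attn.bias":   Param(),
    "h.0.attn.lora_A": Param(),
    "h.0.attn.lora_B": Param(),
}
mark_only_lora_as_trainable(params)
trainable = sorted(n for n, p in params.items() if p.requires_grad)
print(trainable)  # ['h.0.attn.lora_A', 'h.0.attn.lora_B']
```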
Usage
Apply parameter freezing immediately after constructing the model with LoRA layers and before creating the optimizer. This step is mandatory for achieving the memory savings that make LoRA practical. The bias handling mode should be chosen to match the checkpoint saving strategy.
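The required ordering can be illustrated with plain-Python stand-ins (Param and the parameter names are hypothetical). In real PyTorch code the last step is typically the idiom AdamW(p for p in model.parameters() if p.requires_grad), so the optimizer never allocates state for frozen weights.

```python
class Param:
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # everything trainable by default

# 1. Model constructed with LoRA-augmented layers (names illustrative).
params = [Param("attn.weight"), Param("attn.bias"),
          Param("attn.lora_A"), Param("attn.lora_B")]

# 2. Freeze all pretrained weights (bias mode "none").
for p in params:
    p.requires_grad = "lora_" in p.name

# 3. Only now hand the optimizer the surviving trainable subset, so Adam
#    moments are allocated for the LoRA matrices alone.
optimizer_params = [p.name for p in params if p.requires_grad]
print(optimizer_params)  # ['attn.lora_A', 'attn.lora_B']
```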
Theoretical Basis
Memory Reduction
The primary motivation for parameter freezing is memory efficiency. During training with Adam or AdamW, each trainable parameter requires storage for:
- The parameter itself (4 bytes for fp32)
- The gradient (4 bytes)
- The first moment estimate (4 bytes)
- The second moment estimate (4 bytes)
This means each trainable parameter consumes approximately 16 bytes of GPU memory. By freezing pretrained weights and only training LoRA parameters, the optimizer state memory is reduced proportionally to the ratio of LoRA parameters to total parameters.
For example, with GPT-2 (124M parameters) and LoRA rank r=4 applied to the attention projections:
- Full fine-tuning: ~124M trainable parameters, ~1.98 GB of per-parameter training state (weight, gradient, and Adam moments at 16 bytes each)
- LoRA fine-tuning: ~0.35M trainable parameters, ~5.6 MB of per-parameter training state
- Reduction: roughly 350x less trainable-parameter state
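These figures can be checked with back-of-envelope arithmetic under the 16-bytes-per-trainable-parameter accounting described above (fp32 weight, gradient, and the two Adam moments):

```python
# 4 bytes each for the fp32 weight, its gradient, and Adam's m and v moments.
BYTES_PER_TRAINABLE_PARAM = 4 + 4 + 4 + 4  # = 16

full_gb = 124e6 * BYTES_PER_TRAINABLE_PARAM / 1e9   # full fine-tuning
lora_mb = 0.35e6 * BYTES_PER_TRAINABLE_PARAM / 1e6  # LoRA fine-tuning
ratio = 124e6 / 0.35e6                              # trainable-parameter ratio

print(round(full_gb, 2), round(lora_mb, 1), round(ratio))  # 1.98 5.6 354
```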
Preventing Catastrophic Forgetting
Freezing pretrained weights also helps prevent catastrophic forgetting. Since the base model weights remain unchanged, the general knowledge captured during pretraining is preserved. The LoRA matrices only need to learn the delta required for the target task, resulting in more stable fine-tuning.
Bias Handling Modes
The parameter freezing mechanism supports three bias handling modes:
| Mode | Behavior | Use Case |
|---|---|---|
| none | Only LoRA matrices (lora_A, lora_B) are trainable; all biases are frozen | Maximum parameter reduction; default mode |
| all | LoRA matrices plus all bias parameters in the model are trainable | When bias tuning improves task performance |
| lora_only | LoRA matrices plus biases only in LoRA-augmented layers are trainable | Compromise between parameter count and flexibility |
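All three modes can be expressed with the same name-based rule. This is a simplified plain-Python sketch, not loralib's actual implementation: Param is a stand-in for a framework parameter, and the lora_only branch assumes biases live in the same dotted module path as the LoRA matrices.

```python
class Param:
    def __init__(self):
        self.requires_grad = True

def mark_only_lora_as_trainable(named_params, bias="none"):
    # Baseline for every mode: only LoRA matrices stay trainable.
    for name, p in named_params.items():
        p.requires_grad = "lora_" in name
    if bias == "all":
        # Additionally unfreeze every bias in the model.
        for name, p in named_params.items():
            if name.endswith(".bias"):
                p.requires_grad = True
    elif bias == "lora_only":
        # Additionally unfreeze biases only in modules that own a LoRA matrix.
        lora_modules = {n.rsplit(".", 1)[0] for n in named_params if "lora_" in n}
        for name, p in named_params.items():
            if name.endswith(".bias") and name.rsplit(".", 1)[0] in lora_modules:
                p.requires_grad = True

params = {
    "attn.weight": Param(), "attn.bias": Param(),
    "attn.lora_A": Param(), "attn.lora_B": Param(),
    "mlp.weight":  Param(), "mlp.bias":  Param(),
}

def trainable(bias):
    mark_only_lora_as_trainable(params, bias=bias)
    return sorted(n for n, p in params.items() if p.requires_grad)

print(trainable("none"))       # ['attn.lora_A', 'attn.lora_B']
print(trainable("all"))        # ['attn.bias', 'attn.lora_A', 'attn.lora_B', 'mlp.bias']
print(trainable("lora_only"))  # ['attn.bias', 'attn.lora_A', 'attn.lora_B']
```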
The bias mode must be consistent between parameter freezing (mark_only_lora_as_trainable) and checkpoint saving (lora_state_dict) to ensure correct save/load behavior.
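The saving side must apply the same rule, so that exactly the parameters that were trainable end up in the checkpoint. The sketch below mirrors the lora_state_dict name from loralib, but the body is a simplified reimplementation over a plain dict, under the same name-layout assumptions as above.

```python
def lora_state_dict(state, bias="none"):
    # Keep exactly the entries that mark_only_lora_as_trainable would have
    # left trainable under the same bias mode.
    if bias == "none":
        return {k: v for k, v in state.items() if "lora_" in k}
    if bias == "all":
        return {k: v for k, v in state.items()
                if "lora_" in k or k.endswith(".bias")}
    if bias == "lora_only":
        lora_modules = {k.rsplit(".", 1)[0] for k in state if "lora_" in k}
        return {k: v for k, v in state.items()
                if "lora_" in k
                or (k.endswith(".bias") and k.rsplit(".", 1)[0] in lora_modules)}
    raise ValueError(f"unknown bias mode: {bias}")

# Illustrative state dict (values stand in for tensors).
state = {"attn.weight": 0, "attn.bias": 0, "attn.lora_A": 0,
         "attn.lora_B": 0, "mlp.bias": 0}
saved = sorted(lora_state_dict(state, bias="lora_only"))
print(saved)  # ['attn.bias', 'attn.lora_A', 'attn.lora_B']
```

Loading then restores only these entries on top of the frozen pretrained checkpoint, which is why a mismatch between the freezing mode and the saving mode silently drops (or fails to train) bias parameters.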