
Principle:Microsoft LoRA Parameter Freezing

From Leeroopedia


Knowledge Sources
Domains Training, Parameter_Efficient_Fine_Tuning
Last Updated 2026-02-10 05:00 GMT

Overview

The principle of selectively freezing all pretrained model parameters so that only the low-rank LoRA matrices (and, optionally, bias terms) receive gradient updates during fine-tuning.

Description

After replacing the target layers with LoRA-augmented versions, the next critical step is to freeze all pretrained parameters and mark only the LoRA matrices as trainable. This ensures that the optimizer maintains state (Adam's first- and second-moment estimates) only for the small set of LoRA parameters, dramatically reducing GPU memory consumption. The frozen pretrained weights act as a fixed feature extractor while the LoRA matrices learn the task-specific adaptation.
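The freezing step described above can be sketched as a single pass over named parameters, keeping gradients only for names carrying the lora_ prefix. The toy LoRALinear layer and its sizes below are illustrative assumptions, not loralib's implementation; only the name-based freezing rule is the point:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy stand-in for a LoRA-augmented linear layer (illustrative only)."""
    def __init__(self, d_in: int, d_out: int, r: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in))  # "pretrained" weight
        self.lora_A = nn.Parameter(torch.zeros(r, d_in))      # low-rank factor A
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))     # low-rank factor B

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the low-rank update B @ A.
        return x @ (self.weight + self.lora_B @ self.lora_A).T

def mark_only_lora_as_trainable(model: nn.Module) -> None:
    # Freeze every parameter whose name lacks the "lora_" prefix.
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name

model = LoRALinear(768, 768)
mark_only_lora_as_trainable(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# trainable now contains only "lora_A" and "lora_B"; the base weight is frozen.
```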

Usage

Apply parameter freezing immediately after constructing the model with LoRA layers and before creating the optimizer. This step is mandatory for achieving the memory savings that make LoRA practical. The bias handling mode should be chosen to match the checkpoint saving strategy.
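The ordering above can be made concrete: freeze first, then construct the optimizer over only the still-trainable parameters, so Adam never allocates moment buffers for frozen tensors. A plain nn.Linear stands in for the LoRA-augmented model here; the sizes are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)             # stand-in for a LoRA-augmented model
model.weight.requires_grad = False  # step 1: freeze the "pretrained" weight

# Step 2: hand the optimizer only parameters that still require
# gradients, so no Adam state exists for the frozen weight.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

n_opt = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
# Only the 8 bias elements are tracked; the 64 frozen weight entries are not.
```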

Theoretical Basis

Memory Reduction

The primary motivation for parameter freezing is memory efficiency. During training with Adam or AdamW, each trainable parameter requires storage for:

  • The parameter itself (4 bytes for fp32)
  • The gradient (4 bytes)
  • The first moment estimate (4 bytes)
  • The second moment estimate (4 bytes)

This means each trainable parameter consumes approximately 16 bytes of GPU memory. By freezing pretrained weights and only training LoRA parameters, the optimizer state memory is reduced proportionally to the ratio of LoRA parameters to total parameters.

For example, with GPT-2 (124M parameters) and LoRA rank r=4 applied to attention projections:

  • Full fine-tuning: ~124M trainable parameters, ~1.98 GB of per-parameter training state (fp32 weight, gradient, and Adam moments at 16 bytes each)
  • LoRA fine-tuning: ~0.35M trainable parameters, ~5.6 MB of per-parameter training state
  • Reduction: roughly 350x less trainable-parameter state
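The figures above follow directly from the 16-byte-per-parameter accounting, using decimal GB/MB and the parameter counts from the example:

```python
BYTES_PER_TRAINABLE = 4 * 4      # fp32 param + gradient + two Adam moments

full_params = 124e6              # GPT-2 (124M), full fine-tuning
lora_params = 0.35e6             # LoRA r=4 on attention projections

full_gb = full_params * BYTES_PER_TRAINABLE / 1e9    # ~1.98 GB
lora_mb = lora_params * BYTES_PER_TRAINABLE / 1e6    # ~5.6 MB
reduction = full_params / lora_params                # ~354x
```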

Preventing Catastrophic Forgetting

Freezing pretrained weights also helps prevent catastrophic forgetting. Since the base model weights remain unchanged, the general knowledge captured during pretraining is preserved. The LoRA matrices only need to learn the delta required for the target task, resulting in more stable fine-tuning.

Bias Handling Modes

The parameter freezing mechanism supports three bias handling modes:

  • none: only the LoRA matrices (lora_A, lora_B) are trainable; all biases are frozen. Use case: maximum parameter reduction; this is the default mode.
  • all: the LoRA matrices plus every bias parameter in the model are trainable. Use case: when bias tuning improves task performance.
  • lora_only: the LoRA matrices plus the biases of LoRA-augmented layers are trainable. Use case: a compromise between parameter count and flexibility.

The bias mode must be consistent between parameter freezing (mark_only_lora_as_trainable) and checkpoint saving (lora_state_dict) to ensure correct save/load behavior.
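The save-time side of this consistency requirement can be sketched as a filter over the model's state dict, mirroring the lora_state_dict interface named above. The toy model and the name-based matching rules are illustrative assumptions, not loralib's exact code:

```python
import torch
import torch.nn as nn

def lora_state_dict(model: nn.Module, bias: str = "none") -> dict:
    """Filter a checkpoint down to the tensors trainable under the given bias mode."""
    state = model.state_dict()
    if bias == "none":
        return {k: v for k, v in state.items() if "lora_" in k}
    if bias == "all":
        return {k: v for k, v in state.items() if "lora_" in k or "bias" in k}
    if bias == "lora_only":
        out = {}
        for k, v in state.items():
            if "lora_" in k:
                out[k] = v
                # Also keep the bias of the same (LoRA-augmented) module, if any.
                bias_key = k.split("lora_")[0] + "bias"
                if bias_key in state:
                    out[bias_key] = state[bias_key]
        return out
    raise ValueError(f"unknown bias mode: {bias!r}")

class ToyLoRA(nn.Module):
    """Illustrative module: one frozen base layer plus loose LoRA factors."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4)                  # contributes base.bias
        self.lora_A = nn.Parameter(torch.zeros(2, 4))
        self.lora_B = nn.Parameter(torch.zeros(4, 2))

model = ToyLoRA()
none_keys = set(lora_state_dict(model, "none"))  # LoRA tensors only
all_keys = set(lora_state_dict(model, "all"))    # LoRA tensors plus all biases
```

Saving with one bias mode and freezing with another would either drop trained bias values from the checkpoint or store biases that were never updated, which is why the two calls must agree.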

Related Pages

Implemented By
