Principle:Microsoft LoRA LoRA Checkpoint Saving

From Leeroopedia


Knowledge Sources
Domains Serialization, Parameter_Efficient_Fine_Tuning
Last Updated 2026-02-10 05:00 GMT

Overview

LoRA checkpoint saving stores only the LoRA parameters in a compact checkpoint, separate from the frozen pretrained base model weights.

Description

Because LoRA fine-tuning modifies only a small fraction of the model's parameters, saving the entire model state dict at every checkpoint is wasteful. A LoRA checkpoint instead contains only the LoRA matrices (lora_A, lora_B) and, optionally, bias parameters. The resulting files are hundreds of times smaller than a full checkpoint and can be stored, transmitted, and versioned independently of the base model.

Usage

Use LoRA checkpoint saving at every checkpoint interval during training and again at the end of training. To restore a checkpoint, first load the base pretrained model, then call load_state_dict with the LoRA state dict and strict=False to overlay the LoRA parameters.
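The saving schedule described above can be sketched as a small training loop. This is a minimal illustration, not the library's code: `TinyLoRALinear`, `lora_only`, `train`, and the `lora.<step>.pt` file naming are hypothetical names introduced here, and the filter covers only the case where no biases are saved.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyLoRALinear(nn.Module):
    """Hypothetical stand-in for a LoRA-augmented layer (not the library's class)."""

    def __init__(self, dim=8, r=2):
        super().__init__()
        # Frozen "pretrained" weight: excluded from training.
        self.weight = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(dim, r))

    def forward(self, x):
        return x @ (self.weight + self.lora_B @ self.lora_A).T


def lora_only(state_dict):
    # Keep only entries whose name contains "lora_" (no biases saved).
    return {k: v for k, v in state_dict.items() if "lora_" in k}


def train(model, total_steps, save_every, out_dir):
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=0.1)
    saved = []
    for step in range(1, total_steps + 1):
        loss = model(torch.randn(4, 8)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Save at every interval and once more at the final step.
        if step % save_every == 0 or step == total_steps:
            path = os.path.join(out_dir, f"lora.{step}.pt")
            torch.save(lora_only(model.state_dict()), path)
            saved.append(path)
    return saved


with tempfile.TemporaryDirectory() as tmp:
    paths = train(TinyLoRALinear(), total_steps=5, save_every=2, out_dir=tmp)
    saved_steps = [os.path.basename(p) for p in paths]
    saved_keys = sorted(torch.load(paths[-1]).keys())
```

Each file holds only the two LoRA matrices, so the frozen weight is never rewritten to disk during training.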

Theoretical Basis

Checkpoint Size Reduction

LoRA checkpoints are a small fraction of the size of full model checkpoints because only the low-rank matrices are saved:

| Model | Full Checkpoint Size | LoRA Checkpoint Size (r=4) | Reduction |
| --- | --- | --- | --- |
| GPT-2 Small (124M) | ~500 MB | ~1.4 MB | ~350x |
| GPT-2 Medium (355M) | ~1.4 GB | ~4.7 MB | ~300x |
| GPT-2 Large (774M) | ~3.1 GB | ~9.4 MB | ~330x |

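These sizes are approximate and assume LoRA is applied to the attention Q and V projections with rank r=4. Actual file sizes also depend on the parameter dtype and on whether bias parameters are saved.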
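For a rough cross-check, the number of LoRA values can be counted directly from the model shape. The sketch below assumes the published GPT-2 layer counts and hidden sizes, two adapted projections per layer, and 4 bytes per float32 value; `lora_param_count` and `GPT2_SHAPES` are names introduced here for illustration. It counts only the raw matrix entries, so it gives a lower bound on file size rather than a reproduction of the table's figures.

```python
def lora_param_count(num_layers, hidden, r=4, adapted_projections=2):
    # Each adapted projection adds lora_A (r x hidden) and lora_B (hidden x r).
    return num_layers * adapted_projections * r * (hidden + hidden)


# (num_layers, hidden size) for each GPT-2 variant; assumed from the published configs.
GPT2_SHAPES = {"small": (12, 768), "medium": (24, 1024), "large": (36, 1280)}

counts = {name: lora_param_count(*shape) for name, shape in GPT2_SHAPES.items()}

# Raw float32 storage in MB, ignoring serialization overhead and any saved biases.
raw_mb = {name: n * 4 / 1e6 for name, n in counts.items()}
```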

Multiple Adapters, One Base Model

A key advantage of compact LoRA checkpoints is that multiple task-specific adapters can share a single base model. For example, one pretrained GPT-2 Small checkpoint (~500 MB) can be combined with dozens of LoRA adapters (~1.4 MB each), one per task. This is far more storage-efficient than keeping a separate fully fine-tuned model for each task.
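The storage argument is simple arithmetic. The sketch below uses the approximate sizes quoted above as defaults; `storage_mb` is a hypothetical helper and the choice of 24 tasks is arbitrary.

```python
def storage_mb(num_tasks, base_mb=500.0, adapter_mb=1.4):
    """Return (shared-base storage, separate-full-models storage) in MB."""
    shared = base_mb + num_tasks * adapter_mb  # one base plus one adapter per task
    separate = num_tasks * base_mb             # one full fine-tuned model per task
    return shared, separate


shared, separate = storage_mb(24)
```

With two dozen tasks, the shared-base layout needs roughly 534 MB, against about 12,000 MB for separate full models.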

Bias Filtering Modes

The checkpoint saving function supports the same three bias modes as mark_only_lora_as_trainable to ensure consistency:

| Mode | Saved Parameters |
| --- | --- |
| none | Only lora_A and lora_B matrices |
| all | lora_A, lora_B, and all bias parameters in the model |
| lora_only | lora_A, lora_B, and biases from LoRA-augmented layers only |

Critical: The bias mode used for checkpoint saving must match the mode passed to mark_only_lora_as_trainable. If the saving mode is stricter than the training mode, trained bias parameters are left out of the checkpoint and lost. If the saving mode is more permissive, the checkpoint includes bias parameters that were never trained.
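The three modes can be illustrated with name-based filtering over a state dict. This is a sketch under the assumption that LoRA parameters are identified by "lora_" in their names and that a layer's bias shares the prefix of its LoRA matrices; `select_lora_entries` and the toy parameter names are hypothetical, and the integer values stand in for tensors.

```python
def select_lora_entries(state_dict, bias="none"):
    """Sketch of the three bias modes, applied to any name -> value mapping."""
    if bias == "none":
        return {k: v for k, v in state_dict.items() if "lora_" in k}
    if bias == "all":
        return {k: v for k, v in state_dict.items() if "lora_" in k or "bias" in k}
    if bias == "lora_only":
        selected = {}
        for k, v in state_dict.items():
            if "lora_" in k:
                selected[k] = v
                # The sibling bias shares the name prefix before "lora_".
                bias_name = k.split("lora_")[0] + "bias"
                if bias_name in state_dict:
                    selected[bias_name] = state_dict[bias_name]
        return selected
    raise ValueError(f"unknown bias mode: {bias!r}")


# Hypothetical parameter names: only the attention layer carries LoRA matrices.
toy_state = {
    "attn.weight": 0, "attn.bias": 1, "attn.lora_A": 2, "attn.lora_B": 3,
    "mlp.weight": 4, "mlp.bias": 5,
}
none_keys = sorted(select_lora_entries(toy_state, "none"))
all_keys = sorted(select_lora_entries(toy_state, "all"))
lora_only_keys = sorted(select_lora_entries(toy_state, "lora_only"))
```

In this toy example, "all" picks up mlp.bias while "lora_only" does not, which is exactly the difference that makes a save/train mode mismatch matter.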

Checkpoint Loading Pattern

LoRA checkpoints are loaded in two steps:

  1. Load the base pretrained model (full weights)
  2. Apply the LoRA state dict using load_state_dict(lora_dict, strict=False)

The strict=False flag is required because the LoRA state dict contains only a subset of the model's parameters. With the default strict=True, load_state_dict raises a RuntimeError listing the missing keys.
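The overlay step and the effect of the strict flag can be seen on a toy module. `TinyLoRAModel` is a hypothetical stand-in introduced for illustration, and the in-memory `lora_dict` takes the place of a checkpoint read from disk; in practice step 1 would load real pretrained weights.

```python
import torch
import torch.nn as nn


class TinyLoRAModel(nn.Module):
    """Hypothetical toy model: one base linear layer plus LoRA matrices."""

    def __init__(self, dim=8, r=2):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.lora_A = nn.Parameter(torch.zeros(r, dim))
        self.lora_B = nn.Parameter(torch.zeros(dim, r))


# Stand-in for a loaded LoRA checkpoint: it holds LoRA entries only.
lora_dict = {"lora_A": torch.ones(2, 8), "lora_B": torch.ones(8, 2)}

# Step 1: build the model (here its random init plays the role of the base weights).
model = TinyLoRAModel()

# Step 2: overlay the LoRA parameters; base weights are left untouched.
result = model.load_state_dict(lora_dict, strict=False)

# The default strict=True rejects the same dict because base.* keys are missing.
try:
    model.load_state_dict(lora_dict)
    strict_failed = False
except RuntimeError:
    strict_failed = True
```

Because strict=False also tolerates unexpected keys, it is worth checking that `result.unexpected_keys` is empty after loading; a non-empty list usually means the checkpoint's parameter names do not match the model.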
