Principle: lm-sys FastChat LoRA Adapter Saving

From Leeroopedia


Knowledge Sources
Domains NLP, Training, Model Persistence
Last Updated 2026-02-07 14:00 GMT

Overview

Saving LoRA adapter weights separately from the frozen base model, with special handling for DeepSpeed ZeRO-3 parameter gathering and configurable bias extraction strategies.

Description

After LoRA training completes, only the small adapter weights need to be persisted -- the base model weights are unchanged and can be loaded independently from the original checkpoint. This principle covers the process of extracting, gathering, and saving these adapter parameters, with particular attention to the complexities introduced by DeepSpeed ZeRO-3 parameter partitioning.

The adapter saving process involves several key aspects:

  1. Separate Adapter Storage -- LoRA adapters are saved independently from the base model as adapter_model.bin (weights) and adapter_config.json (configuration). This separation means each adapter is typically only 10-50 MB for a 7B model, compared to 14+ GB for the full model.
  2. DeepSpeed ZeRO-3 Parameter Gathering -- Under ZeRO-3, model parameters are partitioned (sharded) across GPUs. To save the complete model state, these partitioned parameters must be gathered onto a single process. FastChat handles this through two mechanisms:
    • For ZeRO-3: Uses the DeepSpeed internal function _zero3_consolidated_16bit_state_dict() to gather all parameters (base + adapter) into a complete 16-bit state dict. PEFT's save_pretrained() then automatically extracts only the LoRA parameters.
    • For non-ZeRO-3 (including ZeRO-2): Uses a custom get_peft_state_maybe_zero_3() function to manually filter and gather only LoRA parameters.
  3. Parameter Gathering with maybe_zero_3 -- Individual parameters under ZeRO-3 have a ds_id attribute and a status of NOT_AVAILABLE (meaning the full parameter is not locally resident). The maybe_zero_3() helper uses zero.GatheredParameters() as a context manager to temporarily gather the full parameter, then detaches, moves to CPU, and clones it.
  4. Bias Handling Strategies -- The lora_bias argument controls which bias tensors are saved alongside the LoRA adapter weights:
    • "none" -- Only parameters containing "lora_" in their name are saved. No bias terms are included.
    • "all" -- All parameters containing "lora_" or "bias" in their name are saved.
    • "lora_only" -- LoRA parameters are saved, plus only those bias terms that belong to LoRA-modified layers (identified by matching the bias name prefix to a LoRA parameter name prefix).
  5. Rank-0 Only Saving -- The actual file write is gated by training_args.local_rank == 0, ensuring only the primary process writes to disk. This prevents file corruption from concurrent writes in distributed training.
  6. Post-Training Merge (apply_lora) -- After saving, the adapter can be merged back into the base model using fastchat/model/apply_lora.py. This script loads the base model, loads the LoRA adapter via PeftModel.from_pretrained(), calls merge_and_unload() to produce a single merged model, and saves the result. The merged model can then be served without the PEFT library.
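The filtering and gathering described in steps 2-4 can be sketched as follows. This follows the shape of FastChat's maybe_zero_3 and get_peft_state_maybe_zero_3 helpers but is a reconstruction, not the verbatim shipped code; the DeepSpeed import is deferred so the non-ZeRO path runs without DeepSpeed installed.

```python
import torch


def maybe_zero_3(param):
    """Gather a possibly ZeRO-3-partitioned parameter onto CPU."""
    if hasattr(param, "ds_id"):  # parameter is managed by DeepSpeed ZeRO-3
        from deepspeed import zero
        with zero.GatheredParameters([param]):
            return param.data.detach().cpu().clone()
    return param.detach().cpu().clone()


def get_peft_state_maybe_zero_3(named_params, bias):
    """Filter named parameters to the LoRA subset per the bias strategy."""
    named_params = list(named_params)
    if bias == "none":
        to_return = {k: t for k, t in named_params if "lora_" in k}
    elif bias == "all":
        to_return = {k: t for k, t in named_params
                     if "lora_" in k or "bias" in k}
    elif bias == "lora_only":
        to_return = {}
        lora_bias_names = set()
        maybe_lora_bias = {}
        for k, t in named_params:
            if "lora_" in k:
                to_return[k] = t
                # Bias of a LoRA-modified layer shares the layer's name prefix.
                lora_bias_names.add(k.split("lora_")[0] + "bias")
            elif "bias" in k:
                maybe_lora_bias[k] = t
        for k, t in maybe_lora_bias.items():
            if k in lora_bias_names:
                to_return[k] = t
    else:
        raise NotImplementedError(bias)
    return {k: maybe_zero_3(v) for k, v in to_return.items()}
```

Calling this with model.named_parameters() yields a CPU state dict containing only the adapter tensors, ready to pass to save_pretrained().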

Usage

Use this pattern when:

  • Saving LoRA adapter weights after fine-tuning for later inference or distribution.
  • Training with DeepSpeed ZeRO-3 where parameters are partitioned and need gathering.
  • Maintaining multiple task-specific adapters for the same base model.
  • Preparing adapters for merging with the base model via apply_lora.

Do not use this pattern when:

  • Performing full fine-tuning (save the entire model instead).
  • The base model weights have been modified (LoRA assumes frozen base weights).

Theoretical Basis

Adapter Weight Extraction: Given a LoRA-wrapped model with parameters theta = {W_0, B_i, A_i, bias_j}, the save process extracts only the adapter subset:

For bias="none":
    state_dict = {k: v for k, v in named_params if "lora_" in k}

For bias="all":
    state_dict = {k: v for k, v in named_params if "lora_" in k or "bias" in k}

For bias="lora_only":
    state_dict = {k: v for k, v in named_params if "lora_" in k}
    state_dict.update({k: v for k, v in named_params
                       if "bias" in k and corresponding_lora_layer_exists(k)})

ZeRO-3 Parameter Gathering: Under ZeRO-3, each parameter p of size N is partitioned across P processes, with each process holding N/P elements. To access the full parameter:

from deepspeed import zero

# Parameter is partitioned: p.ds_status == ZeroParamStatus.NOT_AVAILABLE
with zero.GatheredParameters([p]):
    # All-gather collects the N/P shard from each of the P processes;
    # the full parameter is temporarily available on this process
    p_full = p.data.detach().cpu().clone()
# After the context manager exits, the parameter is re-partitioned

Consolidated State Dict (ZeRO-3): The _zero3_consolidated_16bit_state_dict() method performs a bulk all-gather of all model parameters into FP16, which is more efficient than gathering parameters one at a time. PEFT's save_pretrained() then filters this complete state dict to extract only LoRA-related keys.
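Putting the two gathering mechanisms and the rank-0 gate together, the save path looks roughly like this. The trainer and training_args objects and the ZeRO-stage check are assumptions modeled on the description above (a HuggingFace/PEFT training setup), not the verbatim FastChat code; for brevity the non-ZeRO-3 branch implements only the bias="none" filter.

```python
# Rough sketch of the adapter save path (bias="none" case only).
# `model`, `trainer`, and `training_args` are assumed to come from a
# HuggingFace/PEFT training setup; this is not the verbatim FastChat code.

def save_lora_adapter(model, trainer, training_args, output_dir):
    engine = getattr(trainer, "deepspeed", None)  # None when DeepSpeed is unused
    if engine is not None and engine.zero_optimization_stage() == 3:
        # ZeRO-3: bulk all-gather of every parameter into a 16-bit state dict;
        # PEFT's save_pretrained() then keeps only the LoRA keys.
        state_dict = engine._zero3_consolidated_16bit_state_dict()
    else:
        # Non-ZeRO-3 (including ZeRO-2): manually filter to the LoRA tensors.
        state_dict = {k: p.detach().cpu().clone()
                      for k, p in model.named_parameters() if "lora_" in k}
    if training_args.local_rank == 0:
        # Rank-0-only write prevents concurrent writes from corrupting files.
        model.save_pretrained(output_dir, state_dict=state_dict)
```

The result is that only adapter_model.bin and adapter_config.json land in output_dir, regardless of which gathering path was taken.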

Output Files:

File                  Contents                                                 Typical Size (7B model, r=8)
adapter_model.bin     LoRA weight matrices (A and B for each target module)    ~17 MB
adapter_config.json   LoRA configuration (rank, alpha, target modules, etc.)   ~1 KB

Post-Merge via apply_lora: The merge operation combines the base weights with the adapter:

W_merged = W_0 + (alpha / r) * B @ A

After merging, the model has the same architecture as the original with no LoRA layers, enabling inference without the PEFT library.
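The merge identity can be checked numerically on toy shapes (the dimensions, alpha, and r below are arbitrary illustrative values):

```python
import torch

# Numeric check that the merged weight reproduces the base + LoRA forward pass.
torch.manual_seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 16
W0 = torch.randn(d_out, d_in)   # frozen base weight
A = torch.randn(r, d_in)        # LoRA down-projection
B = torch.randn(d_out, r)       # LoRA up-projection (nonzero after training)
x = torch.randn(d_in)

lora_out = W0 @ x + (alpha / r) * (B @ (A @ x))  # unmerged forward pass
W_merged = W0 + (alpha / r) * (B @ A)            # merged weight
assert torch.allclose(lora_out, W_merged @ x, atol=1e-5)
```

In practice, apply_lora performs this fold via PeftModel.from_pretrained() followed by merge_and_unload(), then calls save_pretrained() on the merged model.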

Related Pages

Implemented By
