Principle: LMSYS FastChat LoRA Adapter Saving
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Model Persistence |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Saving LoRA adapter weights separately from the frozen base model, with special handling for DeepSpeed ZeRO-3 parameter gathering and configurable bias extraction strategies.
Description
After LoRA training completes, only the small adapter weights need to be persisted -- the base model weights are unchanged and can be loaded independently from the original checkpoint. This principle covers the process of extracting, gathering, and saving these adapter parameters, with particular attention to the complexities introduced by DeepSpeed ZeRO-3 parameter partitioning.
The adapter saving process involves several key aspects:
- Separate Adapter Storage -- LoRA adapters are saved independently from the base model as `adapter_model.bin` (weights) and `adapter_config.json` (configuration). This separation means each adapter is typically only 10-50 MB for a 7B model, compared to 14+ GB for the full model.
- DeepSpeed ZeRO-3 Parameter Gathering -- Under ZeRO-3, model parameters are partitioned (sharded) across GPUs. To save the complete model state, these partitioned parameters must be gathered onto a single process. FastChat handles this through two mechanisms:
  - For ZeRO-3: uses the DeepSpeed internal method `_zero3_consolidated_16bit_state_dict()` to gather all parameters (base + adapter) into a complete 16-bit state dict. PEFT's `save_pretrained()` then automatically extracts only the LoRA parameters.
  - For non-ZeRO-3 (including ZeRO-2): uses a custom `get_peft_state_maybe_zero_3()` function to manually filter and gather only LoRA parameters.
- Parameter Gathering with maybe_zero_3 -- Individual parameters under ZeRO-3 have a `ds_id` attribute and a status of `NOT_AVAILABLE` (meaning the full parameter is not locally resident). The `maybe_zero_3()` helper uses `zero.GatheredParameters()` as a context manager to temporarily gather the full parameter, then detaches it, moves it to CPU, and clones it.
- Bias Handling Strategies -- The `lora_bias` argument controls which bias tensors are saved alongside the LoRA adapter weights:
  - `"none"` -- Only parameters containing `"lora_"` in their name are saved. No bias terms are included.
  - `"all"` -- All parameters containing `"lora_"` or `"bias"` in their name are saved.
  - `"lora_only"` -- LoRA parameters are saved, plus only those bias terms that belong to LoRA-modified layers (identified by matching the bias name prefix to a LoRA parameter name prefix).
- Rank-0 Only Saving -- The actual file write is gated by `training_args.local_rank == 0`, ensuring only the primary process writes to disk. This prevents file corruption from concurrent writes in distributed training.
- Post-Training Merge (apply_lora) -- After saving, the adapter can be merged back into the base model using `fastchat/model/apply_lora.py`. This script loads the base model, loads the LoRA adapter via `PeftModel.from_pretrained()`, calls `merge_and_unload()` to produce a single merged model, and saves the result. The merged model can then be served without the PEFT library.
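The name-based filtering half of the process above (the three bias strategies) can be sketched without any DeepSpeed machinery. This is a simplified illustration, not FastChat's actual code: tensors are stood in for by plain floats, the ZeRO-3 gather step is omitted, and `select_adapter_params` is a hypothetical name.

```python
# Simplified sketch of the bias-strategy selection logic. Values here are
# plain floats standing in for tensors; the ZeRO-3 gather step is omitted.

def select_adapter_params(named_params, bias="none"):
    """Filter a name -> value mapping down to the keys that get saved."""
    if bias == "none":
        return {k: v for k, v in named_params.items() if "lora_" in k}
    if bias == "all":
        return {k: v for k, v in named_params.items() if "lora_" in k or "bias" in k}
    if bias == "lora_only":
        to_return = {k: v for k, v in named_params.items() if "lora_" in k}
        # Keep a bias only if some saved LoRA parameter shares its layer prefix.
        lora_prefixes = {k.split("lora_")[0] for k in to_return}
        for k, v in named_params.items():
            if "bias" in k and any(k.startswith(p) for p in lora_prefixes):
                to_return[k] = v
        return to_return
    raise ValueError(f"unknown bias mode: {bias}")


params = {
    "model.layers.0.q_proj.lora_A.weight": 0.1,
    "model.layers.0.q_proj.lora_B.weight": 0.2,
    "model.layers.0.q_proj.bias": 0.3,
    "model.layers.0.mlp.bias": 0.4,
}
print(sorted(select_adapter_params(params, bias="lora_only")))
```

With `bias="lora_only"`, the `q_proj` bias survives (its layer also carries LoRA matrices) while the `mlp` bias is dropped.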
Usage
Use this pattern when:
- Saving LoRA adapter weights after fine-tuning for later inference or distribution.
- Training with DeepSpeed ZeRO-3 where parameters are partitioned and need gathering.
- You want to maintain multiple task-specific adapters for the same base model.
- Preparing adapters for merging with the base model via `apply_lora`.
Do not use this pattern when:
- Performing full fine-tuning (save the entire model instead).
- The base model weights have been modified (LoRA assumes frozen base weights).
Theoretical Basis
Adapter Weight Extraction: Given a LoRA-wrapped model with parameters theta = {W_0, B_i, A_i, bias_j}, the save process extracts only the adapter subset:
For bias="none":
state_dict = {k: v for k, v in named_params if "lora_" in k}
For bias="all":
state_dict = {k: v for k, v in named_params if "lora_" in k or "bias" in k}
For bias="lora_only":
    state_dict = {k: v for k, v in named_params if "lora_" in k}
    state_dict.update({k: v for k, v in named_params
                       if "bias" in k and corresponding_lora_layer_exists(k)})
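The predicate `corresponding_lora_layer_exists` above is a hypothetical helper name (FastChat inlines this check rather than naming it); one way to realize it is prefix matching against the LoRA parameter names:

```python
# Hypothetical realization of the corresponding_lora_layer_exists() predicate
# from the pseudocode above: a bias belongs to a LoRA-modified layer if some
# LoRA parameter name shares the bias name's layer prefix.

def corresponding_lora_layer_exists(bias_name, all_param_names):
    prefix = bias_name.removesuffix("bias")  # e.g. "layers.0.q_proj."
    return any(n.startswith(prefix) and "lora_" in n for n in all_param_names)


names = [
    "layers.0.q_proj.lora_A.weight",
    "layers.0.q_proj.bias",
    "layers.0.mlp.up_proj.bias",
]
print(corresponding_lora_layer_exists("layers.0.q_proj.bias", names))      # True
print(corresponding_lora_layer_exists("layers.0.mlp.up_proj.bias", names))  # False
```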
ZeRO-3 Parameter Gathering: Under ZeRO-3, each parameter p of size N is partitioned across P processes, with each process holding N/P elements. To access the full parameter:
# Parameter is partitioned: p.ds_status == NOT_AVAILABLE
with zero.GatheredParameters([p]):
# All-gather operation collects N/P elements from each of P processes
# Full parameter is temporarily available on this process
p_full = p.data.detach().cpu().clone()
# After context manager exits, parameter is re-partitioned
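The partition/gather dance can also be illustrated without DeepSpeed at all. The following is a pure-Python simulation (an illustration only, not the DeepSpeed implementation) in which the all-gather collective is modeled by concatenating per-rank shards in rank order:

```python
# Pure-Python simulation of ZeRO-3 partitioning and gathering: a parameter of
# N elements is split into N/P shards, one per rank, and the gather is modeled
# as concatenating the shards back in rank order. Real DeepSpeed performs this
# with an all-gather collective on GPU tensors.

P = 4                          # number of processes (ranks)
full_param = list(range(16))   # N = 16 elements before partitioning

# Partition: rank r holds the contiguous slice [r*N/P, (r+1)*N/P).
shard_size = len(full_param) // P
shards = [full_param[r * shard_size:(r + 1) * shard_size] for r in range(P)]

# Each rank now holds only N/P = 4 elements locally.
assert all(len(s) == shard_size for s in shards)

# "Gather": concatenate all shards in rank order to rebuild the full parameter.
gathered = [x for shard in shards for x in shard]
print(gathered == full_param)  # True
```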
Consolidated State Dict (ZeRO-3): The `_zero3_consolidated_16bit_state_dict()` method performs a bulk all-gather of all model parameters into FP16, which is more efficient than gathering parameters one at a time. PEFT's `save_pretrained()` then filters this complete state dict to extract only LoRA-related keys.
Output Files:
| File | Contents | Typical Size (7B model, r=8) |
|---|---|---|
| `adapter_model.bin` | LoRA weight matrices (A and B for each target module) | ~17 MB |
| `adapter_config.json` | LoRA configuration (rank, alpha, target modules, etc.) | ~1 KB |
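The ~17 MB figure can be sanity-checked with back-of-the-envelope arithmetic. The settings below are assumptions typical of a LLaMA-7B setup (hidden size 4096, 32 layers, rank 8, targets q_proj and v_proj, FP32 storage), not values stated in this document:

```python
# Rough size estimate for a LoRA adapter, under assumed LLaMA-7B settings:
# hidden size 4096, 32 layers, rank r=8, two target modules per layer
# (q_proj, v_proj), weights stored in FP32 (4 bytes per parameter).

hidden, layers, r, targets, bytes_per_param = 4096, 32, 8, 2, 4

# Each adapted module adds A (r x hidden) and B (hidden x r): 2*r*hidden params.
params_per_module = 2 * r * hidden
total_params = params_per_module * targets * layers
size_mb = total_params * bytes_per_param / 1e6
print(f"{total_params:,} params, ~{size_mb:.1f} MB")  # 4,194,304 params, ~16.8 MB
```

This lands within rounding distance of the ~17 MB in the table; storing the adapter in FP16 would halve it.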
Post-Merge via `apply_lora`: The merge operation combines the base weights with the adapter:
W_merged = W_0 + (alpha / r) * B @ A
After merging, the model has the same architecture as the original with no LoRA layers, enabling inference without the PEFT library.
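As a tiny numeric illustration of the merge formula (made-up 2x2 matrices in plain Python, no PEFT; real training initializes B to zero and learns it):

```python
# Numeric check of W_merged = W_0 + (alpha / r) * B @ A with a rank-1 update.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W0 = [[1.0, 0.0],
      [0.0, 1.0]]        # frozen base weight (2x2)
B = [[1.0], [2.0]]       # (2x1) LoRA matrix, made-up values
A = [[0.5, 0.5]]         # (1x2) LoRA matrix, made-up values
alpha, r = 16, 1         # scaling factor = alpha / r

delta = matmul(B, A)     # (2x2) low-rank update B @ A
scale = alpha / r
W_merged = [[W0[i][j] + scale * delta[i][j] for j in range(2)] for i in range(2)]
print(W_merged)  # [[9.0, 8.0], [16.0, 17.0]]
```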