Principle: Unslothai Unsloth Model Merging And Saving
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Serialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model serialization technique that merges trained LoRA adapter weights back into the base model and saves the result as a standalone model in SafeTensors format.
Description
After fine-tuning with LoRA, the model exists as a frozen base plus small adapter matrices. For deployment, these adapters must be merged into the base weights to produce a single, self-contained model. The merging process involves:
- Dequantization: If the base model was loaded in 4-bit, weights are dequantized back to float16 layer-by-layer to manage memory.
- LoRA Merge: For each adapted layer, compute W_merged = W_base + (lora_alpha / r) * (lora_B @ lora_A), folding the low-rank update into the base weight.
- Vocabulary Handling: If the vocabulary was resized during training (new tokens added), the embedding and output projection matrices are adjusted.
- Sharded Saving: The merged model is saved in SafeTensors format with configurable shard sizes.
The key challenge is memory management: a 7B-parameter model in float16 requires ~14 GB, and during merging the quantized and dequantized copies of a layer's weights must coexist temporarily. Unsloth bounds this by dequantizing layer-by-layer, with peak usage controlled by the maximum_memory_usage parameter; a sketch of this strategy follows.
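A minimal sketch of the layer-by-layer strategy, assuming a PEFT-style module layout where each adapted module exposes lora_A, lora_B, lora_alpha, and r as tensors/scalars. The helper dequantize_4bit_weight is a hypothetical stand-in for the backend-specific 4-bit dequantization routine, not Unsloth's internal API:

```python
import torch

def merge_lora_layerwise(model, dtype=torch.float16):
    """Merge LoRA adapters one layer at a time to bound peak memory."""
    for module in model.modules():
        if not hasattr(module, "lora_A"):
            continue
        # 1. Dequantize only this layer's base weight (4-bit -> float16),
        #    so quantized and dequantized copies coexist for one layer only.
        #    dequantize_4bit_weight is hypothetical, for illustration.
        w_base = dequantize_4bit_weight(module.weight).to(dtype)
        # 2. Apply the scaled low-rank update: W += (alpha / r) * B @ A.
        scaling = module.lora_alpha / module.r
        w_base += scaling * (module.lora_B @ module.lora_A).to(dtype)
        module.weight = torch.nn.Parameter(w_base, requires_grad=False)
        # 3. Free the adapter matrices before moving to the next layer.
        del module.lora_A, module.lora_B
        torch.cuda.empty_cache()
    return model
```

Sharded SafeTensors output can then reuse the standard Hugging Face mechanism, e.g. model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="5GB").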
Usage
Use this as the final step in any fine-tuning workflow to produce a deployable model. Choose save_method="merged_16bit" for GGUF conversion or general deployment, save_method="merged_4bit" for quantized deployment, or save_method="lora" to save adapters only.
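In Unsloth this is exposed through save_pretrained_merged, which takes the save_method values named above. A typical call, assuming model and tokenizer come from a completed fine-tuning session and the directory names are placeholders:

```python
# Merge LoRA into the base weights and save float16 SafeTensors shards
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")

# Alternatives:
model.save_pretrained_merged("merged_model_4bit", tokenizer, save_method="merged_4bit")
model.save_pretrained_merged("lora_adapters", tokenizer, save_method="lora")
```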
Theoretical Basis
The merge operation for each LoRA-adapted linear layer:
```python
# Abstract LoRA merge process (helper functions are placeholders)
for layer in model.layers:
    if has_lora(layer):
        # Dequantize the frozen base weight: 4-bit -> float16
        W_base = dequantize(layer.weight)
        # Scaled low-rank update: (alpha / r) * B @ A
        W_lora = (layer.lora_alpha / layer.r) * (layer.lora_B @ layer.lora_A)
        layer.weight = W_base + W_lora
        remove_lora(layer)  # clean up adapter matrices
save_safetensors(model, output_dir, shard_size="5GB")
```
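The merge is exact because matrix multiplication distributes over addition: for any input x, (W + (alpha/r) * B @ A) @ x equals W @ x + (alpha/r) * B @ (A @ x). A self-contained PyTorch check with toy shapes (all names local to the example):

```python
import torch

torch.manual_seed(0)
out_dim, in_dim, r, alpha = 8, 16, 4, 8

W = torch.randn(out_dim, in_dim)   # frozen base weight
A = torch.randn(r, in_dim)         # LoRA down-projection
B = torch.randn(out_dim, r)        # LoRA up-projection
x = torch.randn(in_dim)

scaling = alpha / r
adapter_out = W @ x + scaling * (B @ (A @ x))  # base + adapter at inference
merged_out = (W + scaling * (B @ A)) @ x       # single merged weight

# The merged model reproduces the adapted model's outputs exactly,
# up to floating-point rounding.
assert torch.allclose(adapter_out, merged_out, atol=1e-5)
```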