
Principle:Unslothai Unsloth Model Merging And Saving



Knowledge Sources
Domains: Model_Deployment, Serialization
Last Updated: 2026-02-07 00:00 GMT

Overview

A model serialization technique that merges trained LoRA adapter weights back into the base model and saves the result as a standalone model in SafeTensors format.

Description

After fine-tuning with LoRA, the model exists as a frozen base plus small adapter matrices. For deployment, these adapters must be merged into the base weights to produce a single, self-contained model. The merging process involves:

  1. Dequantization: If the base model was loaded in 4-bit, weights are dequantized back to float16 layer-by-layer to manage memory.
  2. LoRA Merge: For each adapted layer, compute $W_{\text{merged}} = W_{\text{base}} + \frac{\alpha}{r} B A$.
  3. Vocabulary Handling: If the vocabulary was resized during training (new tokens added), the embedding and output projection matrices are adjusted.
  4. Sharded Saving: The merged model is saved in SafeTensors format with configurable shard sizes (steps 3 and 4 are sketched in code after this list).
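
A minimal sketch of steps 3 and 4 using the standard Hugging Face transformers API (illustrative of the idea, not Unsloth's internals; the function and directory names are placeholders):

from transformers import PreTrainedModel, PreTrainedTokenizer

def finalize_merged_model(model: PreTrainedModel,
                          tokenizer: PreTrainedTokenizer,
                          out_dir: str = "merged_model") -> None:
    # Step 3: grow the embedding and output-projection matrices if
    # new tokens were added to the tokenizer during training.
    model.resize_token_embeddings(len(tokenizer))
    # Step 4: save as sharded SafeTensors; max_shard_size caps each shard.
    model.save_pretrained(out_dir, safe_serialization=True,
                          max_shard_size="5GB")
    tokenizer.save_pretrained(out_dir)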

The key challenge is memory management: a 7B-parameter model in float16 requires roughly 14 GB, and during merging the quantized and dequantized weights must temporarily coexist. Unsloth handles this with layer-by-layer dequantization, bounded by the maximum_memory_usage parameter.
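
The budgeting idea can be pictured roughly as follows. This is an illustrative sketch, not Unsloth's actual code: merge_layer stands for a hypothetical per-layer dequantize-and-merge step, and the default fraction shown is only an example.

import torch

def merge_under_budget(model, merge_layer, maximum_memory_usage=0.75):
    # Cap peak GPU usage at a fraction of total device memory.
    total = torch.cuda.get_device_properties(0).total_memory
    budget = int(maximum_memory_usage * total)
    for layer in model.layers:
        if torch.cuda.memory_allocated() > budget:
            # Over budget: move already-merged float16 weights to CPU
            # before dequantizing the next layer on the GPU.
            layer.to("cpu")
        merge_layer(layer)  # dequantize 4-bit weight, add LoRA update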

Usage

Use this as the final step in any fine-tuning workflow to produce a deployable model. Choose save_method="merged_16bit" for GGUF conversion or general deployment, save_method="merged_4bit" for quantized deployment, or save_method="lora" to save adapters only.
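
For example, with Unsloth's high-level saving API (a minimal sketch; the base model name and output directory are placeholders):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # placeholder
    load_in_4bit=True,
)
# ... attach LoRA adapters and fine-tune ...

# Merge adapters into the base weights, save as 16-bit SafeTensors.
model.save_pretrained_merged(
    "merged_model", tokenizer, save_method="merged_16bit"
)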

Theoretical Basis

The merge operation for each LoRA-adapted linear layer:

$$W_{\text{merged}} = \text{dequantize}(W_{\text{4bit}}) + \frac{\alpha}{r} B A$$

# Abstract LoRA merge process (pseudocode; helper functions are illustrative)
for layer in model.layers:
    if has_lora(layer):
        W_base = dequantize(layer.weight)  # 4-bit -> float16
        # Scaled low-rank update: (alpha / r) * (B @ A)
        W_lora = (layer.lora_alpha / layer.r) * (layer.lora_B @ layer.lora_A)
        layer.weight = W_base + W_lora
        remove_lora(layer)  # clean up adapter matrices
save_safetensors(model, output_dir, shard_size="5GB")
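
As a concrete counterpart to the pseudocode above, merging a single bitsandbytes Linear4bit layer could look like the sketch below; it assumes PEFT-style adapter tensors, and the function and argument names are illustrative.

import torch
import bitsandbytes.functional as bnbF

def merge_linear4bit(layer, lora_A, lora_B, lora_alpha, r):
    # Dequantize the 4-bit base weight back to float16.
    W_base = bnbF.dequantize_4bit(
        layer.weight.data, layer.weight.quant_state
    ).to(torch.float16)
    # Scaled low-rank update: (alpha / r) * (B @ A).
    W_lora = (lora_alpha / r) * (lora_B.half() @ lora_A.half())
    return W_base + W_lora  # merged float16 weight matrix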

Related Pages

Implemented By

Uses Heuristic
