Principle: ggml-org/llama.cpp LoRA Model Merging
| Field | Value |
|---|---|
| Principle Name | LoRA Model Merging |
| Workflow | LoRA_Adapter_Workflow |
| Step | 4 of 5 |
| Domain | Weight Fusion |
| Scope | Permanently merging LoRA adapter weights into base model parameters |
Overview
Description
LoRA model merging is the process of permanently fusing LoRA adapter weights into the base model's weight matrices, producing a new standalone model file that incorporates the fine-tuned behavior without requiring the adapter at runtime. This eliminates the per-inference overhead of computing the low-rank product and simplifies deployment by reducing the number of files needed.
The merging operation takes a base GGUF model and one or more LoRA GGUF adapters, computes the scaled low-rank products, adds them to the base weights, and writes a new GGUF model file containing the merged weights.
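In llama.cpp, this merge is performed by the `llama-export-lora` tool. A typical invocation might look like the following (file names are placeholders; flag names reflect recent llama.cpp builds, so check `--help` for your version):

```
./llama-export-lora \
    -m base-model-f16.gguf \
    -o merged-model-f16.gguf \
    --lora lora-adapter.gguf
```

Multiple adapters can be merged in one pass by repeating `--lora`, and `--lora-scaled FNAME SCALE` applies a user-specified merge strength instead of the default.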
Usage
LoRA model merging is appropriate when:
- The adapter will always be used with the base model (no need for dynamic switching)
- Minimizing inference latency is critical (avoids low-rank computation overhead)
- Deploying a single-file model is preferred for simplicity
- Multiple adapters should be baked into a single model permanently
- The merged model will subsequently be quantized to a smaller format
Theoretical Basis
The merging operation materializes the low-rank update into the full weight matrix. For a single adapter, the merged weight is:
W_merged = W_base + scale * (alpha / rank) * transpose(A) @ B
For multiple adapters, the merging is additive:
W_merged = W_base + sum_i(scale_i * (alpha_i / rank_i) * transpose(A_i) @ B_i)
Where:
- W_base is the original pre-trained weight matrix (potentially quantized, dequantized to F32 for computation)
- A_i is the LoRA A matrix for adapter i of dimension (rank x input_dim)
- B_i is the LoRA B matrix for adapter i of dimension (output_dim x rank)
- rank_i is the rank of adapter i (read from B_i's first dimension, inp_b[i]->ne[0])
- alpha_i is the LoRA alpha from the adapter's GGUF metadata
- scale_i is the user-specified merge strength for adapter i
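The additive formula above can be sketched in plain Python (illustrative only, not the ggml implementation). Note that with A of shape (rank x input_dim) and B of shape (output_dim x rank), the row-major product B @ A yields the (output_dim x input_dim) delta that the ggml-style expression transpose(A) @ B denotes:

```python
# Sketch of multi-adapter LoRA merging with pure-Python matrices.
# All names (matmul, merge) are illustrative, not ggml API.

def matmul(X, Y):
    """Naive row-major matrix multiply: X (m x k) times Y (k x n)."""
    m, k, n = len(X), len(Y), len(Y[0])
    return [[sum(X[i][p] * Y[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def merge(W_base, adapters):
    """W_base: (out_dim x in_dim) F32 weights.
    adapters: list of (A, B, alpha, scale) with A (rank x in_dim),
    B (out_dim x rank). Returns W_merged per the additive formula."""
    out_dim, in_dim = len(W_base), len(W_base[0])
    W = [row[:] for row in W_base]      # copy base weights
    for A, B, alpha, scale in adapters:
        rank = len(A)                   # rank is A's first dimension here
        s = scale * alpha / rank        # effective scaling factor
        delta = matmul(B, A)            # (out_dim x rank) @ (rank x in_dim)
        for i in range(out_dim):
            for j in range(in_dim):
                W[i][j] += s * delta[i][j]
    return W
```

Because the update is additive, merging two adapters in one call gives the same result as merging them one after the other.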
The implementation uses a GGML computation graph to perform the merge on a backend (CPU):
- Dequantize the base tensor to F32 if it is quantized
- For each adapter, compute delta = mul_mat(transpose(A), B) (or mul_mat(B, A) for token embeddings)
- Scale the delta by scale * alpha / rank
- Accumulate: result = base + delta_1 + delta_2 + ...
- Cast the result to the output type (F16 by default)
A special case exists for token embedding tensors (token_embd), where the matrix multiplication order is reversed: delta = mul_mat(B, A) instead of mul_mat(transpose(A), B).
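The scaling, accumulation, and final-cast steps described above can be sketched in plain Python (illustrative only; the per-adapter deltas are assumed to be precomputed, and the stdlib's half-precision struct format stands in for the F16 cast):

```python
# Sketch of the scale -> accumulate -> cast-to-F16 pipeline.
# Names (cast_f16, accumulate) are illustrative, not ggml API.
import struct

def cast_f16(x: float) -> float:
    # Round-trip through IEEE 754 half precision ('e' format)
    # to mimic casting the merged result to the F16 output type.
    return struct.unpack('e', struct.pack('e', x))[0]

def accumulate(base, deltas):
    """base: flat list of F32 weights (already dequantized).
    deltas: list of (delta, alpha, rank, scale) with delta a flat
    list of the same length as base. Returns the F16-cast result."""
    out = list(base)
    for delta, alpha, rank, scale in deltas:
        s = scale * alpha / rank        # per-adapter effective scale
        for i, d in enumerate(delta):
            out[i] += s * d             # result = base + delta_1 + ...
    return [cast_f16(v) for v in out]
```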
The output model is forced to F16 precision to accommodate the merged floating-point values, regardless of the original base model's quantization level.
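Because the merged output is F16, it can subsequently be re-quantized to a smaller format with llama.cpp's `llama-quantize` tool, for example (file names are placeholders):

```
./llama-quantize merged-model-f16.gguf merged-model-q4_k_m.gguf Q4_K_M
```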