Principle: ggml-org/llama.cpp LoRA Model Merging
| Field | Value |
|---|---|
| Principle Name | LoRA Model Merging |
| Workflow | LoRA_Adapter_Workflow |
| Step | 4 of 5 |
| Domain | Weight Fusion |
| Scope | Permanently merging LoRA adapter weights into base model parameters |
Overview
Description
LoRA model merging is the process of permanently fusing LoRA adapter weights into the base model's weight matrices, producing a new standalone model file that incorporates the fine-tuned behavior without requiring the adapter at runtime. This eliminates the per-inference overhead of computing the low-rank product and simplifies deployment by reducing the number of files needed.
The merging operation takes a base GGUF model and one or more LoRA GGUF adapters, computes the scaled low-rank products, adds them to the base weights, and writes a new GGUF model file containing the merged weights.
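In llama.cpp, this merge is performed by the `llama-export-lora` tool. A typical invocation might look like the following (file names are placeholders; flag names reflect recent llama.cpp builds, so check `--help` for your version):

```
./llama-export-lora \
    -m base-model-f16.gguf \
    -o merged-model-f16.gguf \
    --lora lora-adapter.gguf
```

Multiple adapters can be merged in one pass by repeating `--lora`, and `--lora-scaled FNAME SCALE` applies a user-specified merge strength instead of the default.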
Usage
LoRA model merging is appropriate when:
- The adapter will always be used with the base model (no need for dynamic switching)
- Minimizing inference latency is critical (avoids low-rank computation overhead)
- Deploying a single-file model is preferred for simplicity
- Multiple adapters should be baked into a single model permanently
- The merged model will subsequently be quantized to a smaller format
Theoretical Basis
The merging operation materializes the low-rank update into the full weight matrix. For a single adapter, the merged weight is:
W_merged = W_base + scale * (alpha / rank) * transpose(A) @ B
For multiple adapters, the merging is additive:
W_merged = W_base + sum_i(scale_i * (alpha_i / rank_i) * transpose(A_i) @ B_i)
Where:
- W_base is the original pre-trained weight matrix (potentially quantized, dequantized to F32 for computation)
- A_i is the LoRA A matrix for adapter i of dimension (rank x input_dim)
- B_i is the LoRA B matrix for adapter i of dimension (output_dim x rank)
- rank_i is the rank of adapter i (read from B_i's first dimension, inp_b[i]->ne[0])
- alpha_i is the LoRA alpha from the adapter's GGUF metadata
- scale_i is the user-specified merge strength for adapter i
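The additive formula above can be sketched in plain Python (illustrative only, not the ggml implementation). Note that with A of shape (rank x input_dim) and B of shape (output_dim x rank), the row-major product B @ A yields the (output_dim x input_dim) delta that the ggml-style expression transpose(A) @ B denotes:

```python
# Sketch of multi-adapter LoRA merging with pure-Python matrices.
# All names (matmul, merge) are illustrative, not ggml API.

def matmul(X, Y):
    """Naive row-major matrix multiply: X (m x k) times Y (k x n)."""
    m, k, n = len(X), len(Y), len(Y[0])
    return [[sum(X[i][p] * Y[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def merge(W_base, adapters):
    """W_base: (out_dim x in_dim) F32 weights.
    adapters: list of (A, B, alpha, scale) with A (rank x in_dim),
    B (out_dim x rank). Returns W_merged per the additive formula."""
    out_dim, in_dim = len(W_base), len(W_base[0])
    W = [row[:] for row in W_base]      # copy base weights
    for A, B, alpha, scale in adapters:
        rank = len(A)                   # rank is A's first dimension here
        s = scale * alpha / rank        # effective scaling factor
        delta = matmul(B, A)            # (out_dim x rank) @ (rank x in_dim)
        for i in range(out_dim):
            for j in range(in_dim):
                W[i][j] += s * delta[i][j]
    return W
```

Because the update is additive, merging two adapters in one call gives the same result as merging them one after the other.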
The implementation uses a GGML computation graph to perform the merge on a backend (CPU):
- Dequantize the base tensor to F32 if it is quantized
- For each adapter, compute delta = mul_mat(transpose(A), B) (or mul_mat(B, A) for token embeddings)
- Scale the delta by scale * alpha / rank
- Accumulate: result = base + delta_1 + delta_2 + ...
- Cast the result to the output type (F16 by default)
A special case exists for token embedding tensors (token_embd), where the matrix multiplication order is reversed: delta = mul_mat(B, A) instead of mul_mat(transpose(A), B).
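The scaling, accumulation, and final-cast steps described above can be sketched in plain Python (illustrative only; the per-adapter deltas are assumed to be precomputed, and the stdlib's half-precision struct format stands in for the F16 cast):

```python
# Sketch of the scale -> accumulate -> cast-to-F16 pipeline.
# Names (cast_f16, accumulate) are illustrative, not ggml API.
import struct

def cast_f16(x: float) -> float:
    # Round-trip through IEEE 754 half precision ('e' format)
    # to mimic casting the merged result to the F16 output type.
    return struct.unpack('e', struct.pack('e', x))[0]

def accumulate(base, deltas):
    """base: flat list of F32 weights (already dequantized).
    deltas: list of (delta, alpha, rank, scale) with delta a flat
    list of the same length as base. Returns the F16-cast result."""
    out = list(base)
    for delta, alpha, rank, scale in deltas:
        s = scale * alpha / rank        # per-adapter effective scale
        for i, d in enumerate(delta):
            out[i] += s * d             # result = base + delta_1 + ...
    return [cast_f16(v) for v in out]
```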
The output model is forced to F16 precision to accommodate the merged floating-point values, regardless of the original base model's quantization level.
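Because the merged output is F16, it can subsequently be re-quantized to a smaller format with llama.cpp's `llama-quantize` tool, for example (file names are placeholders):

```
./llama-quantize merged-model-f16.gguf merged-model-q4_k_m.gguf Q4_K_M
```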