Implementation: ggml-org/llama.cpp LoRA Merge Ctx
| Field | Value |
|---|---|
| Implementation Name | LoRA Merge Ctx |
| Doc Type | Wrapper Doc |
| Workflow | LoRA_Adapter_Workflow |
| Step | 4 of 5 |
| Source File | tools/export-lora/export-lora.cpp |
Overview
Description
The lora_merge_ctx struct and its run_merge() method implement permanent merging of one or more LoRA adapters into a base model. The tool reads a base GGUF model and LoRA GGUF adapter files, computes the merged weights using a GGML computation graph on the CPU backend, and writes a new F16 GGUF model file containing the fused weights.
The implementation validates adapter compatibility (architecture match, adapter type), handles dequantization of quantized base tensors, and supports multiple simultaneous adapter merges with independent scaling factors.
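Mathematically, each merged weight is `W' = W + sum_i scale_i * (B_i @ A_i)`, with `scale_i = user_scale_i * alpha_i / rank_i`. Below is a minimal pure-Python sketch of that accumulation under our own names (`merge_lora`, `matmul` are illustrative, not from the source); the real tool performs the same arithmetic with GGML tensors on the CPU backend:

```python
# Sketch of the LoRA merge computed by run_merge():
#   W' = W + sum_i scale_i * (B_i @ A_i)
# where scale_i = user_scale_i * alpha_i / rank_i (see merge_tensor).

def matmul(b, a):
    """Naive matrix product: b is (n_out x rank), a is (rank x n_in)."""
    n_out, rank, n_in = len(b), len(a), len(a[0])
    return [[sum(b[o][r] * a[r][i] for r in range(rank)) for i in range(n_in)]
            for o in range(n_out)]

def merge_lora(w, adapters):
    """w: base weight (n_out x n_in); adapters: list of (A, B, alpha, user_scale)."""
    for a, b, alpha, user_scale in adapters:
        rank = len(a)
        scale = user_scale * alpha / rank if alpha else user_scale
        delta = matmul(b, a)
        w = [[w[o][i] + scale * delta[o][i] for i in range(len(w[0]))]
             for o in range(len(w))]
    return w

# 2x2 base weight, one rank-1 adapter with alpha=2, user scale 1.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank x n_in  = 1x2
B = [[1.0], [2.0]]          # n_out x rank = 2x1
merged = merge_lora(W, [(A, B, 2.0, 1.0)])
# scale = 1.0 * 2 / 1 = 2, delta = [[1,1],[2,2]] -> merged = [[3,2],[4,5]]
```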
Usage
```sh
llama-export-lora -m base-model.gguf --lora lora-file.gguf -o merged-model-f16.gguf
```
Code Reference
| Field | Value |
|---|---|
| Source Location | tools/export-lora/export-lora.cpp:114-434 |
| Struct Definition | tools/export-lora/export-lora.cpp:114 (struct lora_merge_ctx) |
| Constructor | tools/export-lora/export-lora.cpp:130-158 |
| run_merge | tools/export-lora/export-lora.cpp:186-271 |
| merge_tensor | tools/export-lora/export-lora.cpp:281-396 |
| main | tools/export-lora/export-lora.cpp:413-434 |
| Import | #include "ggml.h", #include "gguf.h", #include "common.h" |
lora_merge_ctx struct:
```cpp
struct lora_merge_ctx {
    // input base model + adapters
    file_input base_model;
    std::vector<std::unique_ptr<file_input>> adapters;

    // for computing merged tensor
    int n_threads;
    ggml_backend_t backend = nullptr;
    ggml_gallocr_t allocr = nullptr;
    std::vector<uint8_t> read_buf;

    // output file
    struct gguf_context * ctx_out;
    struct ggml_context * ctx_out_ggml;
    std::ofstream fout;

    lora_merge_ctx(
            std::string & base_fname,
            std::vector<common_adapter_lora_info> & lora_files,
            std::string & outfile,
            int n_threads);

    void run_merge();
    void copy_tensor(struct ggml_tensor * base);
    void merge_tensor(struct ggml_tensor * base, struct ggml_tensor * out);
};
```
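LoRA GGUF adapters store each low-rank pair as two tensors whose names carry `.lora_a` / `.lora_b` suffixes on the base tensor name; run_merge() can then route base tensors without a matching pair through copy_tensor() and the rest through merge_tensor(). A hedged Python sketch of that suffix-based pairing (helper and variable names are ours, not from export-lora.cpp):

```python
# Sketch: pair up adapter tensors by stripping the ".lora_a"/".lora_b"
# suffixes used in LoRA GGUF adapters, so a merge loop can decide between
# copying a base tensor unchanged and merging it with its adapter pair.
# Helper names are illustrative, not taken from the source file.

def collect_lora_pairs(adapter_tensor_names):
    pairs = {}  # base tensor name -> {"a": tensor name, "b": tensor name}
    for name in adapter_tensor_names:
        for suffix, key in ((".lora_a", "a"), (".lora_b", "b")):
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                pairs.setdefault(base, {})[key] = name
    return pairs

names = ["blk.0.attn_q.weight.lora_a", "blk.0.attn_q.weight.lora_b"]
pairs = collect_lora_pairs(names)
# pairs == {"blk.0.attn_q.weight": {"a": "...lora_a", "b": "...lora_b"}}
```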
merge_tensor core computation (graph build):
```cpp
struct ggml_tensor * cur = inp_base;
for (size_t i = 0; i < adapters.size(); ++i) {
    struct ggml_tensor * delta;
    bool is_tok_embd = string_starts_with(name_base, "token_embd");
    if (is_tok_embd) {
        delta = ggml_mul_mat(ctx0,
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32),
            ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32));
    } else {
        delta = ggml_mul_mat(ctx0,
            ggml_cont(ctx0, ggml_transpose(ctx0, ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32))),
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32));
    }

    // scale
    const float alpha = adapters[i]->alpha;
    const float rank  = (float) inp_b[i]->ne[0];
    const float scale = alpha ? adapters[i]->scale * alpha / rank : adapters[i]->scale;
    delta = ggml_scale(ctx0, delta, scale);

    // add to base weight
    cur = ggml_add(ctx0, delta, cur);
}
cur = ggml_cast(ctx0, cur, out->type);
```
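The scale line folds the standard LoRA factor alpha/rank into the user-supplied per-adapter scale; when the adapter carries no alpha (alpha == 0), the raw user scale is used as-is. A small sketch with worked numbers (function name ours):

```python
# Effective per-adapter scale, mirroring the line in merge_tensor:
#   scale = alpha ? adapter_scale * alpha / rank : adapter_scale
def effective_scale(user_scale, alpha, rank):
    return user_scale * alpha / rank if alpha else user_scale

# An adapter trained with rank 16 and alpha 32, applied at user scale 1.0:
s1 = effective_scale(1.0, 32.0, 16)   # 32/16 = 2.0
# An adapter without alpha metadata falls back to the raw user scale:
s2 = effective_scale(0.75, 0.0, 16)   # 0.75
```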
Adapter validation (check_metadata_lora(), called from the constructor for each adapter):
```cpp
void check_metadata_lora(file_input * adapter) {
    auto general_type = get_kv_str(adapter->ctx_gguf, "general.type");
    if (general_type != "adapter") {
        throw std::runtime_error("expect general.type to be 'adapter', but got: " + general_type);
    }

    auto adapter_type = get_kv_str(adapter->ctx_gguf, "adapter.type");
    if (adapter_type != "lora") {
        throw std::runtime_error("expect adapter.type to be 'lora', but got: " + adapter_type);
    }

    auto general_arch_base = get_kv_str(base_model.ctx_gguf, "general.architecture");
    auto general_arch_lora = get_kv_str(adapter->ctx_gguf, "general.architecture");
    if (general_arch_base != general_arch_lora) {
        throw std::runtime_error("model arch and LoRA arch mismatch");
    }
}
```
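The same three checks can be expressed over plain key-value metadata. A hedged Python sketch operating on dicts of GGUF keys (function and variable names are ours):

```python
# Sketch of the adapter compatibility checks in check_metadata_lora(),
# applied to plain dicts of GGUF key-value metadata. Names are ours.

def check_adapter(base_kv, adapter_kv):
    if adapter_kv.get("general.type") != "adapter":
        raise ValueError("expect general.type to be 'adapter'")
    if adapter_kv.get("adapter.type") != "lora":
        raise ValueError("expect adapter.type to be 'lora'")
    if base_kv.get("general.architecture") != adapter_kv.get("general.architecture"):
        raise ValueError("model arch and LoRA arch mismatch")

base    = {"general.architecture": "llama"}
adapter = {"general.type": "adapter", "adapter.type": "lora",
           "general.architecture": "llama"}
check_adapter(base, adapter)   # compatible: no exception raised
```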
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | base_fname | std::string | Path to the base GGUF model file |
| Input | lora_files | std::vector<common_adapter_lora_info> | List of LoRA adapter files with their scaling factors |
| Input | outfile | std::string | Path for the output merged GGUF model (default: ggml-lora-merged-f16.gguf) |
| Input | n_threads | int | Number of CPU threads for computation |
| Output | merged GGUF file | binary file | F16 GGUF model with LoRA weights permanently merged into the base weights |
Constraints:
- Split models are not supported
- All adapters must have the same list of tensors (subset merging not yet supported)
- Quantized LoRA adapters are not supported (must use f16 or f32 adapters)
- Output is always F16 regardless of base model quantization
Usage Examples
Merge a single LoRA adapter:
```sh
./llama-export-lora \
    -m base-model-q4.gguf \
    --lora my-adapter.gguf \
    -o merged-model-f16.gguf
```
Merge multiple LoRA adapters:
```sh
./llama-export-lora \
    -m base-model.gguf \
    --lora adapter1.gguf \
    --lora adapter2.gguf \
    -o merged-multi-f16.gguf
```
Full pipeline (merge, then quantize):
```sh
# Step 1: merge the LoRA adapter into the base model (output is F16)
./llama-export-lora -m base.gguf --lora adapter.gguf -o merged-f16.gguf

# Step 2: quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M
```