Implementation: ggml-org/llama.cpp LoRA Merge Ctx
| Field | Value |
|---|---|
| Implementation Name | LoRA Merge Ctx |
| Doc Type | Wrapper Doc |
| Workflow | LoRA_Adapter_Workflow |
| Step | 4 of 5 |
| Source File | tools/export-lora/export-lora.cpp |
Overview
Description
The lora_merge_ctx struct and its run_merge() method implement permanent merging of one or more LoRA adapters into a base model. The tool reads a base GGUF model and LoRA GGUF adapter files, computes the merged weights using a GGML computation graph on the CPU backend, and writes a new F16 GGUF model file containing the fused weights.
The implementation validates adapter compatibility (architecture match, adapter type), handles dequantization of quantized base tensors, and supports multiple simultaneous adapter merges with independent scaling factors.
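Mathematically, each merged weight is `W' = W + sum_i scale_i * (B_i @ A_i)`, with `scale_i = user_scale_i * alpha_i / rank_i`. Below is a minimal pure-Python sketch of that accumulation under our own names (`merge_lora`, `matmul` are illustrative, not from the source); the real tool performs the same arithmetic with GGML tensors on the CPU backend:

```python
# Sketch of the LoRA merge computed by run_merge():
#   W' = W + sum_i scale_i * (B_i @ A_i)
# where scale_i = user_scale_i * alpha_i / rank_i (see merge_tensor).

def matmul(b, a):
    """Naive matrix product: b is (n_out x rank), a is (rank x n_in)."""
    n_out, rank, n_in = len(b), len(a), len(a[0])
    return [[sum(b[o][r] * a[r][i] for r in range(rank)) for i in range(n_in)]
            for o in range(n_out)]

def merge_lora(w, adapters):
    """w: base weight (n_out x n_in); adapters: list of (A, B, alpha, user_scale)."""
    for a, b, alpha, user_scale in adapters:
        rank = len(a)
        scale = user_scale * alpha / rank if alpha else user_scale
        delta = matmul(b, a)
        w = [[w[o][i] + scale * delta[o][i] for i in range(len(w[0]))]
             for o in range(len(w))]
    return w

# 2x2 base weight, one rank-1 adapter with alpha=2, user scale 1.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank x n_in  = 1x2
B = [[1.0], [2.0]]          # n_out x rank = 2x1
merged = merge_lora(W, [(A, B, 2.0, 1.0)])
# scale = 1.0 * 2 / 1 = 2, delta = [[1,1],[2,2]] -> merged = [[3,2],[4,5]]
```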
Usage
```sh
llama-export-lora -m base-model.gguf --lora lora-file.gguf -o merged-model-f16.gguf
```
Code Reference
| Field | Value |
|---|---|
| Source Location | tools/export-lora/export-lora.cpp:114-434 |
| Struct Definition | tools/export-lora/export-lora.cpp:114 (struct lora_merge_ctx) |
| Constructor | tools/export-lora/export-lora.cpp:130-158 |
| run_merge | tools/export-lora/export-lora.cpp:186-271 |
| merge_tensor | tools/export-lora/export-lora.cpp:281-396 |
| main | tools/export-lora/export-lora.cpp:413-434 |
| Import | #include "ggml.h", #include "gguf.h", #include "common.h" |
lora_merge_ctx struct:
```cpp
struct lora_merge_ctx {
    // input base model + adapters
    file_input base_model;
    std::vector<std::unique_ptr<file_input>> adapters;

    // for computing merged tensor
    int n_threads;
    ggml_backend_t backend = nullptr;
    ggml_gallocr_t allocr = nullptr;
    std::vector<uint8_t> read_buf;

    // output file
    struct gguf_context * ctx_out;
    struct ggml_context * ctx_out_ggml;
    std::ofstream fout;

    lora_merge_ctx(
            std::string & base_fname,
            std::vector<common_adapter_lora_info> & lora_files,
            std::string & outfile,
            int n_threads);

    void run_merge();
    void copy_tensor(struct ggml_tensor * base);
    void merge_tensor(struct ggml_tensor * base, struct ggml_tensor * out);
};
```
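LoRA GGUF adapters store each low-rank pair as two tensors whose names carry `.lora_a` / `.lora_b` suffixes on the base tensor name; run_merge() can then route base tensors without a matching pair through copy_tensor() and the rest through merge_tensor(). A hedged Python sketch of that suffix-based pairing (helper and variable names are ours, not from export-lora.cpp):

```python
# Sketch: pair up adapter tensors by stripping the ".lora_a"/".lora_b"
# suffixes used in LoRA GGUF adapters, so a merge loop can decide between
# copying a base tensor unchanged and merging it with its adapter pair.
# Helper names are illustrative, not taken from the source file.

def collect_lora_pairs(adapter_tensor_names):
    pairs = {}  # base tensor name -> {"a": tensor name, "b": tensor name}
    for name in adapter_tensor_names:
        for suffix, key in ((".lora_a", "a"), (".lora_b", "b")):
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                pairs.setdefault(base, {})[key] = name
    return pairs

names = ["blk.0.attn_q.weight.lora_a", "blk.0.attn_q.weight.lora_b"]
pairs = collect_lora_pairs(names)
# pairs == {"blk.0.attn_q.weight": {"a": "...lora_a", "b": "...lora_b"}}
```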
merge_tensor core computation (graph build):
```cpp
struct ggml_tensor * cur = inp_base;
for (size_t i = 0; i < adapters.size(); ++i) {
    struct ggml_tensor * delta;
    bool is_tok_embd = string_starts_with(name_base, "token_embd");
    if (is_tok_embd) {
        delta = ggml_mul_mat(ctx0,
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32),
            ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32));
    } else {
        delta = ggml_mul_mat(ctx0,
            ggml_cont(ctx0, ggml_transpose(ctx0, ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32))),
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32));
    }

    // scale
    const float alpha = adapters[i]->alpha;
    const float rank  = (float) inp_b[i]->ne[0];
    const float scale = alpha ? adapters[i]->scale * alpha / rank : adapters[i]->scale;
    delta = ggml_scale(ctx0, delta, scale);

    // add to base weight
    cur = ggml_add(ctx0, delta, cur);
}
cur = ggml_cast(ctx0, cur, out->type);
```
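The scale line folds the standard LoRA factor alpha/rank into the user-supplied per-adapter scale; when the adapter carries no alpha (alpha == 0), the raw user scale is used as-is. A small sketch with worked numbers (function name ours):

```python
# Effective per-adapter scale, mirroring the line in merge_tensor:
#   scale = alpha ? adapter_scale * alpha / rank : adapter_scale
def effective_scale(user_scale, alpha, rank):
    return user_scale * alpha / rank if alpha else user_scale

# An adapter trained with rank 16 and alpha 32, applied at user scale 1.0:
s1 = effective_scale(1.0, 32.0, 16)   # 32/16 = 2.0
# An adapter without alpha metadata falls back to the raw user scale:
s2 = effective_scale(0.75, 0.0, 16)   # 0.75
```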
Adapter validation (check_metadata_lora(), called from the constructor for each adapter):
```cpp
void check_metadata_lora(file_input * adapter) {
    auto general_type = get_kv_str(adapter->ctx_gguf, "general.type");
    if (general_type != "adapter") {
        throw std::runtime_error("expect general.type to be 'adapter', but got: " + general_type);
    }

    auto adapter_type = get_kv_str(adapter->ctx_gguf, "adapter.type");
    if (adapter_type != "lora") {
        throw std::runtime_error("expect adapter.type to be 'lora', but got: " + adapter_type);
    }

    auto general_arch_base = get_kv_str(base_model.ctx_gguf, "general.architecture");
    auto general_arch_lora = get_kv_str(adapter->ctx_gguf, "general.architecture");
    if (general_arch_base != general_arch_lora) {
        throw std::runtime_error("model arch and LoRA arch mismatch");
    }
}
```
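The same three checks can be expressed over plain key-value metadata. A hedged Python sketch operating on dicts of GGUF keys (function and variable names are ours):

```python
# Sketch of the adapter compatibility checks in check_metadata_lora(),
# applied to plain dicts of GGUF key-value metadata. Names are ours.

def check_adapter(base_kv, adapter_kv):
    if adapter_kv.get("general.type") != "adapter":
        raise ValueError("expect general.type to be 'adapter'")
    if adapter_kv.get("adapter.type") != "lora":
        raise ValueError("expect adapter.type to be 'lora'")
    if base_kv.get("general.architecture") != adapter_kv.get("general.architecture"):
        raise ValueError("model arch and LoRA arch mismatch")

base    = {"general.architecture": "llama"}
adapter = {"general.type": "adapter", "adapter.type": "lora",
           "general.architecture": "llama"}
check_adapter(base, adapter)   # compatible: no exception raised
```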
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | base_fname | std::string | Path to the base GGUF model file |
| Input | lora_files | std::vector<common_adapter_lora_info> | List of LoRA adapter files with their scaling factors |
| Input | outfile | std::string | Path for the output merged GGUF model (default: ggml-lora-merged-f16.gguf) |
| Input | n_threads | int | Number of CPU threads for computation |
| Output | merged GGUF file | binary file | F16 GGUF model with LoRA weights permanently merged into the base weights |
Constraints:
- Split models are not supported
- All adapters must have the same list of tensors (subset merging not yet supported)
- Quantized LoRA adapters are not supported (must use f16 or f32 adapters)
- Output is always F16 regardless of base model quantization
Usage Examples
Merge a single LoRA adapter:
```sh
./llama-export-lora \
    -m base-model-q4.gguf \
    --lora my-adapter.gguf \
    -o merged-model-f16.gguf
```
Merge multiple LoRA adapters:
```sh
./llama-export-lora \
    -m base-model.gguf \
    --lora adapter1.gguf \
    --lora adapter2.gguf \
    -o merged-multi-f16.gguf
```
Full pipeline (merge, then quantize):
```sh
# Step 1: merge the LoRA adapter into the base model (output is F16)
./llama-export-lora -m base.gguf --lora adapter.gguf -o merged-f16.gguf

# Step 2: quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M
```