Implementation: ggml-org/llama.cpp LoRA Merge Ctx

From Leeroopedia
Implementation Name: LoRA Merge Ctx
Doc Type: Wrapper Doc
Workflow: LoRA_Adapter_Workflow
Step: 4 of 5
Source File: tools/export-lora/export-lora.cpp

Overview

Description

The lora_merge_ctx struct and its run_merge() method implement permanent merging of one or more LoRA adapters into a base model. The tool reads a base GGUF model and LoRA GGUF adapter files, computes the merged weights using a GGML computation graph on the CPU backend, and writes a new F16 GGUF model file containing the fused weights.

The implementation validates adapter compatibility (architecture match, adapter type), handles dequantization of quantized base tensors, and supports multiple simultaneous adapter merges with independent scaling factors.
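
Conceptually, each merged tensor is W' = W + scale * (B x A), where A and B are the low-rank adapter factors. A minimal dense sketch of that computation on plain float matrices (layout and names here are illustrative assumptions; the actual tool builds a GGML graph and respects GGML's tensor layout):

```cpp
#include <cstddef>
#include <vector>

// Naive dense LoRA merge: W' = W + scale * (B * A)
// W: [rows x cols], B: [rows x rank], A: [rank x cols], all row-major.
// Illustrative only; the real tool runs this as a GGML graph on the CPU backend.
std::vector<float> lora_merge(const std::vector<float> & W,
                              const std::vector<float> & A,
                              const std::vector<float> & B,
                              size_t rows, size_t cols, size_t rank,
                              float scale) {
    std::vector<float> out(W);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            float delta = 0.0f;
            for (size_t k = 0; k < rank; ++k) {
                delta += B[r * rank + k] * A[k * cols + c];
            }
            out[r * cols + c] += scale * delta;
        }
    }
    return out;
}
```

Because rank is typically much smaller than rows and cols, the adapter stores far fewer parameters than the delta it represents; merging simply materializes that delta once.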

Usage

llama-export-lora -m base-model.gguf --lora lora-file.gguf -o merged-model-f16.gguf

Code Reference

Source Location: tools/export-lora/export-lora.cpp:114-434
Struct Definition: tools/export-lora/export-lora.cpp:114 (struct lora_merge_ctx)
Constructor: tools/export-lora/export-lora.cpp:130-158
run_merge: tools/export-lora/export-lora.cpp:186-271
merge_tensor: tools/export-lora/export-lora.cpp:281-396
main: tools/export-lora/export-lora.cpp:413-434
Imports: #include "ggml.h", #include "gguf.h", #include "common.h"

lora_merge_ctx struct:

struct lora_merge_ctx {
    // input base model + adapters
    file_input base_model;
    std::vector<std::unique_ptr<file_input>> adapters;

    // for computing merged tensor
    int n_threads;
    ggml_backend_t backend = nullptr;
    ggml_gallocr_t allocr = nullptr;
    std::vector<uint8_t> read_buf;

    // output file
    struct gguf_context * ctx_out;
    struct ggml_context * ctx_out_ggml;
    std::ofstream fout;

    lora_merge_ctx(
            std::string & base_fname,
            std::vector<common_adapter_lora_info> & lora_files,
            std::string & outfile,
            int n_threads);

    void run_merge();
    void copy_tensor(struct ggml_tensor * base);
    void merge_tensor(struct ggml_tensor * base, struct ggml_tensor * out);
};

merge_tensor core computation (graph build):

struct ggml_tensor * cur = inp_base;
for (size_t i = 0; i < adapters.size(); ++i) {
    struct ggml_tensor * delta;
    bool is_tok_embd = string_starts_with(name_base, "token_embd");
    if (is_tok_embd) {
        delta = ggml_mul_mat(ctx0,
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32),
            ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32));
    } else {
        delta = ggml_mul_mat(ctx0,
            ggml_cont(ctx0, ggml_transpose(ctx0, ggml_cast(ctx0, inp_a[i], GGML_TYPE_F32))),
            ggml_cast(ctx0, inp_b[i], GGML_TYPE_F32));
    }
    // scale
    const float alpha = adapters[i]->alpha;
    const float rank  = (float) inp_b[i]->ne[0];
    const float scale = alpha ? adapters[i]->scale * alpha / rank : adapters[i]->scale;
    delta = ggml_scale(ctx0, delta, scale);
    cur = ggml_add(ctx0, delta, cur);
}
cur = ggml_cast(ctx0, cur, out->type);
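
The scaling step above follows the standard LoRA convention: when the adapter records a nonzero alpha, the user-supplied scale is multiplied by alpha / rank; otherwise the user scale is applied as-is. A standalone restatement of that formula (the function name is illustrative, not part of the tool):

```cpp
// Effective LoRA scale, mirroring the snippet above:
// nonzero alpha -> user_scale * alpha / rank, else user_scale unchanged.
float lora_effective_scale(float user_scale, float alpha, float rank) {
    return alpha != 0.0f ? user_scale * alpha / rank : user_scale;
}
```

For example, an adapter with alpha = 16 and rank = 8 at user scale 1.0 contributes its delta at an effective scale of 2.0.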

Adapter validation (check_metadata_lora, called from the constructor):

void check_metadata_lora(file_input * adapter) {
    auto general_type = get_kv_str(adapter->ctx_gguf, "general.type");
    if (general_type != "adapter") {
        throw std::runtime_error("expect general.type to be 'adapter', but got: " + general_type);
    }
    auto adapter_type = get_kv_str(adapter->ctx_gguf, "adapter.type");
    if (adapter_type != "lora") {
        throw std::runtime_error("expect adapter.type to be 'lora', but got: " + adapter_type);
    }
    auto general_arch_base = get_kv_str(base_model.ctx_gguf, "general.architecture");
    auto general_arch_lora = get_kv_str(adapter->ctx_gguf, "general.architecture");
    if (general_arch_base != general_arch_lora) {
        throw std::runtime_error("model arch and LoRA arch mismatch");
    }
}
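
The same checks can be restated against a plain std::map standing in for the GGUF key-value store; the kv_store alias and helper name below are illustrative assumptions, not part of the tool:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Mock of the GGUF key-value metadata store (illustrative).
using kv_store = std::map<std::string, std::string>;

// Mirrors check_metadata_lora(): the adapter must declare itself as a LoRA
// adapter and match the base model's architecture.
void check_adapter_metadata(const kv_store & base_kv, const kv_store & adapter_kv) {
    if (adapter_kv.at("general.type") != "adapter") {
        throw std::runtime_error("expect general.type to be 'adapter'");
    }
    if (adapter_kv.at("adapter.type") != "lora") {
        throw std::runtime_error("expect adapter.type to be 'lora'");
    }
    if (base_kv.at("general.architecture") != adapter_kv.at("general.architecture")) {
        throw std::runtime_error("model arch and LoRA arch mismatch");
    }
}
```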

I/O Contract

Input base_fname (std::string): Path to the base GGUF model file
Input lora_files (std::vector<common_adapter_lora_info>): List of LoRA adapter files with their scaling factors
Input outfile (std::string): Path for the output merged GGUF model (default: ggml-lora-merged-f16.gguf)
Input n_threads (int): Number of CPU threads for computation
Output merged GGUF file (binary file): F16 GGUF model with LoRA weights permanently merged into the base weights

Constraints:

  • Split models are not supported
  • All adapters must have the same list of tensors (subset merging not yet supported)
  • Quantized LoRA adapters are not supported (must use f16 or f32 adapters)
  • Output is always F16 regardless of base model quantization
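
The second constraint (every adapter must carry the same tensor list) amounts to a set-equality check across adapters. A hedged sketch, with an illustrative helper name:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Constraint sketch: all adapters must expose an identical set of tensor
// names, since merging a subset of tensors is not supported by the tool.
bool adapters_have_same_tensors(const std::vector<std::set<std::string>> & names) {
    for (size_t i = 1; i < names.size(); ++i) {
        if (names[i] != names[0]) {
            return false;
        }
    }
    return true;
}
```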

Usage Examples

Merge a single LoRA adapter:

./llama-export-lora \
    -m base-model-q4.gguf \
    --lora my-adapter.gguf \
    -o merged-model-f16.gguf

Merge multiple LoRA adapters:

./llama-export-lora \
    -m base-model.gguf \
    --lora adapter1.gguf \
    --lora adapter2.gguf \
    -o merged-multi-f16.gguf

Full pipeline (merge then quantize):

# Step 1: Merge LoRA into base model (output is F16)
./llama-export-lora -m base.gguf --lora adapter.gguf -o merged-f16.gguf

# Step 2: Quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M
