Workflow:Ggml org Llama cpp LoRA Adapter Workflow

From Leeroopedia
Knowledge Sources
Domains: LLMs, Fine_Tuning, LoRA, Model_Adaptation
Last Updated: 2026-02-14 22:00 GMT

Overview

End-to-end process for converting HuggingFace LoRA adapters to GGUF format and merging them into base models for customized inference.

Description

This workflow covers the complete lifecycle of using Low-Rank Adaptation (LoRA) adapters with llama.cpp. LoRA adapters are small auxiliary weight matrices that modify a base model's behavior for specific tasks or domains without changing the base weights. The workflow includes converting LoRA adapters from HuggingFace format to GGUF, optionally merging one or more adapters into a base model to create a standalone fine-tuned model, and using adapters at inference time. Multiple adapters can be combined with custom scaling factors for blended behavior.

Usage

Execute this workflow when you have a LoRA adapter trained with frameworks such as PEFT, Unsloth, or Axolotl (in HuggingFace format) and want to use it with llama.cpp. This is also appropriate when you want to permanently merge adapters into a base model for distribution or deployment, or when blending multiple LoRA adapters with different weights.

Execution Steps

Step 1: Obtain LoRA Adapter

Acquire the LoRA adapter files from HuggingFace Hub or a local training output directory. The adapter consists of low-rank weight matrices (A and B) for each adapted layer, along with metadata specifying the target model architecture and LoRA rank.

Key considerations:

  • The adapter must be compatible with the intended base model architecture
  • Adapter files are typically in safetensors or PyTorch bin format
  • The adapter's adapter_config.json specifies the rank (r), alpha (lora_alpha), target modules, and base model
  • Common training frameworks: PEFT, Unsloth, Axolotl, LLaMA-Factory
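Before converting, it can help to confirm that the adapter's configuration carries the fields listed above. The following is a minimal sketch, assuming the adapter was saved by PEFT (which writes its settings to adapter_config.json) and that target_modules is stored as a list. The sample values and the helper name summarize_adapter_config are hypothetical.

```python
import json

# Field names follow the PEFT adapter_config.json layout.
# The values below are made up for illustration.
SAMPLE_CONFIG = """
{
  "peft_type": "LORA",
  "base_model_name_or_path": "example-org/example-base-model",
  "r": 16,
  "lora_alpha": 32,
  "target_modules": ["q_proj", "v_proj"]
}
"""


def summarize_adapter_config(raw_json: str) -> dict:
    """Extract the fields relevant to conversion; fail early if one is missing."""
    config = json.loads(raw_json)
    required = ("base_model_name_or_path", "r", "lora_alpha", "target_modules")
    missing = [key for key in required if key not in config]
    if missing:
        raise ValueError(f"adapter config is missing: {', '.join(missing)}")
    return {
        "base_model": config["base_model_name_or_path"],
        "rank": config["r"],
        "alpha": config["lora_alpha"],
        # Assumes a list of module names (PEFT also allows a single string).
        "target_modules": sorted(config["target_modules"]),
        # Default scaling of the low-rank update: alpha / rank.
        "default_scale": config["lora_alpha"] / config["r"],
    }


summary = summarize_adapter_config(SAMPLE_CONFIG)
print(summary)
```

In practice the JSON string would be read from the adapter directory rather than embedded in the script.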

Step 2: Convert LoRA to GGUF

Run the convert_lora_to_gguf.py script to convert the HuggingFace-format adapter into a GGUF file. The conversion maps the adapter's tensor names to llama.cpp conventions and packages the A and B matrices with their rank metadata.

Key considerations:

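  • The converter inherits architecture support from the main conversion script (convert_hf_to_gguf.py), so an architecture that the main script cannot convert is not supported for adapters either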
  • Both A and B matrices must be present for each adapted layer
  • The adapter rank (n_rank) is preserved in the GGUF metadata
  • Output format is typically FP16 for the adapter weights
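The conversion step can be sketched as a small helper that assembles the command line without running it. The flags shown (--base, --outfile, --outtype, and the positional adapter path) reflect the script's options as understood at the time of writing; check `python convert_lora_to_gguf.py --help` in your checkout before relying on them. All paths and the helper name build_convert_command are hypothetical.

```python
import shlex
import sys


def build_convert_command(adapter_dir, base_dir, outfile, outtype="f16"):
    """Return the argv list for convert_lora_to_gguf.py without executing it."""
    return [
        sys.executable, "convert_lora_to_gguf.py",
        "--base", base_dir,      # directory holding the base model's HF config files
        "--outfile", outfile,    # where the GGUF adapter is written
        "--outtype", outtype,    # f16 matches the typical adapter output format
        adapter_dir,             # directory containing the HuggingFace-format adapter
    ]


# Hypothetical paths, for illustration only.
cmd = build_convert_command("./my-adapter", "./base-model-hf", "my-adapter-f16.gguf")
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) from the llama.cpp repository root would perform the conversion.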

Step 3: Apply Adapter at Inference Time (Option A)

Load the LoRA adapter at inference time alongside the base model. The adapter weights are applied on-the-fly during inference, modifying the model's behavior without altering the base weights. This allows quick switching between different adapters.

Key considerations:

  • The adapter adds little memory overhead (typically less than 1% of base model size for low ranks; the overhead grows with the rank and the number of adapted modules)
  • Multiple adapters can be loaded simultaneously
  • Scaling factor controls the adapter's influence strength
  • This approach preserves the original base model for other uses
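To make the on-the-fly behavior concrete, the sketch below applies a LoRA update to a single toy layer as y = W x + scale * B (A x), leaving W untouched. It is a pure-Python illustration of the idea, not llama.cpp's implementation. The function names and toy matrices are hypothetical, and the overhead helper assumes one adapted matrix of shape d_out x d_in.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward_with_lora(base_w, lora_a, lora_b, scale, x):
    """Compute y = W x + scale * B (A x). base_w is read but never modified."""
    base_out = matvec(base_w, x)
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base_out, low_rank_out)]


def lora_param_fraction(d_in, d_out, rank):
    """Adapter parameters relative to the adapted weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


# Toy 2x2 layer with a rank-1 adapter.
base_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]        # shape: rank x d_in
lora_b = [[0.5], [-0.5]]     # shape: d_out x rank
x = [2.0, 4.0]

adapted = forward_with_lora(base_w, lora_a, lora_b, 1.0, x)
disabled = forward_with_lora(base_w, lora_a, lora_b, 0.0, x)
fraction = lora_param_fraction(4096, 4096, 16)

print(adapted)    # [5.0, 1.0]
print(disabled)   # [2.0, 4.0], identical to the base model's output
print(fraction)   # 0.0078125
```

Setting the scale to 0 recovers the base model's output exactly, which is why switching adapters does not require reloading the base weights. The last value shows that a rank-16 adapter on a 4096 x 4096 matrix holds under 1% of that matrix's parameters.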

Step 4: Merge Adapter into Base Model (Option B)

Alternatively, use the llama-export-lora tool to permanently merge one or more LoRA adapters into the base model, producing a new standalone GGUF file. The merge computes base_weight + scale * (B @ A) for each adapted layer and writes the result in FP16 format.

Key considerations:

  • Multiple adapters can be merged simultaneously with independent scaling factors
  • The merge formula is: output = base + scale * (lora_B @ lora_A) for each layer
  • Scale is computed from alpha and rank: scale = user_scale * alpha / rank
  • The output is an FP16 GGUF file that can be quantized further
  • Split models are not yet supported for merging
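The merge formula and scale computation above can be worked through on a toy layer. This is a pure-Python sketch of the arithmetic only, under the assumption that the rank is taken from the number of rows of lora_A. It does not read or write GGUF files, and the names merge_adapters and effective_scale are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def effective_scale(user_scale, alpha, rank):
    """scale = user_scale * alpha / rank, as stated in the list above."""
    return user_scale * alpha / rank


def merge_adapters(base_w, adapters):
    """Return base + sum over adapters of scale * (lora_B @ lora_A)."""
    merged = [row[:] for row in base_w]  # copy so the base weights stay intact
    for adapter in adapters:
        rank = len(adapter["lora_a"])    # lora_A has shape rank x d_in
        scale = effective_scale(adapter["user_scale"], adapter["alpha"], rank)
        delta = matmul(adapter["lora_b"], adapter["lora_a"])
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    # Rank-1 adapter, alpha 2, full strength: effective scale 2.0
    {"lora_a": [[1.0, 1.0]], "lora_b": [[0.5], [-0.5]], "alpha": 2.0, "user_scale": 1.0},
    # Rank-1 adapter, alpha 1, half strength: effective scale 0.5
    {"lora_a": [[0.0, 1.0]], "lora_b": [[1.0], [1.0]], "alpha": 1.0, "user_scale": 0.5},
]

merged_w = merge_adapters(base_w, adapters)
print(merged_w)  # [[2.0, 1.5], [-1.0, 0.5]]
```

Each adapter contributes its own scaled low-rank update, so blending two adapters at different strengths only changes the user_scale values.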

Step 5: Quantize Merged Model (Optional)

If the adapter was merged in Step 4, the resulting FP16 model can be quantized using the standard quantization workflow to reduce its size for deployment.

Key considerations:

  • Standard quantization types (Q4_K_M, Q5_K_M, etc.) are applicable
  • Importance matrix generation should use the merged model for best results
  • The merged and quantized model behaves identically to any other quantized GGUF model
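Quantizing the merged file can be sketched the same way, by assembling the llama-quantize command line without running it. The argument order (optional --imatrix, then input file, output file, and quantization type) follows the tool's usage as understood at the time of writing; confirm with `llama-quantize --help`. The file names and the helper name build_quantize_command are hypothetical.

```python
import shlex


def build_quantize_command(merged_gguf, output_gguf, quant_type="Q4_K_M", imatrix=None):
    """Return the argv list for llama-quantize without executing it."""
    cmd = ["llama-quantize"]
    if imatrix is not None:
        # The importance matrix should have been generated from the merged model.
        cmd += ["--imatrix", imatrix]
    cmd += [merged_gguf, output_gguf, quant_type]
    return cmd


# Hypothetical file names, for illustration only.
plain = build_quantize_command("merged-f16.gguf", "merged-q4_k_m.gguf")
with_imatrix = build_quantize_command(
    "merged-f16.gguf", "merged-q5_k_m.gguf", "Q5_K_M", imatrix="merged.imatrix"
)

print(shlex.join(plain))
print(shlex.join(with_imatrix))
```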
