
Principle:Ggml org Llama cpp LoRA to GGUF Conversion

From Leeroopedia
Principle Name: LoRA to GGUF Conversion
Workflow: LoRA_Adapter_Workflow
Step: 2 of 5
Domain: Tensor Format Conversion
Scope: Converting LoRA adapter tensors from PyTorch/safetensors format to GGUF

Overview

Description

LoRA adapters trained with frameworks like Hugging Face PEFT are stored in PyTorch-native formats (safetensors or .bin). The llama.cpp inference engine uses its own tensor format called GGUF (GGML Universal File Format). This principle covers the theory of converting LoRA adapter tensors from their training format into GGUF, preserving the low-rank factored representation (separate A and B matrices) rather than materializing the full-rank update.

The conversion process must handle tensor name remapping (from PyTorch naming conventions to GGML naming conventions), maintain the factored A/B representation for memory efficiency, and embed adapter metadata (lora_alpha, architecture, file type) as GGUF key-value pairs.

Usage

This conversion step is required before any LoRA adapter can be used with llama.cpp at runtime, whether for dynamic application or permanent merging. The conversion:

  • Translates PyTorch tensor names (e.g., base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight) to GGML tensor names (e.g., blk.0.attn_q.weight.lora_a)
  • Preserves the separate A and B matrices as distinct tensors in GGUF
  • Stores the lora_alpha scaling parameter as GGUF metadata
  • Supports output types: f32, f16, bf16, q8_0, and auto
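The name translation above can be sketched in a few lines of Python. The `remap_lora_name` helper and its mapping table are hypothetical simplifications for illustration; llama.cpp's actual converter reuses the per-architecture tensor-mapping tables from its base-model converter and covers far more modules.

```python
import re

# Hypothetical mapping table: covers only llama-style attention
# projections, for illustration.
PEFT_TO_GGML = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
}

def remap_lora_name(peft_name: str) -> str:
    """Translate a PEFT tensor name to its GGML equivalent (sketch)."""
    m = re.fullmatch(
        r"base_model\.model\.model\.layers\.(\d+)\.(.+)\.lora_([AB])\.weight",
        peft_name,
    )
    if m is None:
        raise ValueError(f"unrecognized tensor name: {peft_name}")
    layer, module, factor = m.groups()
    return f"blk.{layer}.{PEFT_TO_GGML[module]}.weight.lora_{factor.lower()}"

print(remap_lora_name(
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
))  # blk.0.attn_q.weight.lora_a
```

Note that the `lora_A`/`lora_B` suffix moves to the end of the GGML name and is lowercased, so both factors of one adapted weight sort adjacently in the output file.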

Theoretical Basis

The conversion theory rests on the principle that LoRA's low-rank decomposition can be represented in any tensor format without loss of mathematical fidelity, as long as the A and B matrices are preserved separately.
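A small numpy sketch (not llama.cpp code) makes the point concrete: the update ΔW = (alpha / r) · B · A is fully determined by the stored factors, so a lossless round-trip of A and B through an on-disk byte representation reconstructs ΔW exactly.

```python
import numpy as np

# Shapes follow the text: A is (r x k), B is (d x r); dtype stays f32.
rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4.0

A = rng.standard_normal((r, k)).astype(np.float32)
B = rng.standard_normal((d, r)).astype(np.float32)
delta_w_merged = (alpha / r) * (B @ A)

# Simulate a lossless round-trip through raw on-disk bytes, as an
# f32 export of the separate factors would do.
A_rt = np.frombuffer(A.tobytes(), dtype=np.float32).reshape(r, k)
B_rt = np.frombuffer(B.tobytes(), dtype=np.float32).reshape(d, r)
delta_w_from_factors = (alpha / r) * (B_rt @ A_rt)

assert np.array_equal(delta_w_merged, delta_w_from_factors)  # bit-exact
```

Quantized output types (such as q8_0) do introduce rounding, but only in the factors themselves; the factored structure is still preserved.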

In the training format, each adapted weight has two associated tensors:

base_model.model.{layer}.{module}.lora_A.weight  ->  A matrix (r x k)
base_model.model.{layer}.{module}.lora_B.weight  ->  B matrix (d x r)

The conversion maps these to GGUF tensors:

{ggml_layer}.{ggml_module}.weight.lora_a  ->  A matrix (r x k)
{ggml_layer}.{ggml_module}.weight.lora_b  ->  B matrix (d x r)

A critical aspect of the conversion is the LoraTorchTensor abstraction, which wraps the paired A and B tensors as a single logical tensor. This allows the conversion pipeline to apply model-specific tensor transformations (reshaping, splitting for grouped-query attention, permutations) while maintaining the factored representation. The LoraTorchTensor class overrides standard tensor operations:

  • __getitem__: Applies indexing to both A and B consistently
  • reshape: Reshapes B while keeping A's row dimension intact
  • permute: Routes permutations to the appropriate factor
  • torch.stack / torch.cat: Stacks or concatenates factors independently
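A much-simplified numpy analogue of this idea is sketched below. The `LoraPair` class is hypothetical; the real LoraTorchTensor in llama.cpp is considerably more involved and overrides the full set of torch operations listed above.

```python
import numpy as np

class LoraPair:
    """Sketch of a paired-factor tensor: keeps A (r x k) and B (d x r)
    separate while exposing whole-tensor operations on the logical B @ A."""

    def __init__(self, A: np.ndarray, B: np.ndarray):
        assert A.shape[0] == B.shape[1], "rank dimensions must agree"
        self.A, self.B = A, B

    @property
    def shape(self):
        # Logical shape of the materialized update B @ A: (d, k)
        return (self.B.shape[0], self.A.shape[1])

    def __getitem__(self, rows):
        # Row indexing of B @ A only touches B; A is shared by all rows.
        return LoraPair(self.A, self.B[rows])

    def materialize(self) -> np.ndarray:
        return self.B @ self.A

r, d, k = 2, 8, 6
pair = LoraPair(np.ones((r, k)), np.ones((d, r)))
sub = pair[:4]                 # slice rows without materializing
print(pair.shape, sub.shape)   # (8, 6) (4, 6)
```

The payoff is that model-specific transformations can be expressed once, against the logical tensor, while the conversion never pays the memory cost of the full d × k update.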

The GGUF output embeds:

  • general.type = "adapter"
  • adapter.type = "lora"
  • adapter.lora.alpha = the alpha scaling value from training configuration
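As a format-agnostic sketch, these key-value pairs can be assembled from the adapter's PEFT configuration like so (the inline JSON stands in for a real adapter_config.json; writing the pairs to disk would go through a GGUF writer, which is omitted here):

```python
import json

# Stand-in for reading adapter_config.json from the PEFT checkpoint.
adapter_config = json.loads('{"lora_alpha": 32, "r": 16}')

# The GGUF metadata keys listed above, populated from the config.
gguf_kv = {
    "general.type": "adapter",
    "adapter.type": "lora",
    "adapter.lora.alpha": float(adapter_config["lora_alpha"]),
}
print(gguf_kv["adapter.lora.alpha"])  # 32.0
```

Storing alpha in the file (rather than baking it into the factors) lets the runtime apply the alpha / r scale itself, and lets users rescale the adapter's strength at load time.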
