# Principle: ggml-org/llama.cpp LoRA to GGUF Conversion
| Field | Value |
|---|---|
| Principle Name | LoRA to GGUF Conversion |
| Workflow | LoRA_Adapter_Workflow |
| Step | 2 of 5 |
| Domain | Tensor Format Conversion |
| Scope | Converting LoRA adapter tensors from PyTorch/safetensors format to GGUF |
## Overview
### Description
LoRA adapters trained with frameworks like Hugging Face PEFT are stored in PyTorch-native formats (safetensors or .bin). The llama.cpp inference engine uses its own tensor format called GGUF (GGML Universal File Format). This principle covers the theory of converting LoRA adapter tensors from their training format into GGUF, preserving the low-rank factored representation (separate A and B matrices) rather than materializing the full-rank update.
The conversion process must handle tensor name remapping (from PyTorch naming conventions to GGML naming conventions), maintain the factored A/B representation for memory efficiency, and embed adapter metadata (lora_alpha, architecture, file type) as GGUF key-value pairs.
### Usage
This conversion step is required before any LoRA adapter can be used with llama.cpp at runtime, whether for dynamic application or permanent merging. The conversion:
- Translates PyTorch tensor names (e.g., `base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight`) to GGML tensor names (e.g., `blk.0.attn_q.weight.lora_a`)
- Preserves the separate A and B matrices as distinct tensors in GGUF
- Stores the lora_alpha scaling parameter as GGUF metadata
- Supports output types: f32, f16, bf16, q8_0, and auto
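The name translation in the first bullet can be sketched with a regular expression and a module-name table. This is a simplified illustration, not the converter's actual code: the real mapping lives in llama.cpp's gguf-py tensor-mapping tables and covers far more modules and architectures than the hypothetical subset below.

```python
import re

# Hypothetical subset of the PEFT -> GGML module-name map (illustration only)
MODULE_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

def remap_lora_name(peft_name: str) -> str:
    """Translate a PEFT LoRA tensor name to a GGML-style adapter tensor name."""
    m = re.fullmatch(
        r"base_model\.model\.model\.layers\.(\d+)\.(.+)\.lora_([AB])\.weight",
        peft_name,
    )
    if m is None:
        raise ValueError(f"unrecognized LoRA tensor name: {peft_name}")
    layer, module, factor = m.groups()
    return f"blk.{layer}.{MODULE_MAP[module]}.weight.lora_{factor.lower()}"

print(remap_lora_name(
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
))  # blk.0.attn_q.weight.lora_a
```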
## Theoretical Basis
The conversion theory rests on the principle that LoRA's low-rank decomposition can be represented in any tensor format without loss of mathematical fidelity, as long as the A and B matrices are preserved separately.
In the training format, each adapted weight has two associated tensors:
```
base_model.model.{layer}.{module}.lora_A.weight  ->  A matrix (r x k)
base_model.model.{layer}.{module}.lora_B.weight  ->  B matrix (d x r)
```
The conversion maps these to GGUF tensors:
```
{ggml_layer}.{ggml_module}.weight.lora_a  ->  A matrix (r x k)
{ggml_layer}.{ggml_module}.weight.lora_b  ->  B matrix (d x r)
```
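The memory argument for keeping the factors separate is easy to make concrete. The sketch below (plain Python, toy dimensions, hypothetical alpha value) materializes the full-rank update delta_W = (alpha / r) * B @ A only to show what the conversion avoids: storing A and B costs r * (d + k) values instead of d * k.

```python
import random

random.seed(0)
d, r, k = 8, 2, 8        # toy dims: output d, rank r, input k
alpha = 16.0             # lora_alpha, carried as GGUF metadata

# A is (r x k), B is (d x r): exactly the shapes stored in both formats
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]

# Materializing the full-rank update (what the conversion deliberately avoids)
scale = alpha / r
delta_W = [
    [scale * sum(B[i][j] * A[j][t] for j in range(r)) for t in range(k)]
    for i in range(d)
]

print(len(delta_W), len(delta_W[0]))  # 8 8  -- the full update is (d x k)
print(r * (d + k), d * k)             # 32 64 -- factored form stores half here
```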
A critical aspect of the conversion is the LoraTorchTensor abstraction, which wraps the paired A and B tensors as a single logical tensor. This allows the conversion pipeline to apply model-specific tensor transformations (reshaping, splitting for grouped-query attention, permutations) while maintaining the factored representation. The LoraTorchTensor class overrides standard tensor operations:
- `__getitem__`: Applies indexing to both A and B consistently
- `reshape`: Reshapes B while keeping A's row dimension intact
- `permute`: Routes permutations to the appropriate factor
- `torch.stack` / `torch.cat`: Stacks or concatenates the factors independently
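The core idea behind that abstraction can be shown with a much-simplified, hypothetical stand-in in plain Python (the real LoraTorchTensor wraps torch tensors and handles many more operations). Here, indexing a row of the logical (d x k) tensor touches only B, since row i of B @ A equals B[i] @ A, so the factored form never needs to be materialized:

```python
class PairedLoraTensor:
    """Toy stand-in for LoraTorchTensor: keeps A (r x k) and B (d x r)
    paired as one logical (d x k) tensor without materializing B @ A."""

    def __init__(self, A, B):
        self.A, self.B = A, B

    @property
    def shape(self):
        return (len(self.B), len(self.A[0]))  # logical (d, k)

    def __getitem__(self, i):
        # Row i of (B @ A) is B[i] @ A: only B needs row indexing here
        r, k = len(self.A), len(self.A[0])
        return [sum(self.B[i][j] * self.A[j][t] for j in range(r))
                for t in range(k)]

A = [[1, 0, 2], [0, 1, 3]]            # r=2, k=3
B = [[1, 2], [3, 4], [5, 6], [0, 1]]  # d=4, r=2
t = PairedLoraTensor(A, B)
print(t.shape)  # (4, 3)
print(t[0])     # [1, 2, 8] -- row 0 of B @ A, computed on the fly
```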
The GGUF output embeds:
- `general.type` = "adapter"
- `adapter.type` = "lora"
- `adapter.lora.alpha` = the alpha scaling value from the training configuration
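The real converter writes these pairs through the gguf Python package's writer; as a minimal, format-agnostic sketch (the alpha and rank values below are hypothetical examples), the adapter header reduces to three key-value pairs, from which the runtime can recover the standard LoRA scaling alpha / r:

```python
def adapter_metadata(lora_alpha: float) -> dict:
    """The three KV pairs the conversion embeds in the GGUF header (sketch)."""
    return {
        "general.type": "adapter",
        "adapter.type": "lora",
        "adapter.lora.alpha": float(lora_alpha),
    }

meta = adapter_metadata(16.0)  # alpha value is a hypothetical example
rank = 8                       # rank comes from the A/B tensor shapes, not metadata
print(meta["adapter.lora.alpha"] / rank)  # 2.0 -- the alpha/r scale applied at load
```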