# Principle: ggml-org/llama.cpp LoRA to GGUF Conversion
| Field | Value |
|---|---|
| Principle Name | LoRA to GGUF Conversion |
| Workflow | LoRA_Adapter_Workflow |
| Step | 2 of 5 |
| Domain | Tensor Format Conversion |
| Scope | Converting LoRA adapter tensors from PyTorch/safetensors format to GGUF |
## Overview
### Description
LoRA adapters trained with frameworks like Hugging Face PEFT are stored in PyTorch-native formats (safetensors or .bin). The llama.cpp inference engine uses its own tensor format called GGUF (GGML Universal File Format). This principle covers the theory of converting LoRA adapter tensors from their training format into GGUF, preserving the low-rank factored representation (separate A and B matrices) rather than materializing the full-rank update.
The conversion process must handle tensor name remapping (from PyTorch naming conventions to GGML naming conventions), maintain the factored A/B representation for memory efficiency, and embed adapter metadata (lora_alpha, architecture, file type) as GGUF key-value pairs.
### Usage
This conversion step is required before any LoRA adapter can be used with llama.cpp at runtime, whether for dynamic application or permanent merging. The conversion:
- Translates PyTorch tensor names (e.g., `base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight`) to GGML tensor names (e.g., `blk.0.attn_q.weight.lora_a`)
- Preserves the separate A and B matrices as distinct tensors in GGUF
- Stores the lora_alpha scaling parameter as GGUF metadata
- Supports output types: f32, f16, bf16, q8_0, and auto
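The name translation in the first bullet can be sketched with a regular expression and a module-name table. This is a simplified illustration, not the converter's actual code: the real mapping lives in llama.cpp's gguf-py tensor-mapping tables and covers far more modules and architectures than the hypothetical subset below.

```python
import re

# Hypothetical subset of the PEFT -> GGML module-name map (illustration only)
MODULE_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

def remap_lora_name(peft_name: str) -> str:
    """Translate a PEFT LoRA tensor name to a GGML-style adapter tensor name."""
    m = re.fullmatch(
        r"base_model\.model\.model\.layers\.(\d+)\.(.+)\.lora_([AB])\.weight",
        peft_name,
    )
    if m is None:
        raise ValueError(f"unrecognized LoRA tensor name: {peft_name}")
    layer, module, factor = m.groups()
    return f"blk.{layer}.{MODULE_MAP[module]}.weight.lora_{factor.lower()}"

print(remap_lora_name(
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
))  # blk.0.attn_q.weight.lora_a
```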
## Theoretical Basis
The conversion theory rests on the principle that LoRA's low-rank decomposition can be represented in any tensor format without loss of mathematical fidelity, as long as the A and B matrices are preserved separately.
In the training format, each adapted weight has two associated tensors:
```
base_model.model.{layer}.{module}.lora_A.weight  ->  A matrix (r x k)
base_model.model.{layer}.{module}.lora_B.weight  ->  B matrix (d x r)
```
The conversion maps these to GGUF tensors:
```
{ggml_layer}.{ggml_module}.weight.lora_a  ->  A matrix (r x k)
{ggml_layer}.{ggml_module}.weight.lora_b  ->  B matrix (d x r)
```
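The memory argument for keeping the factors separate is easy to make concrete. The sketch below (plain Python, toy dimensions, hypothetical alpha value) materializes the full-rank update delta_W = (alpha / r) * B @ A only to show what the conversion avoids: storing A and B costs r * (d + k) values instead of d * k.

```python
import random

random.seed(0)
d, r, k = 8, 2, 8        # toy dims: output d, rank r, input k
alpha = 16.0             # lora_alpha, carried as GGUF metadata

# A is (r x k), B is (d x r): exactly the shapes stored in both formats
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]

# Materializing the full-rank update (what the conversion deliberately avoids)
scale = alpha / r
delta_W = [
    [scale * sum(B[i][j] * A[j][t] for j in range(r)) for t in range(k)]
    for i in range(d)
]

print(len(delta_W), len(delta_W[0]))  # 8 8  -- the full update is (d x k)
print(r * (d + k), d * k)             # 32 64 -- factored form stores half here
```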
A critical aspect of the conversion is the LoraTorchTensor abstraction, which wraps the paired A and B tensors as a single logical tensor. This allows the conversion pipeline to apply model-specific tensor transformations (reshaping, splitting for grouped-query attention, permutations) while maintaining the factored representation. The LoraTorchTensor class overrides standard tensor operations:
- `__getitem__`: Applies indexing to both A and B consistently
- `reshape`: Reshapes B while keeping A's row dimension intact
- `permute`: Routes permutations to the appropriate factor
- `torch.stack` / `torch.cat`: Stacks or concatenates the factors independently
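The core idea behind that abstraction can be shown with a much-simplified, hypothetical stand-in in plain Python (the real LoraTorchTensor wraps torch tensors and handles many more operations). Here, indexing a row of the logical (d x k) tensor touches only B, since row i of B @ A equals B[i] @ A, so the factored form never needs to be materialized:

```python
class PairedLoraTensor:
    """Toy stand-in for LoraTorchTensor: keeps A (r x k) and B (d x r)
    paired as one logical (d x k) tensor without materializing B @ A."""

    def __init__(self, A, B):
        self.A, self.B = A, B

    @property
    def shape(self):
        return (len(self.B), len(self.A[0]))  # logical (d, k)

    def __getitem__(self, i):
        # Row i of (B @ A) is B[i] @ A: only B needs row indexing here
        r, k = len(self.A), len(self.A[0])
        return [sum(self.B[i][j] * self.A[j][t] for j in range(r))
                for t in range(k)]

A = [[1, 0, 2], [0, 1, 3]]            # r=2, k=3
B = [[1, 2], [3, 4], [5, 6], [0, 1]]  # d=4, r=2
t = PairedLoraTensor(A, B)
print(t.shape)  # (4, 3)
print(t[0])     # [1, 2, 8] -- row 0 of B @ A, computed on the fly
```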
The GGUF output embeds:
- `general.type` = "adapter"
- `adapter.type` = "lora"
- `adapter.lora.alpha` = the alpha scaling value from the training configuration
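The real converter writes these pairs through the gguf Python package's writer; as a minimal, format-agnostic sketch (the alpha and rank values below are hypothetical examples), the adapter header reduces to three key-value pairs, from which the runtime can recover the standard LoRA scaling alpha / r:

```python
def adapter_metadata(lora_alpha: float) -> dict:
    """The three KV pairs the conversion embeds in the GGUF header (sketch)."""
    return {
        "general.type": "adapter",
        "adapter.type": "lora",
        "adapter.lora.alpha": float(lora_alpha),
    }

meta = adapter_metadata(16.0)  # alpha value is a hypothetical example
rank = 8                       # rank comes from the A/B tensor shapes, not metadata
print(meta["adapter.lora.alpha"] / rank)  # 2.0 -- the alpha/r scale applied at load
```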