Principle:Ollama Ollama GGUF Model Conversion Llama Adapter

Revision as of 17:55, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Ollama_Ollama_GGUF_Model_Conversion_Llama_Adapter.md)
Knowledge Sources
Domains: Model Conversion, LoRA
Last Updated: 2025-02-15 00:00 GMT

Overview

Llama LoRA adapter conversion transforms HuggingFace LoRA (Low-Rank Adaptation) adapter weights into GGUF format. It covers three concerns: the tensor naming conventions for the LoRA A/B matrices, automatic detection and correction of transposed shapes, and Q/K weight repacking to match the base model's interleaved head layout.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • base_model.model. -> (stripped)
  • model.layers -> blk
  • self_attn.q_proj -> attn_q
  • self_attn.k_proj -> attn_k
  • self_attn.v_proj -> attn_v
  • self_attn.o_proj -> attn_output
  • mlp.gate_proj -> ffn_gate
  • mlp.down_proj -> ffn_down
  • mlp.up_proj -> ffn_up
  • lora_A.weight / lora_a -> weight.lora_a
  • lora_B.weight / lora_b -> weight.lora_b
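The mapping above can be sketched as an ordered list of string replacements. This is a minimal illustration, not the converter's actual code: the names ggufNames and toGGUFName are hypothetical, and the use of strings.NewReplacer is an assumption about how such a table could be applied.

```go
package main

import (
	"fmt"
	"strings"
)

// ggufNames is a hypothetical replacer built from the mapping listed above.
// Arguments are (old, new) pairs; earlier pairs take priority when two
// patterns match at the same position.
var ggufNames = strings.NewReplacer(
	"base_model.model.", "",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
	"lora_A.weight", "weight.lora_a",
	"lora_B.weight", "weight.lora_b",
	"lora_a", "weight.lora_a",
	"lora_b", "weight.lora_b",
)

// toGGUFName rewrites a HuggingFace LoRA tensor name into its GGUF form.
func toGGUFName(hfName string) string {
	return ggufNames.Replace(hfName)
}

func main() {
	fmt.Println(toGGUFName("base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"))
	// prints "blk.0.attn_q.weight.lora_a"
	fmt.Println(toGGUFName("base_model.model.model.layers.3.mlp.down_proj.lora_B.weight"))
	// prints "blk.3.ffn_down.weight.lora_b"
}
```

The replacer scans the input once and does not rescan its own output, so the already-rewritten weight.lora_a is not matched again by the shorter lora_a pattern.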

Architecture-Specific Hyperparameters

The adapter GGUF file uses the following metadata:

  • general.architecture -- set to llama (matching the base model)
  • llama.attention.head_count -- copied from the base model KV store
  • llama.attention.head_count_kv -- copied from the base model KV store

The adapter converter uses the AdapterParameters base (not ModelParameters) and reads head counts from the base model's GGUF config at conversion time.
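As a rough sketch of the metadata step, the adapter's key-value store can be built by setting the architecture and copying the two head-count keys from the base model. The kv type and the adapterKV function below are hypothetical stand-ins, not the converter's real types; only the three key names come from the list above.

```go
package main

import "fmt"

// kv is a hypothetical stand-in for a GGUF key-value metadata store.
type kv map[string]any

// adapterKV builds adapter metadata: it fixes the architecture to "llama"
// and copies only the attention head counts from the base model's store.
func adapterKV(base kv) kv {
	out := kv{"general.architecture": "llama"}
	for _, key := range []string{
		"llama.attention.head_count",
		"llama.attention.head_count_kv",
	} {
		if v, ok := base[key]; ok {
			out[key] = v
		}
	}
	return out
}

func main() {
	// Example base-model values; the numbers are made up for illustration.
	base := kv{
		"llama.attention.head_count":    uint32(32),
		"llama.attention.head_count_kv": uint32(8),
		"llama.block_count":             uint32(32),
	}
	fmt.Println(adapterKV(base))
}
```

Other base-model keys, such as llama.block_count in this example, are deliberately not carried over in the sketch.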

Special Handling

Automatic Shape Detection and Transposition

The converter detects when a LoRA tensor has its dimensions swapped (shape[0] > shape[1] for lora_a, or shape[0] < shape[1] for lora_b) and automatically applies a transpose operation. Two repacker variants cover the two cases:

  • repack -- for correctly-oriented tensors (applies only Q/K head permutation)
  • repackAndTranspose -- for transposed tensors (applies transpose first, then Q/K head permutation)
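The detection rule can be sketched as a small predicate plus a plain row-major transpose. The function names needsTranspose and transpose are hypothetical, and matching on the weight.lora_a / weight.lora_b name suffix is an assumption based on the GGUF names in the mapping section.

```go
package main

import (
	"fmt"
	"strings"
)

// needsTranspose reports whether a 2-D LoRA tensor looks transposed,
// using the shape rule described above.
func needsTranspose(name string, shape []uint64) bool {
	if len(shape) != 2 {
		return false
	}
	switch {
	case strings.HasSuffix(name, "weight.lora_a"):
		return shape[0] > shape[1]
	case strings.HasSuffix(name, "weight.lora_b"):
		return shape[0] < shape[1]
	}
	return false
}

// transpose returns the transpose of a row-major rows x cols matrix.
// It assumes len(data) == rows*cols.
func transpose(data []float32, rows, cols int) []float32 {
	out := make([]float32, len(data))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out[c*rows+r] = data[r*cols+c]
		}
	}
	return out
}

func main() {
	fmt.Println(needsTranspose("blk.0.attn_q.weight.lora_a", []uint64{4096, 8})) // prints "true"
	fmt.Println(needsTranspose("blk.0.attn_q.weight.lora_b", []uint64{4096, 8})) // prints "false"
	fmt.Println(transpose([]float32{1, 2, 3, 4, 5, 6}, 2, 3))                    // prints "[1 4 2 5 3 6]"
}
```

A converter following this scheme would select the transposing repacker when the predicate is true and the plain one otherwise.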

Q/K Head Permutation

LoRA A matrices for attn_q and attn_k tensors undergo the same interleaved-to-contiguous head permutation as full model weights. The tensor is reshaped to [heads, 2, head_dim/2, rank], transposed to [heads, head_dim/2, 2, rank], then flattened back. This ensures LoRA updates are applied in the correct head layout expected by GGML.
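The reshape, axis swap, and flatten described above can be written out with plain index arithmetic on a flat row-major buffer. This is an illustrative sketch only: permuteHeads and needsHeadPermutation are hypothetical names, the real converter may rely on a tensor library instead of explicit loops, and the name-suffix check is an assumption based on the GGUF names used on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// needsHeadPermutation reports whether a tensor is a LoRA A matrix for
// attn_q or attn_k; other tensors pass through unchanged in this sketch.
func needsHeadPermutation(name string) bool {
	return strings.HasSuffix(name, "attn_q.weight.lora_a") ||
		strings.HasSuffix(name, "attn_k.weight.lora_a")
}

// permuteHeads treats data as [heads, 2, headDim/2, rank] in row-major
// order, swaps the two middle axes to get [heads, headDim/2, 2, rank],
// and returns the flattened result.
func permuteHeads(data []float32, heads, rank int) ([]float32, error) {
	if heads <= 0 || rank <= 0 || len(data)%(heads*2*rank) != 0 {
		return nil, fmt.Errorf("cannot split %d values into %d heads of rank %d", len(data), heads, rank)
	}
	half := len(data) / (heads * 2 * rank) // headDim / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for t := 0; t < 2; t++ {
			for d := 0; d < half; d++ {
				src := ((h*2+t)*half + d) * rank
				dst := ((h*half+d)*2 + t) * rank
				copy(out[dst:dst+rank], data[src:src+rank])
			}
		}
	}
	return out, nil
}

func main() {
	// Two heads, headDim 4, rank 1: each head's rows [a b c d] become [a c b d].
	out, err := permuteHeads([]float32{0, 1, 2, 3, 4, 5, 6, 7}, 2, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "[0 2 1 3 4 6 5 7]"
}
```

Within each head, the first half and second half of the rows are interleaved pairwise, and each row of length rank moves as a unit.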

Base Model Dependency

Unlike full model converters, the adapter converter requires access to the base model's GGUF config (fs.Config) to read attention head counts. The config is passed in through the KV(baseKV) method, whose signature differs from the standard KV(tokenizer) interface.

Implementation Notes

The conversion is implemented in convert/convert_llama_adapter.go via the llamaAdapter struct, which satisfies the AdapterConverter interface (distinct from ModelConverter). The struct stores NumAttentionHeads, which is populated from the base model config during KV() generation. LoRA adapters for non-attention layers (MLP projections) pass through without repacking.
