Principle:Ollama Ollama GGUF Model Conversion Llama Adapter
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, LoRA |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Llama LoRA adapter conversion transforms HuggingFace LoRA (Low-Rank Adaptation) adapter weights into GGUF format, handling the specific tensor naming conventions for LoRA A/B matrices, automatic shape transposition detection, and Q/K weight repacking to match the base model's interleaved head layout.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- base_model.model. -> (stripped)
- model.layers -> blk
- self_attn.q_proj -> attn_q
- self_attn.k_proj -> attn_k
- self_attn.v_proj -> attn_v
- self_attn.o_proj -> attn_output
- mlp.gate_proj -> ffn_gate
- mlp.down_proj -> ffn_down
- mlp.up_proj -> ffn_up
- lora_A.weight / lora_a -> weight.lora_a
- lora_B.weight / lora_b -> weight.lora_b
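As a rough illustration of how these replacements compose, the sketch below applies them with Go's standard strings.NewReplacer. The renameTensor helper and the sample tensor name are assumptions made for this example, not identifiers taken from the converter; only the replacement pairs come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// adapterReplacer holds the old/new pairs from the mapping list.
// strings.NewReplacer scans the input left to right and, at each
// position, tries the pairs in argument order.
var adapterReplacer = strings.NewReplacer(
	"base_model.model.", "",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
	"lora_A.weight", "weight.lora_a",
	"lora_B.weight", "weight.lora_b",
	"lora_a", "weight.lora_a",
	"lora_b", "weight.lora_b",
)

// renameTensor is a hypothetical helper that maps a HuggingFace
// LoRA tensor name to its GGUF-style name.
func renameTensor(name string) string {
	return adapterReplacer.Replace(name)
}

func main() {
	// Illustrative input name, assumed for this example.
	fmt.Println(renameTensor("base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"))
}
```

With this input the sketch yields blk.0.attn_q.weight.lora_a. Because the replacer never rescans its own output, the weight.lora_a produced by one pair is not rewritten again by the lowercase lora_a pair.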
Architecture-Specific Hyperparameters
The adapter GGUF file uses the following metadata:
- general.architecture -- set to llama (matching the base model)
- llama.attention.head_count -- copied from the base model KV store
- llama.attention.head_count_kv -- copied from the base model KV store
The adapter converter uses the AdapterParameters base (not ModelParameters) and reads head counts from the base model's GGUF config at conversion time.
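The metadata step can be pictured with the minimal sketch below. It assumes a plain map as a stand-in for the base model's KV store; the adapterKV function and the map type are hypothetical and do not reproduce the converter's actual types or method signatures.

```go
package main

import "fmt"

// adapterKV builds adapter metadata from a base model's key/value store.
// The map[string]any type is an assumption standing in for the real
// base-model config type.
func adapterKV(base map[string]any) (map[string]any, error) {
	kv := map[string]any{"general.architecture": "llama"}

	// Copy the head counts verbatim from the base model.
	for _, key := range []string{
		"llama.attention.head_count",
		"llama.attention.head_count_kv",
	} {
		v, ok := base[key]
		if !ok {
			return nil, fmt.Errorf("base model is missing %q", key)
		}
		kv[key] = v
	}
	return kv, nil
}

func main() {
	// Example values chosen for illustration only.
	base := map[string]any{
		"llama.attention.head_count":    uint32(32),
		"llama.attention.head_count_kv": uint32(8),
	}
	kv, err := adapterKV(base)
	if err != nil {
		panic(err)
	}
	fmt.Println(kv["general.architecture"], kv["llama.attention.head_count"], kv["llama.attention.head_count_kv"])
}
```

Returning an error for a missing key is a design choice of this sketch: without the head counts, the Q/K repacking described later cannot be performed correctly.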
Special Handling
Automatic Shape Detection and Transposition
The converter detects when a LoRA tensor has its dimensions swapped (shape[0] > shape[1] for lora_a, or shape[0] < shape[1] for lora_b) and automatically applies a transpose operation. Two repacker variants handle this:
- repack -- for correctly-oriented tensors (applies only Q/K head permutation)
- repackAndTranspose -- for transposed tensors (applies transpose first, then Q/K head permutation)
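The detection rule can be sketched as follows. The needsTranspose and transpose helpers are hypothetical names introduced for this example, and the tensor is assumed to be a row-major float32 slice; the real converter may represent tensors differently.

```go
package main

import (
	"fmt"
	"strings"
)

// needsTranspose applies the shape rule from the text: lora_a is treated
// as swapped when shape[0] > shape[1], lora_b when shape[0] < shape[1].
func needsTranspose(name string, shape []uint64) bool {
	if len(shape) != 2 {
		return false
	}
	switch {
	case strings.HasSuffix(name, "weight.lora_a"):
		return shape[0] > shape[1]
	case strings.HasSuffix(name, "weight.lora_b"):
		return shape[0] < shape[1]
	default:
		return false
	}
}

// transpose returns the cols x rows transpose of a row-major
// rows x cols matrix.
func transpose(data []float32, rows, cols int) []float32 {
	out := make([]float32, len(data))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out[c*rows+r] = data[r*cols+c]
		}
	}
	return out
}

func main() {
	// Shapes are illustrative: a hidden size of 4096 and a rank of 16.
	fmt.Println(needsTranspose("blk.0.attn_q.weight.lora_a", []uint64{4096, 16}))
	fmt.Println(needsTranspose("blk.0.attn_q.weight.lora_a", []uint64{16, 4096}))
	fmt.Println(transpose([]float32{1, 2, 3, 4, 5, 6}, 2, 3))
}
```

The heuristic relies on the LoRA rank being smaller than the model dimension, so the short axis identifies the rank.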
Q/K Head Permutation
LoRA A matrices for attn_q and attn_k tensors undergo the same interleaved-to-contiguous head permutation as full model weights. The tensor is reshaped to [heads, 2, head_dim/2, rank], transposed to [heads, head_dim/2, 2, rank], then flattened back. This ensures LoRA updates are applied in the correct head layout expected by GGML.
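The reshape, transpose, and flatten sequence amounts to an index remapping, sketched below without a tensor library. The permuteHeads function is hypothetical, and it assumes the data is a row-major matrix of heads*head_dim rows by rank columns; the real converter may lay the tensor out differently.

```go
package main

import "fmt"

// permuteHeads views data as [heads, 2, half, rank], swaps the two middle
// axes to get [heads, half, 2, rank], and returns the flattened result.
// rows is heads*head_dim and half is head_dim/2.
func permuteHeads(data []float32, rows, rank, heads int) ([]float32, error) {
	if heads <= 0 || rank <= 0 || rows%(2*heads) != 0 || len(data) != rows*rank {
		return nil, fmt.Errorf("invalid dimensions: rows=%d rank=%d heads=%d len=%d",
			rows, rank, heads, len(data))
	}
	half := rows / heads / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for p := 0; p < 2; p++ {
			for i := 0; i < half; i++ {
				src := ((h*2+p)*half + i) * rank // row (h, p, i) in the source view
				dst := ((h*half+i)*2 + p) * rank // row (h, i, p) in the result view
				copy(out[dst:dst+rank], data[src:src+rank])
			}
		}
	}
	return out, nil
}

func main() {
	// One head, head_dim 4, rank 2: rows r0..r3 are reordered to r0, r2, r1, r3.
	data := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	out, err := permuteHeads(data, 4, 2, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Each row of length rank moves as a unit, so the permutation only reorders rows within a head and never mixes values across heads or across the rank axis.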
Base Model Dependency
Unlike full model converters, the adapter converter requires access to the base model's GGUF config (fs.Config) to read attention head counts. The config is passed in through the KV(baseKV) method, whose signature differs from the standard KV(tokenizer) interface.
Implementation Notes
The conversion is implemented in convert/convert_llama_adapter.go by the llamaAdapter struct, which satisfies the AdapterConverter interface (distinct from ModelConverter). The struct stores NumAttentionHeads, which is populated from the base model config during KV() generation. LoRA adapters for non-attention layers (MLP projections) pass through without repacking.