Principle:Ollama Ollama GGUF Model Conversion Llama Adapter

Knowledge Sources	Ollama
Domains	Model Conversion, LoRA
Last Updated	2025-02-15 00:00 GMT

Overview

Llama LoRA adapter conversion transforms HuggingFace LoRA (Low-Rank Adaptation) adapter weights into GGUF format. It handles three things: the tensor naming conventions for the LoRA A/B matrices, automatic detection and correction of transposed shapes, and Q/K weight repacking so that the adapter matches the head layout of the converted base model.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

base_model.model. -> (stripped)
model.layers -> blk
self_attn.q_proj -> attn_q
self_attn.k_proj -> attn_k
self_attn.v_proj -> attn_v
self_attn.o_proj -> attn_output
mlp.gate_proj -> ffn_gate
mlp.down_proj -> ffn_down
mlp.up_proj -> ffn_up
lora_A.weight / lora_a -> weight.lora_a
lora_B.weight / lora_b -> weight.lora_b
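
The mapping above can be sketched with Go's strings.NewReplacer, which takes old/new pairs and rewrites a name in one left-to-right pass. The pairs are the ones listed above; the helper name convertName and the sample tensor names are illustrative assumptions, not taken from the Ollama source.

```go
package main

import (
	"fmt"
	"strings"
)

// replacer holds the HuggingFace-to-GGUF pairs as old, new, old, new, ...
// Matching is case-sensitive, so "lora_A.weight" and "lora_a" are distinct keys.
var replacer = strings.NewReplacer(
	"base_model.model.", "",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
	"lora_A.weight", "weight.lora_a",
	"lora_B.weight", "weight.lora_b",
	"lora_a", "weight.lora_a",
	"lora_b", "weight.lora_b",
)

// convertName is a hypothetical helper that maps one HuggingFace tensor
// name to its GGUF equivalent.
func convertName(hf string) string {
	return replacer.Replace(hf)
}

func main() {
	names := []string{
		"base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight",
		"base_model.model.model.layers.3.mlp.down_proj.lora_b",
	}
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, convertName(name))
	}
}
```

Because the replacer does not rescan text it has already emitted, the inserted "weight.lora_a" is not matched again by the shorter "lora_a" key.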

Architecture-Specific Hyperparameters

The adapter GGUF file uses the following metadata:

general.architecture -- set to llama (matching the base model)
llama.attention.head_count -- copied from the base model KV store
llama.attention.head_count_kv -- copied from the base model KV store

The adapter converter uses the AdapterParameters base (not ModelParameters) and reads head counts from the base model's GGUF config at conversion time.
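
As a rough illustration of the metadata step, the sketch below builds the three keys listed above from a base model key-value store. The KV type and the adapterKV helper are simplified stand-ins invented for this example, and storing head counts as uint32 is an assumption; the real converter types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// KV is a simplified stand-in for a GGUF key-value store (assumption).
type KV map[string]any

// adapterKV is a hypothetical helper. It returns the adapter metadata and
// the attention head count read from the base model, which a converter
// would keep for the later Q/K repacking step.
func adapterKV(base KV) (KV, uint32, error) {
	heads, ok := base["llama.attention.head_count"].(uint32)
	if !ok {
		return nil, 0, errors.New("base model is missing llama.attention.head_count")
	}

	kv := KV{
		"general.architecture":          "llama",
		"llama.attention.head_count":    heads,
		"llama.attention.head_count_kv": base["llama.attention.head_count_kv"],
	}
	return kv, heads, nil
}

func main() {
	base := KV{
		"llama.attention.head_count":    uint32(32),
		"llama.attention.head_count_kv": uint32(8),
	}
	kv, heads, err := adapterKV(base)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(kv["general.architecture"], heads, kv["llama.attention.head_count_kv"])
}
```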

Special Handling

Automatic Shape Detection and Transposition

The converter detects when a LoRA tensor has its dimensions swapped (shape[0] > shape[1] for lora_a, or shape[0] < shape[1] for lora_b) and automatically applies a transpose operation. Two repacker variants handle the two cases:

repack -- for correctly-oriented tensors (applies only Q/K head permutation)
repackAndTranspose -- for transposed tensors (applies transpose first, then Q/K head permutation)
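
The orientation rule can be sketched as follows. The function names needsTranspose and transpose are hypothetical, the tensor names assume the GGUF naming shown earlier, and the data is assumed to be a row-major float32 slice.

```go
package main

import (
	"fmt"
	"strings"
)

// needsTranspose applies the rule described above: a lora_a tensor is
// considered swapped when shape[0] > shape[1], a lora_b tensor when
// shape[0] < shape[1]. Any other tensor is left alone.
func needsTranspose(name string, shape []uint64) bool {
	if len(shape) != 2 {
		return false
	}
	switch {
	case strings.HasSuffix(name, "weight.lora_a"):
		return shape[0] > shape[1]
	case strings.HasSuffix(name, "weight.lora_b"):
		return shape[0] < shape[1]
	}
	return false
}

// transpose returns the transpose of a row-major rows x cols matrix.
// It assumes len(data) == rows*cols.
func transpose(data []float32, rows, cols int) []float32 {
	out := make([]float32, len(data))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out[c*rows+r] = data[r*cols+c]
		}
	}
	return out
}

func main() {
	name := "blk.0.attn_q.weight.lora_a"
	shape := []uint64{3, 2} // more rows than columns: treated as swapped
	data := []float32{1, 2, 3, 4, 5, 6}

	if needsTranspose(name, shape) {
		data = transpose(data, int(shape[0]), int(shape[1]))
		shape[0], shape[1] = shape[1], shape[0]
	}
	fmt.Println(shape, data)
}
```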

Q/K Head Permutation

LoRA A matrices for attn_q and attn_k tensors undergo the same interleaved-to-contiguous head permutation as full model weights. The tensor is reshaped to [heads, 2, head_dim/2, rank], transposed to [heads, head_dim/2, 2, rank], then flattened back. This ensures LoRA updates are applied in the correct head layout expected by GGML.
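
The reshape, transpose, and flatten sequence amounts to an index permutation over the rows of the matrix. Below is a minimal sketch on a flat row-major slice, assuming the layout [heads, 2, head_dim/2, rank] described above; permuteHeads is a hypothetical name, and the actual converter may use a tensor library rather than explicit loops.

```go
package main

import (
	"errors"
	"fmt"
)

// permuteHeads treats data as a row-major [rows, rank] matrix, views it as
// [heads, 2, rows/heads/2, rank], swaps the two middle axes, and flattens
// the result back to [rows, rank].
func permuteHeads(data []float32, heads, rows, rank int) ([]float32, error) {
	if heads <= 0 || rank <= 0 || rows%(2*heads) != 0 || len(data) != rows*rank {
		return nil, errors.New("shape is not compatible with the head count")
	}

	half := rows / heads / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for t := 0; t < 2; t++ {
			for i := 0; i < half; i++ {
				src := ((h*2+t)*half + i) * rank // index in [heads, 2, half, rank]
				dst := ((h*half+i)*2 + t) * rank // index in [heads, half, 2, rank]
				copy(out[dst:dst+rank], data[src:src+rank])
			}
		}
	}
	return out, nil
}

func main() {
	// Two heads, four rows per head, rank 1, so each value is one row.
	data := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	out, err := permuteHeads(data, 2, 8, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

With this input the rows of each head come out in the order 0, 2, 1, 3, which makes it easy to see that the two halves of each head are being woven together.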

Base Model Dependency

Unlike full model converters, the adapter converter requires access to the base model's GGUF config (fs.Config) to read the attention head counts. The config is passed in through the KV(baseKV) method, whose signature differs from the standard KV(tokenizer) interface.

Implementation Notes

The conversion is implemented in convert/convert_llama_adapter.go via the llamaAdapter struct, which satisfies the AdapterConverter interface (distinct from ModelConverter). The struct stores NumAttentionHeads, which is populated from the base model config during KV() generation. LoRA tensors that are not subject to the Q/K head permutation, such as those for the MLP projections, pass through without repacking.

Related Pages

Implementation:Ollama_Ollama_Convert_Llama_Adapter
