Principle:Ollama Ollama GGUF Model Conversion Qwen3

From Leeroopedia
Knowledge Sources
Domains Model Conversion, Qwen
Last Updated 2025-02-15 00:00 GMT

Overview

Qwen 3 conversion covers the Alibaba Qwen 3 architecture in both its dense and Mixture-of-Experts (MoE) variants. It transforms the model from HuggingFace SafeTensors to GGUF format, handling QK normalization, splitting of fused gate-up expert projections, expert tensor transposition, and YaRN and M-RoPE scaling.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.layers -> blk
  • model.norm -> output_norm
  • self_attn.k_proj -> attn_k
  • self_attn.k_norm -> attn_k_norm
  • self_attn.v_proj -> attn_v
  • self_attn.q_proj -> attn_q
  • self_attn.q_norm -> attn_q_norm
  • self_attn.o_proj -> attn_output
  • mlp.{down,gate,up}_proj -> ffn_{down,gate,up}
  • mlp.gate.weight -> ffn_gate_inp.weight (MoE router)
  • mlp.experts.down_proj -> ffn_down_exps.weight
  • mlp.experts.gate_up_proj -> ffn_gate_up_exps.weight
  • post_attention_layernorm -> ffn_norm
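The mapping above can be sketched as a single-pass string replacement. This is a minimal illustration, not the converter's actual code: the names qwen3Replacements and ggufName are hypothetical, and only the old/new pairs come from the list. It relies on the standard library's strings.NewReplacer, which scans the input once and never re-matches text it has already replaced.

```go
package main

import "strings"

// qwen3Replacements holds old/new pairs in the order listed above.
// The variable name is hypothetical; only the pairs come from the text.
var qwen3Replacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.k_proj", "attn_k",
	"self_attn.k_norm", "attn_k_norm",
	"self_attn.v_proj", "attn_v",
	"self_attn.q_proj", "attn_q",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_proj", "ffn_gate",
	"mlp.up_proj", "ffn_up",
	"mlp.gate.weight", "ffn_gate_inp.weight", // MoE router
	"mlp.experts.down_proj", "ffn_down_exps.weight",
	"mlp.experts.gate_up_proj", "ffn_gate_up_exps.weight",
	"post_attention_layernorm", "ffn_norm",
}

// ggufName rewrites a HuggingFace tensor name into its GGUF form.
// Substrings that match no pair (layer indices, ".weight") pass through.
func ggufName(hfName string) string {
	return strings.NewReplacer(qwen3Replacements...).Replace(hfName)
}
```

Under this sketch, ggufName("model.layers.0.self_attn.q_norm.weight") yields "blk.0.attn_q_norm.weight", and the router tensor "model.layers.3.mlp.gate.weight" becomes "blk.3.ffn_gate_inp.weight" without colliding with the dense mlp.gate_proj rule.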

Architecture-Specific Hyperparameters

The GGUF metadata uses a dynamic architecture prefix (qwen3 for dense, qwen3moe for MoE):

  • block_count, context_length, embedding_length, feed_forward_length
  • attention.head_count, attention.head_count_kv
  • attention.key_length, attention.value_length -- explicit head dimension
  • attention.layer_norm_rms_epsilon -- RMSNorm epsilon
  • rope.freq_base -- RoPE theta

MoE parameters (when num_experts > 0):

  • expert_count, expert_used_count
  • norm_top_k_prob -- whether to normalize top-K probabilities

RoPE scaling:

  • rope.scaling.type -- "yarn" for YaRN
  • rope.scaling.factor -- scaling factor array
  • rope.mrope_section -- M-RoPE section sizes (for "mrope"/"default" types)

Special Handling

Dynamic Architecture Selection

The GGUF architecture identifier is chosen at conversion time: qwen3moe when the model declares experts (num_experts > 0), qwen3 otherwise. The same identifier serves as the prefix for the metadata keys listed above.
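A minimal sketch of this selection, and of how the chosen identifier prefixes the metadata keys, might look as follows. The qwen3Config struct, its field names, and both function names are hypothetical stand-ins for illustration. The general.architecture key is the standard GGUF key for the identifier and is not named in the text above.

```go
package main

// qwen3Config is a hypothetical subset of the model's config fields.
type qwen3Config struct {
	HiddenLayers       uint32
	HeadDim            uint32
	NumExperts         uint32
	NumExpertsPerToken uint32
	NormTopKProb       bool
}

// archName returns "qwen3moe" when experts are declared, else "qwen3".
func archName(c qwen3Config) string {
	if c.NumExperts > 0 {
		return "qwen3moe"
	}
	return "qwen3"
}

// metadata builds a few architecture-prefixed GGUF keys.
// MoE keys are emitted only when num_experts > 0.
func metadata(c qwen3Config) map[string]any {
	arch := archName(c)
	kv := map[string]any{
		"general.architecture":           arch, // standard GGUF key (assumption: not named in the text)
		arch + ".block_count":            c.HiddenLayers,
		arch + ".attention.key_length":   c.HeadDim,
		arch + ".attention.value_length": c.HeadDim,
	}
	if c.NumExperts > 0 {
		kv[arch+".expert_count"] = c.NumExperts
		kv[arch+".expert_used_count"] = c.NumExpertsPerToken
		kv[arch+".norm_top_k_prob"] = c.NormTopKProb
	}
	return kv
}
```

Deriving the prefix once and reusing it for every key keeps the dense and MoE paths in a single function, so the two variants differ only in the extra expert keys.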

Fused Gate-Up Expert Splitting and Transposition

MoE gate_up_exps tensors, shaped [experts, hidden, 2*intermediate], are split along dimension 2 into separate gate and up tensors of shape [experts, hidden, intermediate]. Each half is then transposed with the axis permutation (0, 2, 1), which swaps the last two dimensions, and its recorded shape is updated to match. The result is one [experts, intermediate, hidden] tensor for gate and one for up.

Down Expert Transposition

MoE down_exps tensors are transposed from [experts, intermediate, hidden] to [experts, hidden, intermediate].

QK Normalization

Qwen 3 uses separate Q and K normalization layers, mapped to attn_q_norm and attn_k_norm in GGUF.

M-RoPE Support

When the RoPE scaling type is "mrope" or "default", the mrope_section array (specifying the dimension allocation for temporal, height, and width components) is stored in GGUF metadata.
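The branching on the RoPE scaling type can be sketched as below. The ropeScaling struct, its field names, and the ropeMetadata function are hypothetical. The metadata key names and the type strings are the ones given on this page, and the section sizes in the usage note are purely illustrative values.

```go
package main

// ropeScaling is a hypothetical subset of the model's rope scaling config.
type ropeScaling struct {
	Type         string
	Factor       []float32
	MropeSection []int32
}

// ropeMetadata returns the RoPE scaling keys for an architecture prefix.
// YaRN stores the type and factor; "mrope" and "default" store only the
// M-RoPE section sizes. Other types produce no keys in this sketch.
func ropeMetadata(arch string, rs ropeScaling) map[string]any {
	kv := map[string]any{}
	switch rs.Type {
	case "yarn":
		kv[arch+".rope.scaling.type"] = rs.Type
		kv[arch+".rope.scaling.factor"] = rs.Factor
	case "mrope", "default":
		kv[arch+".rope.mrope_section"] = rs.MropeSection
	}
	return kv
}
```

For example, ropeMetadata("qwen3", ropeScaling{Type: "mrope", MropeSection: []int32{24, 20, 20}}) would produce a single qwen3.rope.mrope_section entry and no rope.scaling.* keys.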

Implementation Notes

The conversion is implemented in convert/convert_qwen3.go via the qwen3Model struct. The expert splitting uses the splitDim iterator with an afterFunc callback to apply the transposition. This struct also serves as the base type for the Qwen 3 VL multimodal variant. The ropeFactor type from Phi-3 is reused for YaRN scaling factors.
