Principle:Ollama Ollama GGUF Model Conversion Qwen3Next
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Qwen |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The Qwen 3 Next conversion handles a hybrid architecture that combines full attention layers with Gated Delta Net linear attention (recurrent) layers, Mixture-of-Experts with shared experts, partial rotary embeddings, and convolution operations. It transforms the complete model from HuggingFace SafeTensors to GGUF format, with extensive tensor splitting and transformation along the way.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
Embeddings and output:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.norm -> output_norm
- model.layers -> blk
Full attention:
- self_attn.{q,k,v}_proj -> attn_{q,k,v}
- self_attn.{q,k}_norm -> attn_{q,k}_norm
- self_attn.o_proj -> attn_output
Linear attention (Gated Delta Net):
- linear_attn.in_proj_qkvz -> ssm_in (then split)
- linear_attn.in_proj_ba -> ssm_ba
- linear_attn.conv1d -> ssm_conv1d
- linear_attn.dt_bias / dt_proj -> ssm_dt
- linear_attn.A_log -> ssm_a
- linear_attn.norm -> ssm_norm
- linear_attn.out_proj -> ssm_out
MoE:
- mlp.gate.weight -> ffn_gate_inp.weight
- mlp.shared_expert.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
- mlp.shared_expert_gate -> ffn_gate_inp_shexp
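The mappings above amount to substring replacement on tensor names. A minimal sketch using the standard library's strings.NewReplacer follows; it covers only a subset of the pairs, expands the brace patterns by hand, and the names nameReplacer and ggufName are illustrative rather than taken from the converter source.

```go
package main

import (
	"fmt"
	"strings"
)

// nameReplacer holds a subset of the HuggingFace-to-GGUF pairs listed above.
// Brace patterns such as {q,k,v} are written out one entry per variant.
var nameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.k_norm", "attn_k_norm",
	"self_attn.o_proj", "attn_output",
	"linear_attn.in_proj_qkvz", "ssm_in",
	"linear_attn.in_proj_ba", "ssm_ba",
	"linear_attn.A_log", "ssm_a",
	"linear_attn.out_proj", "ssm_out",
	"mlp.shared_expert_gate", "ffn_gate_inp_shexp",
)

// ggufName rewrites a HuggingFace tensor name into its GGUF form.
func ggufName(hfName string) string {
	return nameReplacer.Replace(hfName)
}

func main() {
	for _, n := range []string{
		"model.layers.3.self_attn.q_proj.weight",
		"model.layers.0.linear_attn.A_log",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", n, ggufName(n))
	}
}
```

The replacer never rescans its own output, so a replacement such as output for lm_head cannot be rewritten a second time by a later pair.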
Architecture-Specific Hyperparameters
The GGUF metadata is written under the qwen3next architecture:
Core:
- block_count, context_length, embedding_length, feed_forward_length
- attention.head_count, key_length, value_length
- attention.layer_norm_rms_epsilon
- attention.head_count_kv -- per-layer array (0 for recurrent, num_kv_heads for attention layers)
- full_attention_interval -- how often full attention layers occur
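The per-layer attention.head_count_kv array can be sketched as follows. The rule used here, that 0-based layer i is a full attention layer when (i+1) is a multiple of full_attention_interval, is an assumption made for illustration; the function name headCountKV is hypothetical.

```go
package main

import "fmt"

// headCountKV returns one entry per layer: 0 for recurrent (linear attention)
// layers and numKVHeads for full attention layers.
// Assumption: layer i (0-based) is full attention when (i+1)%interval == 0.
func headCountKV(numLayers, interval, numKVHeads int) []uint32 {
	out := make([]uint32, numLayers)
	for i := range out {
		if interval > 0 && (i+1)%interval == 0 {
			out[i] = uint32(numKVHeads)
		}
	}
	return out
}

func main() {
	// 8 layers, every 4th layer is full attention, 2 KV heads.
	fmt.Println(headCountKV(8, 4, 2)) // [0 0 0 2 0 0 0 2]
}
```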
RoPE:
- rope.freq_base, rope.dimension_count (partial rotary)
- rope.scaling.type, rope.scaling.factor
SSM/Linear attention:
- ssm.inner_size -- value_head_dim * num_value_heads
- ssm.state_size -- key head dimension
- ssm.group_count -- number of key heads
- ssm.time_step_rank -- number of value heads
- ssm.conv_kernel -- convolution kernel dimension
MoE:
- expert_count, expert_used_count, norm_top_k_prob
- expert_feed_forward_length, expert_shared_feed_forward_length
Special Handling
QKVZ Tensor Splitting
The fused ssm_in tensor (containing Q, K, V, and Z gate projections) is split into two output tensors:
- attn_qkv -- concatenation of Q, K, V projections
- attn_gate -- Z gate projection
The split logic reshapes to [hidden, num_k_heads, qkvz_dim], slices each component (Q: head_k_dim, K: head_k_dim, V: v_per_head, Z: v_per_head), reshapes Q/K/V into contiguous blocks, and concatenates Q+K+V. The Z component becomes the gate tensor.
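The regrouping can be illustrated on a single column of the fused tensor, where the elements for each key head are laid out as [Q | K | V | Z]. The sketch below uses plain integers as stand-ins for tensor elements and ignores the hidden dimension; splitQKVZ and its parameters are illustrative names, not the converter's actual API.

```go
package main

import "fmt"

// splitQKVZ regroups a fused layout of numKHeads blocks, each ordered
// [Q(headKDim) | K(headKDim) | V(vPerHead) | Z(vPerHead)], into
// qkv = all Q, then all K, then all V, and z = all Z.
func splitQKVZ(fused []int, numKHeads, headKDim, vPerHead int) (qkv, z []int, err error) {
	qkvzDim := 2*headKDim + 2*vPerHead
	if len(fused) != numKHeads*qkvzDim {
		return nil, nil, fmt.Errorf("got %d elements, want %d", len(fused), numKHeads*qkvzDim)
	}
	var q, k, v []int
	for h := 0; h < numKHeads; h++ {
		head := fused[h*qkvzDim : (h+1)*qkvzDim]
		q = append(q, head[:headKDim]...)
		k = append(k, head[headKDim:2*headKDim]...)
		v = append(v, head[2*headKDim:2*headKDim+vPerHead]...)
		z = append(z, head[2*headKDim+vPerHead:]...)
	}
	// Concatenate the contiguous Q, K, and V blocks.
	qkv = append(append(q, k...), v...)
	return qkv, z, nil
}

func main() {
	// Two key heads, headKDim=1, vPerHead=2, so each head has 6 elements.
	fused := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	qkv, z, err := splitQKVZ(fused, 2, 1, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(qkv) // [0 6 1 7 2 3 8 9]
	fmt.Println(z)   // [4 5 10 11]
}
```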
A_log to SSM_A Transformation
The linear_attn.A_log tensor is stored as -exp(A_log). Pre-computing the negated exponential at conversion time means the runtime does not have to compute it.
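As a minimal sketch of this element-wise transform (the function name ssmA is illustrative, and float32 storage is assumed):

```go
package main

import (
	"fmt"
	"math"
)

// ssmA maps each A_log value to -exp(A_log).
func ssmA(aLog []float32) []float32 {
	out := make([]float32, len(aLog))
	for i, v := range aLog {
		out[i] = float32(-math.Exp(float64(v)))
	}
	return out
}

func main() {
	// A_log = 0 maps to -1; larger values map to more negative results.
	fmt.Println(ssmA([]float32{0, 1}))
}
```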
Normalization Weight Offset
All *_norm.weight tensors (except ssm_norm) have 1.0 added to their values, following the Gemma convention where norm weights are zero-centered.
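A sketch of the offset and its exclusion rule is shown below. Matching on name suffixes is an assumption about how the tensors might be selected; needsNormOffset and addOne are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// needsNormOffset reports whether a GGUF tensor name is a norm weight that
// should receive the +1.0 offset. ssm_norm is excluded.
func needsNormOffset(name string) bool {
	return strings.HasSuffix(name, "_norm.weight") &&
		!strings.HasSuffix(name, "ssm_norm.weight")
}

// addOne returns a copy of w with 1.0 added to every element.
func addOne(w []float32) []float32 {
	out := make([]float32, len(w))
	for i, v := range w {
		out[i] = v + 1
	}
	return out
}

func main() {
	fmt.Println(needsNormOffset("blk.0.attn_q_norm.weight")) // true
	fmt.Println(needsNormOffset("blk.0.ssm_norm.weight"))    // false
	fmt.Println(addOne([]float32{0, -0.5}))                  // [1 0.5]
}
```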
Conv1d Weight Squeezing
Convolution weights with 3D shapes containing a singleton dimension ([1, D, K] or [D, 1, K]) are squeezed to 2D [D, K].
The ffn_gate_inp_shexp weight with shape [D, 1] or [1, D] is squeezed to 1D [D].
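Because the removed dimensions have size 1, squeezing changes only the shape metadata and leaves the element order untouched. A minimal sketch of both shape rules, with hypothetical function names:

```go
package main

import "fmt"

// squeezeConv1d drops a singleton first or second dimension from a 3D shape:
// [1, D, K] or [D, 1, K] becomes [D, K]. Other shapes are returned unchanged.
func squeezeConv1d(shape []uint64) []uint64 {
	if len(shape) == 3 {
		switch {
		case shape[0] == 1:
			return []uint64{shape[1], shape[2]}
		case shape[1] == 1:
			return []uint64{shape[0], shape[2]}
		}
	}
	return shape
}

// squeezeSharedGate turns [D, 1] or [1, D] into [D].
func squeezeSharedGate(shape []uint64) []uint64 {
	if len(shape) == 2 {
		switch {
		case shape[1] == 1:
			return []uint64{shape[0]}
		case shape[0] == 1:
			return []uint64{shape[1]}
		}
	}
	return shape
}

func main() {
	fmt.Println(squeezeConv1d([]uint64{8, 1, 4})) // [8 4]
	fmt.Println(squeezeSharedGate([]uint64{1, 16})) // [16]
}
```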
Expert Tensor Merging
Individual expert tensors are merged into stacked tensors for gate, up, and down projections per layer.
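Stacking can be pictured as concatenating equally sized per-expert weight buffers along a new leading expert dimension. The sketch below works on flat float32 slices and uses the hypothetical name stackExperts; it does not reflect how the converter actually represents tensors.

```go
package main

import (
	"errors"
	"fmt"
)

// stackExperts concatenates per-expert buffers in expert order, yielding the
// data for a tensor whose leading dimension is the expert index.
func stackExperts(experts [][]float32) ([]float32, error) {
	if len(experts) == 0 {
		return nil, errors.New("no expert tensors to merge")
	}
	size := len(experts[0])
	out := make([]float32, 0, len(experts)*size)
	for i, e := range experts {
		if len(e) != size {
			return nil, fmt.Errorf("expert %d has %d values, want %d", i, len(e), size)
		}
		out = append(out, e...)
	}
	return out, nil
}

func main() {
	merged, err := stackExperts([][]float32{{1, 2}, {3, 4}, {5, 6}})
	if err != nil {
		panic(err)
	}
	fmt.Println(merged) // [1 2 3 4 5 6]
}
```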
Partial Rotary Factor
Only a fraction of the head dimension uses RoPE, determined by partial_rotary_factor. The rope.dimension_count is computed as head_dim * partial_rotary_factor.
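A worked instance of this computation, using illustrative values (a head dimension of 256 and a factor of 0.25 are assumptions chosen for the example, not values read from a config):

```go
package main

import "fmt"

// ropeDimensionCount returns the number of head dimensions that receive RoPE.
func ropeDimensionCount(headDim uint32, partialRotaryFactor float32) uint32 {
	return uint32(float32(headDim) * partialRotaryFactor)
}

func main() {
	fmt.Println(ropeDimensionCount(256, 0.25)) // 64
}
```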
Strict Validation
The parseMore method performs extensive validation of config parameters, ensuring required fields are present and that the full_attention_interval produces at least one full attention layer.
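The kind of check described here might look like the sketch below. The config struct, its field names, and the error messages are all hypothetical, and the rule that an interval larger than the layer count yields no full attention layer assumes that every interval-th layer is a full attention layer.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical subset of the model parameters.
type config struct {
	NumHiddenLayers       uint32
	FullAttentionInterval uint32
}

// validate rejects configs with missing fields or with an interval that
// would leave the model without any full attention layer.
func validate(c config) error {
	if c.NumHiddenLayers == 0 {
		return errors.New("num_hidden_layers must be set")
	}
	if c.FullAttentionInterval == 0 {
		return errors.New("full_attention_interval must be set")
	}
	if c.FullAttentionInterval > c.NumHiddenLayers {
		return fmt.Errorf("full_attention_interval %d exceeds layer count %d: no full attention layers",
			c.FullAttentionInterval, c.NumHiddenLayers)
	}
	return nil
}

func main() {
	fmt.Println(validate(config{NumHiddenLayers: 48, FullAttentionInterval: 4}))
	fmt.Println(validate(config{NumHiddenLayers: 2, FullAttentionInterval: 4}))
}
```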
Implementation Notes
The conversion is implemented in convert/convert_qwen3next.go via the qwen3NextModel struct. The qkvzSplitSpec struct encapsulates the dimension calculations for the QKVZ split. The tokenizer pre-processor is set to qwen2. This is one of the most complex converters due to the hybrid attention-recurrence architecture and extensive tensor transformations.