Principle:Ollama Ollama GGUF Model Conversion Qwen3Next

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, Qwen
Last Updated: 2025-02-15 00:00 GMT

Overview

The Qwen 3 Next conversion handles a hybrid architecture that combines full attention layers with Gated Delta Net linear attention (recurrent) layers, Mixture-of-Experts with shared experts, partial rotary embeddings, and convolution operations. It transforms the complete model from HuggingFace SafeTensors to GGUF format, with extensive tensor splitting and transformation along the way.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

Embeddings and output:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.norm -> output_norm
  • model.layers -> blk

Full attention:

  • self_attn.{q,k,v}_proj -> attn_{q,k,v}
  • self_attn.{q,k}_norm -> attn_{q,k}_norm
  • self_attn.o_proj -> attn_output

Linear attention (Gated Delta Net):

  • linear_attn.in_proj_qkvz -> ssm_in (then split)
  • linear_attn.in_proj_ba -> ssm_ba
  • linear_attn.conv1d -> ssm_conv1d
  • linear_attn.dt_bias / dt_proj -> ssm_dt
  • linear_attn.A_log -> ssm_a
  • linear_attn.norm -> ssm_norm
  • linear_attn.out_proj -> ssm_out

MoE:

  • mlp.gate.weight -> ffn_gate_inp.weight
  • mlp.shared_expert.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
  • mlp.shared_expert_gate -> ffn_gate_inp_shexp
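The replacements above can be sketched with Go's strings.NewReplacer. This is a minimal illustration covering only a subset of the listed pairs; renameTensor and nameReplacer are hypothetical names, and the actual converter may organize its replacements differently.

```go
package main

import (
	"fmt"
	"strings"
)

// nameReplacer holds a subset of the HuggingFace-to-GGUF replacements listed
// above. strings.NewReplacer compares old strings in argument order and does
// not rescan text it has already replaced.
var nameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.o_proj", "attn_output",
	"linear_attn.in_proj_qkvz", "ssm_in",
	"linear_attn.A_log", "ssm_a",
	"mlp.gate.weight", "ffn_gate_inp.weight",
	"mlp.shared_expert_gate", "ffn_gate_inp_shexp",
)

// renameTensor is a hypothetical helper, not the converter's own function.
func renameTensor(name string) string {
	return nameReplacer.Replace(name)
}

func main() {
	fmt.Println(renameTensor("model.layers.0.self_attn.q_proj.weight"))
	fmt.Println(renameTensor("model.layers.3.linear_attn.A_log"))
}
```

Keeping the pairs in one flat list makes the mapping easy to audit against the tables above.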

Architecture-Specific Hyperparameters

The GGUF metadata is written under the qwen3next architecture:

Core:

  • block_count, context_length, embedding_length, feed_forward_length
  • attention.head_count, key_length, value_length
  • attention.layer_norm_rms_epsilon
  • attention.head_count_kv -- per-layer array (0 for recurrent, num_kv_heads for attention layers)
  • full_attention_interval -- how often full attention layers occur
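The per-layer attention.head_count_kv array can be sketched as follows. The sketch assumes that every full_attention_interval-th layer (counting from 1) is a full attention layer and all others are recurrent; that placement rule is an assumption here, and kvHeadsPerLayer is a hypothetical helper name.

```go
package main

import "fmt"

// kvHeadsPerLayer returns one entry per layer: numKVHeads for full attention
// layers and 0 for recurrent layers.
// Assumption: layer i (0-based) is full attention when (i+1)%interval == 0.
func kvHeadsPerLayer(blockCount, interval, numKVHeads int) []uint32 {
	if blockCount <= 0 || interval <= 0 {
		return nil // invalid input; a real converter would report an error
	}
	out := make([]uint32, blockCount)
	for i := range out {
		if (i+1)%interval == 0 {
			out[i] = uint32(numKVHeads)
		}
	}
	return out
}

func main() {
	// Illustrative values only, not taken from a real model config.
	fmt.Println(kvHeadsPerLayer(8, 4, 2))
}
```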

RoPE:

  • rope.freq_base, rope.dimension_count (partial rotary)
  • rope.scaling.type, rope.scaling.factor

SSM/Linear attention:

  • ssm.inner_size -- value_head_dim * num_value_heads
  • ssm.state_size -- key head dimension
  • ssm.group_count -- number of key heads
  • ssm.time_step_rank -- number of value heads
  • ssm.conv_kernel -- convolution kernel dimension
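As a rough sketch, the SSM keys above could be derived from the linear attention dimensions like this. The linearAttnConfig struct, its field names, and the qwen3next. key prefix are assumptions made for illustration; only the right-hand-side formulas come from the list above.

```go
package main

import "fmt"

// linearAttnConfig is a hypothetical container for the linear attention
// dimensions read from the model config.
type linearAttnConfig struct {
	NumKeyHeads, NumValueHeads uint32
	KeyHeadDim, ValueHeadDim   uint32
	ConvKernelDim              uint32
}

// ssmMetadata maps the dimensions to the ssm.* keys described above.
// The "qwen3next." prefix is assumed from the architecture name.
func ssmMetadata(c linearAttnConfig) map[string]uint32 {
	return map[string]uint32{
		"qwen3next.ssm.inner_size":     c.ValueHeadDim * c.NumValueHeads,
		"qwen3next.ssm.state_size":     c.KeyHeadDim,
		"qwen3next.ssm.group_count":    c.NumKeyHeads,
		"qwen3next.ssm.time_step_rank": c.NumValueHeads,
		"qwen3next.ssm.conv_kernel":    c.ConvKernelDim,
	}
}

func main() {
	// Illustrative values only.
	cfg := linearAttnConfig{
		NumKeyHeads: 16, NumValueHeads: 32,
		KeyHeadDim: 128, ValueHeadDim: 128,
		ConvKernelDim: 4,
	}
	fmt.Println(ssmMetadata(cfg)["qwen3next.ssm.inner_size"])
}
```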

MoE:

  • expert_count, expert_used_count, norm_top_k_prob
  • expert_feed_forward_length, expert_shared_feed_forward_length

Special Handling

QKVZ Tensor Splitting

The fused ssm_in tensor (containing Q, K, V, and Z gate projections) is split into two output tensors:

  • attn_qkv -- concatenation of Q, K, V projections
  • attn_gate -- Z gate projection

The split logic reshapes to [hidden, num_k_heads, qkvz_dim], slices each component (Q: head_k_dim, K: head_k_dim, V: v_per_head, Z: v_per_head), reshapes Q/K/V into contiguous blocks, and concatenates Q+K+V. The Z component becomes the gate tensor.
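The regrouping described above can be sketched on a simplified layout. The sketch assumes the fused tensor is given as a list of rows (one per output feature) ordered per key head as [Q | K | V | Z]; the real converter works on tensor data with its own axis order, and splitQKVZ is a hypothetical name rather than the qkvzSplitSpec logic itself.

```go
package main

import "fmt"

// splitQKVZ regroups rows laid out per key head as [Q | K | V | Z] into a
// contiguous Q+K+V block and a separate Z (gate) block.
// Assumption: rows are ordered head by head along the output dimension.
func splitQKVZ(rows [][]float32, numKHeads, headKDim, vPerHead int) (qkv, gate [][]float32, err error) {
	perHead := 2*headKDim + 2*vPerHead // the qkvz_dim of one key head
	if numKHeads <= 0 || len(rows) != numKHeads*perHead {
		return nil, nil, fmt.Errorf("expected %d rows, got %d", numKHeads*perHead, len(rows))
	}
	var q, k, v, z [][]float32
	for h := 0; h < numKHeads; h++ {
		g := rows[h*perHead : (h+1)*perHead]
		q = append(q, g[:headKDim]...)
		k = append(k, g[headKDim:2*headKDim]...)
		v = append(v, g[2*headKDim:2*headKDim+vPerHead]...)
		z = append(z, g[2*headKDim+vPerHead:]...)
	}
	// Concatenate all Q rows, then all K rows, then all V rows.
	qkv = append(append(q, k...), v...)
	return qkv, z, nil
}

func main() {
	// 2 key heads, head_k_dim = 1, v_per_head = 2; each row holds its own index.
	rows := make([][]float32, 12)
	for i := range rows {
		rows[i] = []float32{float32(i)}
	}
	qkv, gate, err := splitQKVZ(rows, 2, 1, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(qkv), len(gate))
}
```

Using the row index as the row content makes it easy to check by hand which input rows end up in attn_qkv and which in attn_gate.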

A_log to SSM_A Transformation

The linear_attn.A_log tensor is replaced by -exp(A_log) at conversion time, so the negated exponential does not have to be recomputed at runtime.
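A minimal sketch of this element-wise transform, assuming the tensor data is available as a flat float32 slice (ssmAFromALog is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"math"
)

// ssmAFromALog returns -exp(a) for every element of aLog, leaving the input
// slice untouched.
func ssmAFromALog(aLog []float32) []float32 {
	out := make([]float32, len(aLog))
	for i, a := range aLog {
		out[i] = float32(-math.Exp(float64(a)))
	}
	return out
}

func main() {
	fmt.Println(ssmAFromALog([]float32{0, 1}))
}
```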

Normalization Weight Offset

All *_norm.weight tensors except ssm_norm have 1.0 added to their values. This follows the Gemma convention, in which norm weights are stored zero-centered.
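The offset rule can be sketched as a name-based filter. The suffix matching below is an assumption about how the tensors are selected, and addNormOffset is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// addNormOffset returns a copy of w with 1.0 added when name is a norm weight
// other than ssm_norm; otherwise it returns w unchanged.
// Assumption: tensors are selected by the "_norm.weight" name suffix.
func addNormOffset(name string, w []float32) []float32 {
	if !strings.HasSuffix(name, "_norm.weight") || strings.HasSuffix(name, "ssm_norm.weight") {
		return w
	}
	out := make([]float32, len(w))
	for i, v := range w {
		out[i] = v + 1
	}
	return out
}

func main() {
	fmt.Println(addNormOffset("blk.0.attn_q_norm.weight", []float32{0, 0.5}))
	fmt.Println(addNormOffset("blk.0.ssm_norm.weight", []float32{0, 0.5}))
}
```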

Conv1d Weight Squeezing

Convolution weights with 3D shapes containing a singleton dimension ([1, D, K] or [D, 1, K]) are squeezed to 2D [D, K].

Shared Expert Gate Squeezing

The ffn_gate_inp_shexp weight with shape [D, 1] or [1, D] is squeezed to 1D [D].
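Both squeezing rules above (convolution weights and the shared expert gate) only change the recorded shape, since dropping a singleton dimension leaves the element order intact. A sketch on shape slices, with hypothetical helper names:

```go
package main

import "fmt"

// squeezeConv1d turns [1, D, K] or [D, 1, K] into [D, K]; other shapes are
// returned unchanged.
func squeezeConv1d(shape []uint64) []uint64 {
	if len(shape) == 3 {
		switch {
		case shape[0] == 1:
			return []uint64{shape[1], shape[2]}
		case shape[1] == 1:
			return []uint64{shape[0], shape[2]}
		}
	}
	return shape
}

// squeezeSharedGate turns [D, 1] or [1, D] into [D]; other shapes are
// returned unchanged.
func squeezeSharedGate(shape []uint64) []uint64 {
	if len(shape) == 2 {
		switch {
		case shape[1] == 1:
			return []uint64{shape[0]}
		case shape[0] == 1:
			return []uint64{shape[1]}
		}
	}
	return shape
}

func main() {
	// Illustrative shapes only.
	fmt.Println(squeezeConv1d([]uint64{8192, 1, 4}))
	fmt.Println(squeezeSharedGate([]uint64{1, 2048}))
}
```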

Expert Tensor Merging

Individual expert tensors are merged into stacked tensors for gate, up, and down projections per layer.
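Conceptually, merging stacks the per-expert tensors along a new leading expert dimension. The sketch below assumes each expert's weights are available as a flat slice of equal length; stackExperts is a hypothetical helper and does not reflect how the converter streams tensor data.

```go
package main

import "fmt"

// stackExperts concatenates per-expert weights in expert order, which is
// equivalent to a stacked tensor of shape [numExperts, ...] in row-major order.
func stackExperts(experts [][]float32) ([]float32, error) {
	if len(experts) == 0 {
		return nil, fmt.Errorf("no expert tensors to merge")
	}
	size := len(experts[0])
	out := make([]float32, 0, size*len(experts))
	for i, e := range experts {
		if len(e) != size {
			return nil, fmt.Errorf("expert %d has %d elements, want %d", i, len(e), size)
		}
		out = append(out, e...)
	}
	return out, nil
}

func main() {
	merged, err := stackExperts([][]float32{{1, 2}, {3, 4}})
	if err != nil {
		panic(err)
	}
	fmt.Println(merged)
}
```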

Partial Rotary Factor

Only a fraction of the head dimension uses RoPE, determined by partial_rotary_factor. The rope.dimension_count is computed as head_dim * partial_rotary_factor.
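The computation is a single multiplication. The sketch below assumes the result is truncated to an integer; the rounding behavior and the helper name ropeDimensionCount are assumptions.

```go
package main

import "fmt"

// ropeDimensionCount returns head_dim * partial_rotary_factor as an integer.
// Assumption: the product is truncated toward zero.
func ropeDimensionCount(headDim uint32, partialRotaryFactor float32) uint32 {
	return uint32(float32(headDim) * partialRotaryFactor)
}

func main() {
	// Illustrative values only.
	fmt.Println(ropeDimensionCount(256, 0.25))
}
```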

Strict Validation

The parseMore method performs extensive validation of config parameters, ensuring required fields are present and that the full_attention_interval produces at least one full attention layer.
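The kind of checks described above might look like the following sketch. It is not the parseMore implementation: the hfConfig struct, its fields, and the error messages are hypothetical, and the last check assumes that every full_attention_interval-th layer is a full attention layer.

```go
package main

import (
	"errors"
	"fmt"
)

// hfConfig is a hypothetical subset of the model config.
type hfConfig struct {
	NumHiddenLayers       uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32
	FullAttentionInterval uint32
}

// validate rejects configs with missing required fields or with an interval
// that yields no full attention layer.
func (c hfConfig) validate() error {
	switch {
	case c.NumHiddenLayers == 0:
		return errors.New("num_hidden_layers must be set")
	case c.NumAttentionHeads == 0:
		return errors.New("num_attention_heads must be set")
	case c.NumKeyValueHeads == 0:
		return errors.New("num_key_value_heads must be set")
	case c.FullAttentionInterval == 0:
		return errors.New("full_attention_interval must be set")
	case c.FullAttentionInterval > c.NumHiddenLayers:
		// Assumption: layer i is full attention when (i+1)%interval == 0,
		// so an interval larger than the layer count never matches.
		return fmt.Errorf("full_attention_interval %d yields no full attention layer in %d layers",
			c.FullAttentionInterval, c.NumHiddenLayers)
	}
	return nil
}

func main() {
	cfg := hfConfig{NumHiddenLayers: 8, NumAttentionHeads: 16, NumKeyValueHeads: 2, FullAttentionInterval: 4}
	fmt.Println(cfg.validate())
}
```

Failing early at conversion time is cheaper than discovering a malformed GGUF file when the model is loaded.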

Implementation Notes

The conversion is implemented in convert/convert_qwen3next.go via the qwen3NextModel struct. The qkvzSplitSpec struct encapsulates the dimension calculations for the QKVZ split. The tokenizer pre-processor is set to qwen2. This is one of the most complex converters due to the hybrid attention-recurrence architecture and extensive tensor transformations.
