Principle:Ollama Ollama GGUF Model Conversion Glm4MoeLite

Knowledge Sources	Ollama
Domains	Model Conversion, MoE
Last Updated	2025-02-15 00:00 GMT

Overview

GLM-4-MoE-Lite conversion handles the ChatGLM-4 Mixture-of-Experts architecture with Multi-head Latent Attention (MLA). It transforms the model from HuggingFace SafeTensors to GGUF format and prepares the weights for MLA absorption by splitting the combined KV_B tensor into separate K and V components, transposing the K component per head.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

lm_head -> output
model.embed_tokens -> token_embd
model.norm -> output_norm
model.layers -> blk
self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
self_attn.kv_a_layernorm -> attn_kv_a_norm
self_attn.kv_b_proj -> attn_kv_b
self_attn.q_a_proj -> attn_q_a
self_attn.q_a_layernorm -> attn_q_a_norm
self_attn.q_b_proj -> attn_q_b
self_attn.o_proj -> attn_output
mlp.shared_experts.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
mlp.gate.e_score_correction_bias -> exp_probs_b.bias
mlp.gate -> ffn_gate_inp
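
The mapping above can be sketched with Go's strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names tensorRenamer and renameTensor are hypothetical. One detail worth noting is ordering: strings.NewReplacer compares old strings in argument order, so the more specific mlp.gate.e_score_correction_bias entry has to precede the shorter mlp.gate entry.

```go
package main

import (
	"fmt"
	"strings"
)

// tensorRenamer (hypothetical name) holds the replacement pairs listed above.
// Old strings are compared in argument order at each position, so
// "mlp.gate.e_score_correction_bias" must appear before "mlp.gate".
var tensorRenamer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// renameTensor converts a HuggingFace tensor name to its GGUF name.
func renameTensor(name string) string {
	return tensorRenamer.Replace(name)
}

func main() {
	for _, name := range []string{
		"model.layers.3.self_attn.kv_b_proj.weight",
		"model.layers.3.mlp.gate.e_score_correction_bias",
		"model.layers.3.mlp.gate.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

If the two mlp.gate entries were swapped, the bias tensor would be renamed to ffn_gate_inp.e_score_correction_bias instead of exp_probs_b.bias.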

Architecture-Specific Hyperparameters

The GGUF metadata is written under the glm4moelite.* namespace:

glm4moelite.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
glm4moelite.attention.kv_lora_rank -- KV LoRA rank for MLA
glm4moelite.attention.q_lora_rank -- Q LoRA rank
glm4moelite.attention.value_length -- V head dimension
glm4moelite.attention.key_length_mla -- kv_lora_rank + qk_rope_head_dim (for MLA absorption)
glm4moelite.attention.value_length_mla -- equals kv_lora_rank
glm4moelite.expert_gating_func -- hardcoded to 2 (sigmoid)
glm4moelite.rope.dimension_count -- equals qk_rope_head_dim
glm4moelite.rope.freq_base -- defaults to 1000000.0
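
The derived values above follow directly from a handful of config fields. The sketch below shows the arithmetic only; the mlaConfig struct, its field names, and the attentionKV helper are hypothetical, and the numbers in main are illustrative rather than taken from a real checkpoint.

```go
package main

import "fmt"

// mlaConfig is a hypothetical holder for the config values referenced above.
type mlaConfig struct {
	QKNopeHeadDim uint32
	QKRopeHeadDim uint32
	VHeadDim      uint32
	KVLoraRank    uint32
	QLoraRank     uint32
	RopeTheta     float32 // 0 means "not set in the source config"
}

// attentionKV derives the glm4moelite.* metadata entries described above.
func attentionKV(c mlaConfig) map[string]any {
	freqBase := c.RopeTheta
	if freqBase == 0 {
		freqBase = 1000000.0 // documented default
	}
	return map[string]any{
		"glm4moelite.attention.key_length":       c.QKNopeHeadDim + c.QKRopeHeadDim,
		"glm4moelite.attention.kv_lora_rank":     c.KVLoraRank,
		"glm4moelite.attention.q_lora_rank":      c.QLoraRank,
		"glm4moelite.attention.value_length":     c.VHeadDim,
		"glm4moelite.attention.key_length_mla":   c.KVLoraRank + c.QKRopeHeadDim,
		"glm4moelite.attention.value_length_mla": c.KVLoraRank,
		"glm4moelite.expert_gating_func":         uint32(2), // sigmoid
		"glm4moelite.rope.dimension_count":       c.QKRopeHeadDim,
		"glm4moelite.rope.freq_base":             freqBase,
	}
}

func main() {
	// Illustrative values only.
	kv := attentionKV(mlaConfig{
		QKNopeHeadDim: 128, QKRopeHeadDim: 64, VHeadDim: 128,
		KVLoraRank: 512, QLoraRank: 768,
	})
	fmt.Println(kv["glm4moelite.attention.key_length"])     // 128 + 64
	fmt.Println(kv["glm4moelite.attention.key_length_mla"]) // 512 + 64
}
```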

Special Handling

MLA KV_B Tensor Splitting

The combined attn_kv_b.weight tensor is split into separate attn_k_b.weight and attn_v_b.weight tensors for MLA absorption. The splitting logic:

Detects the layout by checking which dimension matches kv_lora_rank
Reshapes to [n_head, qk_nope + v_head, kv_lora_rank]
Slices K portion: [n_head, :qk_nope, :] then transposes to [n_head, kv_lora_rank, qk_nope]
Slices V portion: [n_head, qk_nope:, :] keeping layout as [n_head, v_head, kv_lora_rank]

Expert Tensor Merging

Per-expert weight tensors are merged into a single stacked tensor for each of the gate, up, and down projections.

Multi-Token Prediction Layer Skipping

Tensors that belong to layers with an index at or beyond num_hidden_layers (the multi-token prediction layers) are filtered out during conversion.
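
A filter of this kind might look like the sketch below. It assumes the check runs on HuggingFace-style names of the form model.layers.N.*, and keepTensor is a hypothetical helper; the converter may apply the filter at a different stage or on different names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// layerRe extracts the layer index from names such as "model.layers.12.mlp...".
var layerRe = regexp.MustCompile(`^model\.layers\.(\d+)\.`)

// keepTensor (hypothetical) reports whether a tensor should be converted.
// Tensors outside any layer are kept; layer tensors are kept only when
// their index is below numHiddenLayers.
func keepTensor(name string, numHiddenLayers int) bool {
	m := layerRe.FindStringSubmatch(name)
	if m == nil {
		return true // not a per-layer tensor
	}
	idx, err := strconv.Atoi(m[1])
	if err != nil {
		return true // unparsable index: leave the decision to later stages
	}
	return idx < numHiddenLayers
}

func main() {
	const numHiddenLayers = 4 // illustrative value
	for _, name := range []string{
		"model.embed_tokens.weight",
		"model.layers.3.self_attn.o_proj.weight",
		"model.layers.4.self_attn.o_proj.weight",
	} {
		fmt.Println(name, keepTensor(name, numHiddenLayers))
	}
}
```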

Tokenizer

The tokenizer pre-processor is set to glm4.

Implementation Notes

The conversion is implemented in convert/convert_glm4moelite.go via the glm4MoeLiteModel struct. The repackKVB method creates repackers that handle both K and V extraction with automatic layout detection based on tensor shape. The converter handles both [kv_lora_rank, n_head*(qk_nope+v_head)] and transposed layouts.

Related Pages

Implementation:Ollama_Ollama_Convert_Glm4MoeLite
