Principle:Ollama Ollama GGUF Model Conversion Glm4MoeLite
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, MoE |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
GLM-4-MoE-Lite conversion handles the ChatGLM-4 Mixture-of-Experts architecture with Multi-head Latent Attention (MLA). It transforms the model from HuggingFace SafeTensors to GGUF format. To support MLA absorption, it also splits the combined KV_B tensor into separate K and V components, transposing dimensions where the target layout requires it.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.norm -> output_norm
- model.layers -> blk
- self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
- self_attn.kv_a_layernorm -> attn_kv_a_norm
- self_attn.kv_b_proj -> attn_kv_b
- self_attn.q_a_proj -> attn_q_a
- self_attn.q_a_layernorm -> attn_q_a_norm
- self_attn.q_b_proj -> attn_q_b
- self_attn.o_proj -> attn_output
- mlp.shared_experts.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
- mlp.gate.e_score_correction_bias -> exp_probs_b.bias
- mlp.gate -> ffn_gate_inp
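The mapping above can be sketched with Go's standard strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names nameReplacer and ggufName are hypothetical. One detail worth noting is ordering: mlp.gate is a prefix of mlp.gate.e_score_correction_bias, so the longer pattern is listed first.

```go
package main

import (
	"fmt"
	"strings"
)

// nameReplacer is an illustrative stand-in for the converter's replacement
// table (hypothetical name). strings.NewReplacer compares old strings in
// argument order, so the more specific "mlp.gate.e_score_correction_bias"
// must appear before its prefix "mlp.gate".
var nameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// ggufName rewrites a HuggingFace tensor name into its GGUF form.
func ggufName(hfName string) string {
	return nameReplacer.Replace(hfName)
}

func main() {
	for _, name := range []string{
		"model.layers.3.self_attn.kv_b_proj.weight",
		"model.layers.1.mlp.gate.e_score_correction_bias",
		"model.layers.1.mlp.gate.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", name, ggufName(name))
	}
}
```

If the two mlp.gate entries were swapped, the bias tensor would be renamed with the shorter pattern and end up with a wrong name, which is why the table order matters in this sketch.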
Architecture-Specific Hyperparameters
The GGUF metadata is written under the glm4moelite.* namespace:
- glm4moelite.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
- glm4moelite.attention.kv_lora_rank -- KV LoRA rank for MLA
- glm4moelite.attention.q_lora_rank -- Q LoRA rank
- glm4moelite.attention.value_length -- V head dimension
- glm4moelite.attention.key_length_mla -- kv_lora_rank + qk_rope_head_dim (for MLA absorption)
- glm4moelite.attention.value_length_mla -- equals kv_lora_rank
- glm4moelite.expert_gating_func -- hardcoded to 2 (sigmoid)
- glm4moelite.rope.dimension_count -- equals qk_rope_head_dim
- glm4moelite.rope.freq_base -- defaults to 1000000.0
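To make the derived values concrete, here is a minimal Go sketch that computes the keys listed above from a small config struct. The struct, its field names, and the metadata function are assumptions made for illustration, and the dimensions used in main are made-up example numbers rather than values from any released checkpoint.

```go
package main

import "fmt"

// mlaConfig holds only the values needed for this sketch.
// The type and its field names are hypothetical.
type mlaConfig struct {
	QKNopeHeadDim uint32
	QKRopeHeadDim uint32
	KVLoraRank    uint32
	QLoraRank     uint32
	VHeadDim      uint32
	RopeTheta     float32 // 0 is treated as "not set"
}

// metadata derives the glm4moelite.* key/value pairs described in the text.
func metadata(c mlaConfig) map[string]any {
	freqBase := c.RopeTheta
	if freqBase == 0 {
		freqBase = 1000000.0 // documented default
	}
	return map[string]any{
		"glm4moelite.attention.key_length":       c.QKNopeHeadDim + c.QKRopeHeadDim,
		"glm4moelite.attention.kv_lora_rank":     c.KVLoraRank,
		"glm4moelite.attention.q_lora_rank":      c.QLoraRank,
		"glm4moelite.attention.value_length":     c.VHeadDim,
		"glm4moelite.attention.key_length_mla":   c.KVLoraRank + c.QKRopeHeadDim,
		"glm4moelite.attention.value_length_mla": c.KVLoraRank,
		"glm4moelite.expert_gating_func":         uint32(2), // sigmoid
		"glm4moelite.rope.dimension_count":       c.QKRopeHeadDim,
		"glm4moelite.rope.freq_base":             freqBase,
	}
}

func main() {
	// Example dimensions only; not taken from a real model config.
	kv := metadata(mlaConfig{
		QKNopeHeadDim: 128,
		QKRopeHeadDim: 64,
		KVLoraRank:    512,
		QLoraRank:     768,
		VHeadDim:      128,
	})
	fmt.Println(kv["glm4moelite.attention.key_length"])     // 192
	fmt.Println(kv["glm4moelite.attention.key_length_mla"]) // 576
}
```

With these example numbers the regular key length is 128 + 64 = 192, while the MLA key length is 512 + 64 = 576, since the absorbed form operates on the compressed latent plus the RoPE part.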
Special Handling
MLA KV_B Tensor Splitting
The combined attn_kv_b.weight tensor is split into separate attn_k_b.weight and attn_v_b.weight tensors for MLA absorption. The splitting logic:
- Detects the layout by checking which dimension matches kv_lora_rank
- Reshapes to [n_head, qk_nope + v_head, kv_lora_rank]
- Slices K portion: [n_head, :qk_nope, :] then transposes to [n_head, kv_lora_rank, qk_nope]
- Slices V portion: [n_head, qk_nope:, :] keeping layout as [n_head, v_head, kv_lora_rank]
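The splitting steps above can be sketched on a flat, row-major float32 slice. This is an illustration under stated assumptions rather than the converter's repacker: splitKVB and its parameters are hypothetical names, the input is assumed to be a 2-D tensor, and when both dimensions happen to equal kv_lora_rank the sketch arbitrarily treats the last dimension as the rank.

```go
package main

import (
	"errors"
	"fmt"
)

// splitKVB splits a row-major 2-D kv_b weight into K and V parts.
// It returns K as [n_head, kv_lora_rank, qk_nope] and
// V as [n_head, v_head, kv_lora_rank], both flattened row-major.
func splitKVB(data []float32, shape [2]int, nHead, qkNope, vHead, kvLoraRank int) (k, v []float32, err error) {
	perHead := qkNope + vHead
	if len(data) != shape[0]*shape[1] || len(data) != nHead*perHead*kvLoraRank {
		return nil, nil, errors.New("kv_b: unexpected element count")
	}

	// at reads W[h][d][r] from the logical
	// [n_head, qk_nope+v_head, kv_lora_rank] view of the input.
	var at func(h, d, r int) float32
	switch {
	case shape[1] == kvLoraRank: // [n_head*(qk_nope+v_head), kv_lora_rank]
		at = func(h, d, r int) float32 { return data[(h*perHead+d)*kvLoraRank+r] }
	case shape[0] == kvLoraRank: // [kv_lora_rank, n_head*(qk_nope+v_head)]
		at = func(h, d, r int) float32 { return data[r*shape[1]+h*perHead+d] }
	default:
		return nil, nil, errors.New("kv_b: no dimension matches kv_lora_rank")
	}

	k = make([]float32, nHead*kvLoraRank*qkNope)
	v = make([]float32, nHead*vHead*kvLoraRank)
	for h := 0; h < nHead; h++ {
		for r := 0; r < kvLoraRank; r++ {
			for i := 0; i < qkNope; i++ {
				// K is transposed: [h][i][r] -> [h][r][i].
				k[(h*kvLoraRank+r)*qkNope+i] = at(h, i, r)
			}
			for j := 0; j < vHead; j++ {
				// V keeps the [h][j][r] layout.
				v[(h*vHead+j)*kvLoraRank+r] = at(h, qkNope+j, r)
			}
		}
	}
	return k, v, nil
}

func main() {
	// Tiny example: n_head=2, qk_nope=2, v_head=1, kv_lora_rank=3.
	data := make([]float32, 18)
	for i := range data {
		data[i] = float32(i)
	}
	k, v, err := splitKVB(data, [2]int{6, 3}, 2, 2, 1, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("K:", k)
	fmt.Println("V:", v)
}
```

Reading through an index function instead of materializing a transposed copy lets both input layouts share the same slicing loop, which is one simple way to express "automatic layout detection" in a sketch like this.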
Expert Tensor Merging
Individual expert tensors are merged into stacked tensors for gate, up, and down projections.
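As a rough illustration of what stacking means here, the following Go sketch concatenates per-expert tensors of identical shape into one tensor with a new leading n_expert dimension. The tensor type and the stackExperts function are hypothetical, and the source does not state the merged tensor names, so none are assumed.

```go
package main

import "fmt"

// tensor is a minimal row-major tensor used only for this sketch.
type tensor struct {
	shape []int
	data  []float32
}

// sameShape reports whether two shapes are identical.
func sameShape(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// stackExperts merges per-expert tensors, given in expert-index order,
// into one tensor of shape [n_expert, ...expertShape].
func stackExperts(experts []tensor) (tensor, error) {
	if len(experts) == 0 {
		return tensor{}, fmt.Errorf("no expert tensors to stack")
	}
	first := experts[0]
	out := tensor{shape: append([]int{len(experts)}, first.shape...)}
	for i, e := range experts {
		if !sameShape(e.shape, first.shape) {
			return tensor{}, fmt.Errorf("expert %d: shape %v differs from %v", i, e.shape, first.shape)
		}
		// Row-major stacking along a new leading axis is plain concatenation.
		out.data = append(out.data, e.data...)
	}
	return out, nil
}

func main() {
	gate0 := tensor{shape: []int{1, 2}, data: []float32{1, 2}}
	gate1 := tensor{shape: []int{1, 2}, data: []float32{3, 4}}
	merged, err := stackExperts([]tensor{gate0, gate1})
	if err != nil {
		panic(err)
	}
	fmt.Println(merged.shape, merged.data)
}
```

The same routine would be applied once per projection (gate, up, down) in each layer, so that every MoE layer ends up with three stacked tensors instead of three tensors per expert.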
Multi-Token Prediction Layer Skipping
Layers beyond num_hidden_layers are filtered out during conversion.
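A filter of this kind might look like the Go sketch below. It assumes zero-based layer indices in HuggingFace-style names of the form model.layers.N.<rest>, so "beyond num_hidden_layers" is read as N >= num_hidden_layers. The helper names layerIndex and keepTensor are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const layerPrefix = "model.layers."

// layerIndex extracts N from names of the form "model.layers.N.<rest>".
// The second result is false for tensors that do not belong to a layer.
func layerIndex(name string) (int, bool) {
	if !strings.HasPrefix(name, layerPrefix) {
		return 0, false
	}
	rest := strings.TrimPrefix(name, layerPrefix)
	dot := strings.IndexByte(rest, '.')
	if dot < 0 {
		return 0, false
	}
	n, err := strconv.Atoi(rest[:dot])
	if err != nil {
		return 0, false
	}
	return n, true
}

// keepTensor reports whether a tensor should survive conversion.
// Non-layer tensors are always kept; layer tensors are kept only when
// their index is below numHiddenLayers (assumed zero-based).
func keepTensor(name string, numHiddenLayers int) bool {
	idx, ok := layerIndex(name)
	return !ok || idx < numHiddenLayers
}

func main() {
	names := []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.o_proj.weight",
		"model.layers.1.self_attn.o_proj.weight",
		"model.layers.2.self_attn.o_proj.weight", // extra prediction layer in this example
	}
	for _, name := range names {
		fmt.Printf("%-45s keep=%v\n", name, keepTensor(name, 2))
	}
}
```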
Tokenizer
The tokenizer pre-processor is set to glm4.
Implementation Notes
The conversion is implemented in convert/convert_glm4moelite.go via the glm4MoeLiteModel struct. The repackKVB method creates repackers that handle both K and V extraction with automatic layout detection based on tensor shape. The converter handles both [kv_lora_rank, n_head*(qk_nope+v_head)] and transposed layouts.