Principle:Ollama Ollama GGUF Model Conversion Glm4MoeLite

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, MoE
Last Updated: 2025-02-15 00:00 GMT

Overview

GLM-4-MoE-Lite conversion handles the ChatGLM-4 Mixture-of-Experts architecture with Multi-head Latent Attention (MLA). It transforms the model from HuggingFace SafeTensors to GGUF format and prepares it for MLA absorption by splitting the combined KV_B tensor into separate K and V components, transposing dimensions where required.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.norm -> output_norm
  • model.layers -> blk
  • self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
  • self_attn.kv_a_layernorm -> attn_kv_a_norm
  • self_attn.kv_b_proj -> attn_kv_b
  • self_attn.q_a_proj -> attn_q_a
  • self_attn.q_a_layernorm -> attn_q_a_norm
  • self_attn.q_b_proj -> attn_q_b
  • self_attn.o_proj -> attn_output
  • mlp.shared_experts.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
  • mlp.gate.e_score_correction_bias -> exp_probs_b.bias
  • mlp.gate -> ffn_gate_inp
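
The mapping above can be sketched with Go's strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names tensorNameReplacer and renameTensor are hypothetical, and the real converter may list additional replacements. Because strings.NewReplacer compares old strings in argument order, the more specific mlp.gate.e_score_correction_bias entry has to come before mlp.gate.

```go
package main

import (
	"fmt"
	"strings"
)

// tensorNameReplacer is a hypothetical helper built only from the pairs
// listed on this page. Order matters: "mlp.gate.e_score_correction_bias"
// must precede "mlp.gate" so the longer name is matched first.
var tensorNameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// renameTensor converts a HuggingFace tensor name to its GGUF name.
func renameTensor(name string) string {
	return tensorNameReplacer.Replace(name)
}

func main() {
	fmt.Println(renameTensor("model.layers.3.self_attn.kv_b_proj.weight"))
	fmt.Println(renameTensor("model.layers.3.mlp.gate.e_score_correction_bias"))
	fmt.Println(renameTensor("model.layers.3.mlp.gate.weight"))
}
```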

Architecture-Specific Hyperparameters

The GGUF metadata is written under the glm4moelite.* namespace:

  • glm4moelite.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
  • glm4moelite.attention.kv_lora_rank -- KV LoRA rank for MLA
  • glm4moelite.attention.q_lora_rank -- Q LoRA rank
  • glm4moelite.attention.value_length -- V head dimension
  • glm4moelite.attention.key_length_mla -- kv_lora_rank + qk_rope_head_dim (for MLA absorption)
  • glm4moelite.attention.value_length_mla -- equals kv_lora_rank
  • glm4moelite.expert_gating_func -- hardcoded to 2 (sigmoid)
  • glm4moelite.rope.dimension_count -- equals qk_rope_head_dim
  • glm4moelite.rope.freq_base -- defaults to 1000000.0
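
The derived values in this list follow directly from a handful of config fields. The sketch below shows the arithmetic under stated assumptions: the mlaParams struct and derivedKV function are hypothetical, the Go types chosen for each key are a guess, and the numbers in main are illustrative rather than taken from a real checkpoint.

```go
package main

import "fmt"

// mlaParams is a hypothetical container for the config fields named above.
type mlaParams struct {
	QKNopeHeadDim uint32
	QKRopeHeadDim uint32
	VHeadDim      uint32
	KVLoraRank    uint32
	QLoraRank     uint32
	RopeTheta     float32 // zero means "not set in the config"
}

// derivedKV computes the glm4moelite.* metadata described in the list.
func derivedKV(p mlaParams) map[string]any {
	freqBase := p.RopeTheta
	if freqBase == 0 {
		freqBase = 1000000.0 // documented default
	}
	return map[string]any{
		"glm4moelite.attention.key_length":       p.QKNopeHeadDim + p.QKRopeHeadDim,
		"glm4moelite.attention.kv_lora_rank":     p.KVLoraRank,
		"glm4moelite.attention.q_lora_rank":      p.QLoraRank,
		"glm4moelite.attention.value_length":     p.VHeadDim,
		"glm4moelite.attention.key_length_mla":   p.KVLoraRank + p.QKRopeHeadDim,
		"glm4moelite.attention.value_length_mla": p.KVLoraRank,
		"glm4moelite.expert_gating_func":         uint32(2), // sigmoid
		"glm4moelite.rope.dimension_count":       p.QKRopeHeadDim,
		"glm4moelite.rope.freq_base":             freqBase,
	}
}

func main() {
	// Illustrative values only.
	kv := derivedKV(mlaParams{
		QKNopeHeadDim: 128, QKRopeHeadDim: 64, VHeadDim: 128,
		KVLoraRank: 512, QLoraRank: 768,
	})
	fmt.Println(kv["glm4moelite.attention.key_length"])     // 128 + 64
	fmt.Println(kv["glm4moelite.attention.key_length_mla"]) // 512 + 64
}
```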

Special Handling

MLA KV_B Tensor Splitting

The combined attn_kv_b.weight tensor is split into separate attn_k_b.weight and attn_v_b.weight tensors for MLA absorption. In the steps below, qk_nope and v_head abbreviate qk_nope_head_dim and the V head dimension. The splitting logic:

  1. Detects the layout by checking which dimension matches kv_lora_rank
  2. Reshapes to [n_head, qk_nope + v_head, kv_lora_rank]
  3. Slices K portion: [n_head, :qk_nope, :] then transposes to [n_head, kv_lora_rank, qk_nope]
  4. Slices V portion: [n_head, qk_nope:, :] keeping layout as [n_head, v_head, kv_lora_rank]

Expert Tensor Merging

Individual expert tensors are merged into stacked tensors for gate, up, and down projections.

Multi-Token Prediction Layer Skipping

Layers beyond num_hidden_layers, which carry the multi-token prediction (MTP) weights, are filtered out during conversion.
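
One way such a filter could look is shown below. It assumes zero-based layer indices embedded in the tensor name (so the first extra layer has index num_hidden_layers); layerIndex and keepTensor are hypothetical names, not functions from the converter.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// layerIndex extracts the layer number from names such as
// "model.layers.12.mlp.gate.weight" or "blk.12.ffn_gate_inp.weight".
// The second result is false for tensors that do not belong to a layer.
func layerIndex(name string) (int, bool) {
	parts := strings.Split(name, ".")
	for i := 0; i+1 < len(parts); i++ {
		if parts[i] == "layers" || parts[i] == "blk" {
			if n, err := strconv.Atoi(parts[i+1]); err == nil {
				return n, true
			}
		}
	}
	return 0, false
}

// keepTensor reports whether a tensor survives the filter: non-layer tensors
// are always kept, layer tensors only if their index is below numHiddenLayers.
func keepTensor(name string, numHiddenLayers int) bool {
	idx, ok := layerIndex(name)
	return !ok || idx < numHiddenLayers
}

func main() {
	for _, name := range []string{
		"model.layers.46.mlp.gate.weight",
		"model.layers.47.mlp.gate.weight",
		"lm_head.weight",
	} {
		fmt.Println(name, keepTensor(name, 47))
	}
}
```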

Tokenizer

The tokenizer pre-processor is set to glm4.

Implementation Notes

The conversion is implemented in convert/convert_glm4moelite.go via the glm4MoeLiteModel struct. The repackKVB method creates repackers that extract the K and V tensors, detecting the layout automatically from the tensor shape. Both the [kv_lora_rank, n_head*(qk_nope+v_head)] layout and its transpose, [n_head*(qk_nope+v_head), kv_lora_rank], are handled.
