Principle:Ollama Ollama GGUF Model Conversion DeepSeek2

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, DeepSeek
Last Updated: 2025-02-15 00:00 GMT

Overview

DeepSeek2 model conversion handles the transformation of DeepSeek-V2/V3 architecture models from HuggingFace SafeTensors to GGUF format, with particular attention to Multi-head Latent Attention (MLA) projections, Mixture-of-Experts (MoE) routing with expert tensor merging, and YaRN RoPE scaling parameters.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.norm -> output_norm
  • language_model. -> (stripped)
  • model.layers -> blk
  • input_layernorm -> attn_norm
  • self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
  • self_attn.kv_a_layernorm -> attn_kv_a_norm
  • self_attn.kv_b_proj -> attn_kv_b
  • self_attn.q_a_proj -> attn_q_a
  • self_attn.q_a_layernorm -> attn_q_a_norm
  • self_attn.q_b_proj -> attn_q_b
  • self_attn.o_proj -> attn_output
  • post_attention_layernorm -> ffn_norm
  • mlp.shared_experts.down_proj -> ffn_down_shexp
  • mlp.shared_experts.gate_proj -> ffn_gate_shexp
  • mlp.shared_experts.up_proj -> ffn_up_shexp
  • mlp.gate.e_score_correction_bias -> exp_probs_b.bias
  • mlp.gate -> ffn_gate_inp
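
The replacements above can be sketched with Go's standard strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names tensorNameReplacer and renameTensor are hypothetical. It relies on the documented behavior that old strings are compared in argument order, which is why mlp.gate.e_score_correction_bias has to be listed before the shorter mlp.gate.

```go
package main

import (
	"fmt"
	"strings"
)

// tensorNameReplacer is a hypothetical helper holding the pairs listed above.
// strings.NewReplacer compares old strings in argument order, so the longer
// "mlp.gate.e_score_correction_bias" must precede its prefix "mlp.gate".
var tensorNameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"language_model.", "",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"post_attention_layernorm", "ffn_norm",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// renameTensor maps a HuggingFace tensor name to its GGUF counterpart.
func renameTensor(name string) string {
	return tensorNameReplacer.Replace(name)
}

func main() {
	for _, name := range []string{
		"model.layers.3.self_attn.kv_a_proj_with_mqa.weight",
		"model.layers.5.mlp.gate.e_score_correction_bias",
		"model.layers.5.mlp.gate.weight",
		"lm_head.weight",
	} {
		fmt.Println(name, "->", renameTensor(name))
	}
}
```

If the two mlp.gate entries were swapped, the bias tensor would be renamed through the shorter match and end up with the wrong GGUF name, so the ordering is part of the mapping rather than a stylistic choice.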

Architecture-Specific Hyperparameters

The GGUF metadata is written under the deepseek2.* namespace:

  • deepseek2.block_count -- number of hidden layers
  • deepseek2.attention.head_count / head_count_kv -- Q and KV head counts
  • deepseek2.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
  • deepseek2.attention.kv_lora_rank -- KV LoRA rank for MLA compression
  • deepseek2.attention.q_lora_rank -- Q LoRA rank
  • deepseek2.attention.value_length -- V head dimension
  • deepseek2.expert_count / expert_used_count / expert_shared_count -- total routed experts, experts activated per token, and shared experts
  • deepseek2.expert_gating_func -- 1 for softmax, 2 for sigmoid
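  • deepseek2.expert_weights_norm / expert_weights_scale -- whether the selected experts' routing weights are normalized, and the scaling factor applied to them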
  • deepseek2.leading_dense_block_count -- number of initial dense (non-MoE) layers
  • deepseek2.rope.dimension_count -- equals qk_rope_head_dim
  • deepseek2.rope.freq_base -- defaults to 10000.0
  • deepseek2.rope.scaling.* -- YaRN scaling parameters including yarn_log_multiplier
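
A few of the derived values above can be sketched as follows. This is an assumption-laden illustration: the hfConfig struct, its field names, and the metadata and gatingFunc helpers are hypothetical, the string values "softmax" and "sigmoid" are assumed to be how the source configuration names the gating function, and the numbers used in main are illustrative only.

```go
package main

import "fmt"

// hfConfig is a hypothetical subset of the source model configuration.
type hfConfig struct {
	NumHiddenLayers uint32
	QKNopeHeadDim   uint32
	QKRopeHeadDim   uint32
	VHeadDim        uint32
	KVLoraRank      uint32
	QLoraRank       uint32
	ScoringFunc     string // assumed to be "softmax" or "sigmoid"
}

// gatingFunc encodes the gating function as described above:
// 1 for softmax, 2 for sigmoid. Returning 0 for anything else is this
// sketch's own convention, not something the document specifies.
func gatingFunc(scoring string) uint32 {
	switch scoring {
	case "softmax":
		return 1
	case "sigmoid":
		return 2
	}
	return 0
}

// metadata builds a few deepseek2.* keys from the configuration.
func metadata(c hfConfig) map[string]any {
	return map[string]any{
		"deepseek2.block_count":            c.NumHiddenLayers,
		"deepseek2.attention.key_length":   c.QKNopeHeadDim + c.QKRopeHeadDim,
		"deepseek2.attention.value_length": c.VHeadDim,
		"deepseek2.attention.kv_lora_rank": c.KVLoraRank,
		"deepseek2.attention.q_lora_rank":  c.QLoraRank,
		"deepseek2.rope.dimension_count":   c.QKRopeHeadDim,
		"deepseek2.expert_gating_func":     gatingFunc(c.ScoringFunc),
	}
}

func main() {
	// Illustrative values only.
	kv := metadata(hfConfig{
		NumHiddenLayers: 27,
		QKNopeHeadDim:   128,
		QKRopeHeadDim:   64,
		VHeadDim:        128,
		KVLoraRank:      512,
		QLoraRank:       1536,
		ScoringFunc:     "softmax",
	})
	fmt.Println(kv["deepseek2.attention.key_length"])
}
```

Note that the key length and the RoPE dimension count differ: only the qk_rope_head_dim portion of each key is rotated, while the stored key length covers both the rotated and non-rotated parts.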

Special Handling

Expert Tensor Merging

Individual expert weight tensors (pattern: blk.N.mlp.experts.*.{gate,up,down}_proj.weight) are merged into stacked tensors (blk.N.ffn_{gate,up,down}_exps.weight). Each MoE layer therefore ends up with three merged tensors, one each for the gate, up, and down projections, in place of a separate set of three tensors per expert.
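
The grouping step can be sketched as below. This is not the mergeTensors utility itself, whose signature is not shown here; the tensor type, the mergeExperts function, and the use of a regular expression instead of glob patterns are all assumptions made for illustration. Expert data is concatenated in numeric expert order so that expert i lands at offset i of the stacked tensor.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// expertRe matches per-expert weights after the layer prefix has been renamed.
var expertRe = regexp.MustCompile(`^blk\.(\d+)\.mlp\.experts\.(\d+)\.(gate|up|down)_proj\.weight$`)

// tensor is a hypothetical stand-in for a named block of weights.
type tensor struct {
	Name string
	Data []float32
}

// mergeExperts stacks per-expert tensors into one tensor per layer and
// projection. Non-expert tensors are passed through unchanged, followed by
// the merged tensors in name order.
func mergeExperts(in []tensor) []tensor {
	type member struct {
		idx  int
		data []float32
	}
	groups := map[string][]member{}
	var out []tensor
	for _, t := range in {
		m := expertRe.FindStringSubmatch(t.Name)
		if m == nil {
			out = append(out, t)
			continue
		}
		idx, _ := strconv.Atoi(m[2]) // m[2] is all digits, so this cannot fail on syntax
		key := "blk." + m[1] + ".ffn_" + m[3] + "_exps.weight"
		groups[key] = append(groups[key], member{idx, t.Data})
	}
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		ms := groups[k]
		// Sort numerically: a string sort would place expert 10 before expert 2.
		sort.Slice(ms, func(i, j int) bool { return ms[i].idx < ms[j].idx })
		var data []float32
		for _, m := range ms {
			data = append(data, m.data...)
		}
		out = append(out, tensor{Name: k, Data: data})
	}
	return out
}

func main() {
	merged := mergeExperts([]tensor{
		{"blk.3.mlp.experts.1.gate_proj.weight", []float32{3, 4}},
		{"blk.3.mlp.experts.0.gate_proj.weight", []float32{1, 2}},
		{"blk.3.attn_output.weight", []float32{9}},
	})
	for _, t := range merged {
		fmt.Println(t.Name, t.Data)
	}
}
```

A real converter would also need to record the new leading expert dimension in the merged tensor's shape; the flat slice here only demonstrates the ordering.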

Multi-Token Prediction Layer Skipping

Layers with block indices >= num_hidden_layers are skipped during conversion, as they represent Multi-Token Prediction heads not needed for standard inference.
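
A filter of this kind might look like the sketch below. The function name skipLayer and the assumption that it runs on HuggingFace names beginning with model.layers.N. are illustrative; the layer count of 61 in main is an example value, not taken from this page.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// layerRe captures the block index from names such as "model.layers.61.foo".
var layerRe = regexp.MustCompile(`^model\.layers\.(\d+)\.`)

// skipLayer reports whether a tensor belongs to a block at or beyond the
// declared layer count. Tensors outside any block are never skipped.
func skipLayer(name string, numHiddenLayers int) bool {
	m := layerRe.FindStringSubmatch(name)
	if m == nil {
		return false
	}
	idx, err := strconv.Atoi(m[1])
	if err != nil {
		return false
	}
	return idx >= numHiddenLayers
}

func main() {
	const numHiddenLayers = 61 // example value
	for _, name := range []string{
		"model.layers.60.mlp.gate.weight",
		"model.layers.61.mlp.gate.weight",
		"model.embed_tokens.weight",
	} {
		fmt.Println(name, skipLayer(name, numHiddenLayers))
	}
}
```

Comparing the parsed integer rather than the string keeps the check correct when the layer count crosses a digit boundary, for example 9 versus 10.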

Tokenizer

The tokenizer pre-processor is set to deepseek-v3.

Implementation Notes

The conversion is implemented in convert/convert_deepseek2.go via the deepseek2Model struct. The expert merging logic uses the mergeTensors utility with glob-style patterns to match and stack individual expert tensors. A regex-based layer-skipping function filters out the Multi-Token Prediction layers that lie beyond the declared layer count.
