Principle:Ollama Ollama GGUF Model Conversion DeepSeek2
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, DeepSeek |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
DeepSeek2 model conversion handles the transformation of DeepSeek-V2/V3 architecture models from HuggingFace SafeTensors to GGUF format, with particular attention to Multi-head Latent Attention (MLA) projections, Mixture-of-Experts (MoE) routing with expert tensor merging, and YaRN RoPE scaling parameters.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.norm -> output_norm
- language_model. -> (stripped)
- model.layers -> blk
- input_layernorm -> attn_norm
- self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
- self_attn.kv_a_layernorm -> attn_kv_a_norm
- self_attn.kv_b_proj -> attn_kv_b
- self_attn.q_a_proj -> attn_q_a
- self_attn.q_a_layernorm -> attn_q_a_norm
- self_attn.q_b_proj -> attn_q_b
- self_attn.o_proj -> attn_output
- post_attention_layernorm -> ffn_norm
- mlp.shared_experts.down_proj -> ffn_down_shexp
- mlp.shared_experts.gate_proj -> ffn_gate_shexp
- mlp.shared_experts.up_proj -> ffn_up_shexp
- mlp.gate.e_score_correction_bias -> exp_probs_b.bias
- mlp.gate -> ffn_gate_inp
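The replacements above can be sketched with Go's standard strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names renamer and renameTensor are hypothetical, and the ordering of the pairs is an assumption made so that the longer key mlp.gate.e_score_correction_bias is tried before its prefix mlp.gate.

```go
package main

import (
	"fmt"
	"strings"
)

// renamer is a hypothetical helper built from the mapping listed above.
// strings.Replacer compares old strings in argument order at each position,
// so mlp.gate.e_score_correction_bias is listed before mlp.gate.
var renamer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"language_model.", "",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"post_attention_layernorm", "ffn_norm",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// renameTensor maps a HuggingFace tensor name to its GGUF name.
func renameTensor(name string) string {
	return renamer.Replace(name)
}

func main() {
	for _, n := range []string{
		"model.layers.3.self_attn.kv_a_proj_with_mqa.weight",
		"model.layers.5.mlp.gate.e_score_correction_bias",
		"model.layers.5.mlp.gate.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", n, renameTensor(n))
	}
}
```

Per-expert tensors such as model.layers.N.mlp.experts.M.gate_proj.weight only pick up the blk prefix at this stage; they are handled separately by the expert merging step.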
Architecture-Specific Hyperparameters
The GGUF metadata is written under the deepseek2.* namespace:
- deepseek2.block_count -- number of hidden layers
- deepseek2.attention.head_count / head_count_kv -- Q and KV head counts
- deepseek2.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
- deepseek2.attention.kv_lora_rank -- KV LoRA rank for MLA compression
- deepseek2.attention.q_lora_rank -- Q LoRA rank
- deepseek2.attention.value_length -- V head dimension
- deepseek2.expert_count / expert_used_count / expert_shared_count
- deepseek2.expert_gating_func -- 1 for softmax, 2 for sigmoid
- deepseek2.expert_weights_norm / expert_weights_scale
- deepseek2.leading_dense_block_count -- number of initial dense (non-MoE) layers
- deepseek2.rope.dimension_count -- equals qk_rope_head_dim
- deepseek2.rope.freq_base -- defaults to 10000.0
- deepseek2.rope.scaling.* -- YaRN scaling parameters including yarn_log_multiplier
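A few of these keys are derived rather than copied. The sketch below shows the arithmetic for key_length and rope.dimension_count and the softmax/sigmoid encoding of expert_gating_func. The attnConfig struct, the function names, and the sample dimensions in main are illustrative assumptions, not the converter's own types or values.

```go
package main

import "fmt"

// attnConfig is a hypothetical subset of the model configuration.
type attnConfig struct {
	QKNopeHeadDim uint32 // non-rotary part of each Q/K head
	QKRopeHeadDim uint32 // rotary part of each Q/K head
	VHeadDim      uint32 // V head dimension
}

// derivedKV computes the derived attention and RoPE metadata values.
func derivedKV(c attnConfig) map[string]uint32 {
	return map[string]uint32{
		"deepseek2.attention.key_length":   c.QKNopeHeadDim + c.QKRopeHeadDim,
		"deepseek2.attention.value_length": c.VHeadDim,
		"deepseek2.rope.dimension_count":   c.QKRopeHeadDim,
	}
}

// gatingFunc encodes the gating function name as described above:
// 1 for softmax, 2 for sigmoid. Rejecting other names is a choice
// made for this sketch only.
func gatingFunc(name string) (uint32, error) {
	switch name {
	case "softmax":
		return 1, nil
	case "sigmoid":
		return 2, nil
	}
	return 0, fmt.Errorf("unknown gating function %q", name)
}

func main() {
	// Sample dimensions chosen for illustration.
	kv := derivedKV(attnConfig{QKNopeHeadDim: 128, QKRopeHeadDim: 64, VHeadDim: 128})
	fmt.Println(kv)

	g, err := gatingFunc("sigmoid")
	fmt.Println(g, err)
}
```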
Special Handling
Expert Tensor Merging
Individual expert weight tensors (pattern: blk.N.mlp.experts.*.{gate,up,down}_proj.weight) are merged into stacked tensors (blk.N.ffn_{gate,up,down}_exps.weight). This yields three merged tensors per MoE layer, one each for the gate, up, and down projections.
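The merge can be pictured as follows. This is a simplified sketch and not the mergeTensors utility itself: the tensor struct, mergeExperts, and expertIndex are hypothetical names, tensor data is held as a flat float32 slice, and stacking along a new leading expert dimension in ascending expert order is an assumption. Glob matching uses the standard path.Match.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strconv"
	"strings"
)

// tensor is a hypothetical in-memory tensor with flattened data.
type tensor struct {
	Name  string
	Shape []int
	Data  []float32
}

// expertIndex extracts M from "blk.N.mlp.experts.M.<proj>_proj.weight".
// It returns -1 if the name does not have an integer in that position.
func expertIndex(name string) int {
	parts := strings.Split(name, ".")
	if len(parts) < 5 {
		return -1
	}
	idx, err := strconv.Atoi(parts[4])
	if err != nil {
		return -1
	}
	return idx
}

// mergeExperts stacks all expert tensors of one projection ("gate", "up",
// or "down") in one layer into a single tensor with a leading expert axis.
func mergeExperts(ts []tensor, layer int, proj string) (tensor, bool) {
	pattern := fmt.Sprintf("blk.%d.mlp.experts.*.%s_proj.weight", layer, proj)
	var matched []tensor
	for _, t := range ts {
		if ok, _ := path.Match(pattern, t.Name); ok {
			matched = append(matched, t)
		}
	}
	if len(matched) == 0 {
		return tensor{}, false
	}
	// Order by expert index so the stacked layout is deterministic.
	sort.Slice(matched, func(i, j int) bool {
		return expertIndex(matched[i].Name) < expertIndex(matched[j].Name)
	})
	merged := tensor{
		Name:  fmt.Sprintf("blk.%d.ffn_%s_exps.weight", layer, proj),
		Shape: append([]int{len(matched)}, matched[0].Shape...),
	}
	for _, t := range matched {
		merged.Data = append(merged.Data, t.Data...)
	}
	return merged, true
}

func main() {
	ts := []tensor{
		{Name: "blk.0.mlp.experts.1.gate_proj.weight", Shape: []int{2}, Data: []float32{3, 4}},
		{Name: "blk.0.mlp.experts.0.gate_proj.weight", Shape: []int{2}, Data: []float32{1, 2}},
	}
	if m, ok := mergeExperts(ts, 0, "gate"); ok {
		fmt.Println(m.Name, m.Shape, m.Data)
	}
}
```

Calling mergeExperts once each for gate, up, and down produces the three stacked tensors of a layer.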
Multi-Token Prediction Layer Skipping
Layers with block indices >= num_hidden_layers are skipped during conversion, as they represent Multi-Token Prediction heads not needed for standard inference.
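A regex-based filter for this rule might look like the sketch below. The function name skipLayer and the exact pattern are assumptions. It also assumes tensor names have already been mapped to the blk.N form.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// blockRe captures the block index N from names of the form "blk.N. ...".
var blockRe = regexp.MustCompile(`^blk\.(\d+)\.`)

// skipLayer reports whether a tensor belongs to a block whose index is
// >= numHiddenLayers. Names without a block index are never skipped.
func skipLayer(name string, numHiddenLayers int) bool {
	m := blockRe.FindStringSubmatch(name)
	if m == nil {
		return false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return false
	}
	return n >= numHiddenLayers
}

func main() {
	const numHiddenLayers = 61 // sample value for illustration
	for _, name := range []string{
		"blk.60.attn_norm.weight",
		"blk.61.attn_norm.weight",
		"output_norm.weight",
	} {
		fmt.Printf("%s skip=%v\n", name, skipLayer(name, numHiddenLayers))
	}
}
```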
Tokenizer
The tokenizer pre-processor is set to deepseek-v3.
Implementation Notes
The conversion is implemented in convert/convert_deepseek2.go via the deepseek2Model struct. The expert merging logic uses the mergeTensors utility with glob-style patterns to match and stack individual expert tensors. A regex-based layer skipping function filters out extraneous prediction heads beyond the declared layer count.