Principle:Ollama Ollama GGUF Model Conversion DeepSeekOcr
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, OCR |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The DeepSeek OCR conversion handles a multimodal architecture that pairs a DeepSeek-style MoE language model with two vision encoders (CLIP and SAM) for optical character recognition. It converts the complete pipeline from HuggingFace SafeTensors to GGUF format.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
Language model:
- model.embed_tokens -> token_embd
- model.layers -> blk
- model.norm -> output_norm
- lm_head -> output
- self_attn.q_proj -> attn_q
- self_attn.k_proj -> attn_k
- self_attn.v_proj -> attn_v
- self_attn.o_proj -> attn_output
- mlp.shared_experts.{gate,up,down}_proj -> ffn_{gate,up,down}_shexp
- mlp.gate -> ffn_gate_inp
Vision encoder:
- model.vision_model -> v
- embeddings.patch_embedding -> patch_embd
- embeddings.class_embedding -> class_embd
- embeddings.position_embedding -> position_embd
- transformer.layers -> blk
SAM encoder:
- model.sam_model.patch_embed.proj -> s.patch_embd
- model.sam_model.pos_embed -> s.position_embd
- model.sam_model.blocks -> s.blk
- model.sam_model.neck -> s.neck
Projector:
- model.projector -> mm
- model.image_newline -> mm.image_newline
- model.view_seperator -> mm.view_seperator (upstream misspelling preserved)
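The replacements listed above can be sketched with Go's standard `strings.NewReplacer`. This is a minimal illustration, not the converter's actual code: only a subset of the pairs is included, the pair order is an assumption, and `newRenamer` and `rename` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// newRenamer builds a replacer from a subset of the (old, new) pairs
// described in this section. The ordering of pairs is an assumption.
func newRenamer() *strings.Replacer {
	return strings.NewReplacer(
		// Language model
		"model.embed_tokens", "token_embd",
		"model.layers", "blk",
		"model.norm", "output_norm",
		"lm_head", "output",
		"self_attn.q_proj", "attn_q",
		"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
		"mlp.gate", "ffn_gate_inp",
		// CLIP vision encoder
		"model.vision_model", "v",
		"embeddings.patch_embedding", "patch_embd",
		"transformer.layers", "blk",
		// SAM encoder
		"model.sam_model.pos_embed", "s.position_embd",
		"model.sam_model.blocks", "s.blk",
		// Projector (upstream misspelling preserved)
		"model.view_seperator", "mm.view_seperator",
	)
}

// rename maps a HuggingFace tensor name to its GGUF name.
func rename(name string) string {
	return newRenamer().Replace(name)
}

func main() {
	names := []string{
		"model.layers.0.self_attn.q_proj.weight",
		"model.layers.1.mlp.gate.weight",
		"model.vision_model.embeddings.patch_embedding.weight",
		"model.sam_model.pos_embed",
		"model.view_seperator",
	}
	for _, n := range names {
		fmt.Printf("%s -> %s\n", n, rename(n))
	}
}
```

Because the replacer scans the input once and never rescans its own output, a replacement such as `output_norm` is not rewritten again by a later pair.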
Architecture-Specific Hyperparameters
The GGUF metadata is written under the deepseekocr architecture in three groups: language-model keys with no extra prefix, CLIP keys under vision., and SAM keys under sam.
Language:
- block_count, context_length, embedding_length, feed_forward_length
- attention.head_count, attention.head_count_kv
- expert_count, expert_used_count, leading_dense_block_count
CLIP vision:
- vision.block_count, vision.embedding_length, vision.head_count
- vision.image_size, vision.patch_size
SAM:
- sam.block_count, sam.embedding_length, sam.head_count
- sam.global_attention_indexes: indices of layers using global (non-windowed) attention
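As a rough sketch of how such keys might be assembled, the snippet below prefixes each key with the architecture name, following the usual GGUF `<arch>.<key>` convention. The `params` struct, its field names, and the values in `main` are hypothetical placeholders, not the converter's actual types or the model's real hyperparameters.

```go
package main

import (
	"fmt"
	"sort"
)

// params is a hypothetical, flattened subset of the hyperparameters.
type params struct {
	BlockCount       uint32
	HeadCount        uint32
	ExpertCount      uint32
	VisionPatchSize  uint32
	SAMBlockCount    uint32
	SAMGlobalIndexes []int32
}

// metadata returns GGUF-style key/value pairs, with every
// architecture-specific key prefixed by "deepseekocr.".
func metadata(p params) map[string]any {
	const arch = "deepseekocr"
	kv := map[string]any{"general.architecture": arch}
	set := func(key string, value any) { kv[arch+"."+key] = value }

	// Language model keys (no extra prefix).
	set("block_count", p.BlockCount)
	set("attention.head_count", p.HeadCount)
	set("expert_count", p.ExpertCount)
	// CLIP keys live under "vision.".
	set("vision.patch_size", p.VisionPatchSize)
	// SAM keys live under "sam.".
	set("sam.block_count", p.SAMBlockCount)
	set("sam.global_attention_indexes", p.SAMGlobalIndexes)
	return kv
}

func main() {
	// Illustrative values only.
	kv := metadata(params{
		BlockCount:       2,
		HeadCount:        4,
		ExpertCount:      8,
		VisionPatchSize:  14,
		SAMBlockCount:    3,
		SAMGlobalIndexes: []int32{0, 2},
	})

	// Sort keys so the listing is stable across runs.
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %v\n", k, kv[k])
	}
}
```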
Special Handling
Dual Vision Encoder
The model has two separate vision encoders: a CLIP-L-14-224 encoder for semantic features and a SAM ViT-B encoder for spatial/structural features. Both are mapped to distinct GGUF namespace prefixes (v. and s.).
Expert Tensor Merging
As with DeepSeek2, individual expert tensors are merged into stacked tensors per layer for gate, up, and down projections.
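The merging step can be pictured as concatenating per-expert tensors along a new leading dimension. The sketch below assumes a simplified in-memory `tensor` type with row-major `float32` data; both the type and the `stackExperts` helper are hypothetical and stand in for whatever the converter uses internally.

```go
package main

import "fmt"

// tensor is a hypothetical minimal tensor: a shape plus row-major data.
type tensor struct {
	shape []uint64
	data  []float32
}

// sameShape reports whether two shapes are identical.
func sameShape(a, b []uint64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// stackExperts merges per-expert tensors of identical shape into one tensor
// whose new leading dimension is the number of experts. Experts are stacked
// in slice order, so the caller must pass them sorted by expert index.
func stackExperts(experts []tensor) (tensor, error) {
	if len(experts) == 0 {
		return tensor{}, fmt.Errorf("no expert tensors to merge")
	}
	first := experts[0]
	out := tensor{shape: append([]uint64{uint64(len(experts))}, first.shape...)}
	for i, e := range experts {
		if !sameShape(e.shape, first.shape) {
			return tensor{}, fmt.Errorf("expert %d has shape %v, want %v", i, e.shape, first.shape)
		}
		out.data = append(out.data, e.data...)
	}
	return out, nil
}

func main() {
	// Two illustrative 2x2 expert weight matrices for one projection.
	experts := []tensor{
		{shape: []uint64{2, 2}, data: []float32{1, 2, 3, 4}},
		{shape: []uint64{2, 2}, data: []float32{5, 6, 7, 8}},
	}
	merged, err := stackExperts(experts)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged.shape, len(merged.data))
}
```

The same routine would be applied once per layer for each of the gate, up, and down projections.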
Nested Configuration
The HuggingFace config uses nested language_config and vision_config structures, with the vision config further nesting per-encoder parameters under named keys.
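A nested layout like this maps naturally onto nested Go structs decoded with `encoding/json`. In the sketch below only `language_config` and `vision_config` come from the description above; the per-encoder key names (`clip`, `sam`), the inner field names, and the sample values are placeholders, and the real config.json may name them differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encoder holds per-encoder parameters. Field names are assumptions.
type encoder struct {
	Layers uint32 `json:"layers"`
	Heads  uint32 `json:"heads"`
	Width  uint32 `json:"width"`
}

// config mirrors the nested hierarchy of the HuggingFace config.
type config struct {
	LanguageConfig struct {
		HiddenSize      uint32 `json:"hidden_size"`
		NumHiddenLayers uint32 `json:"num_hidden_layers"`
	} `json:"language_config"`
	VisionConfig struct {
		// Placeholder keys for the two named encoders.
		CLIP encoder `json:"clip"`
		SAM  encoder `json:"sam"`
	} `json:"vision_config"`
}

// sampleJSON is an illustrative document, not the model's real config.
const sampleJSON = `{
  "language_config": {"hidden_size": 64, "num_hidden_layers": 2},
  "vision_config": {
    "clip": {"layers": 4, "heads": 2, "width": 32},
    "sam":  {"layers": 3, "heads": 2, "width": 16}
  }
}`

// parseConfig decodes the nested JSON into the struct hierarchy.
func parseConfig(data []byte) (config, error) {
	var c config
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	c, err := parseConfig([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(c.LanguageConfig.NumHiddenLayers, c.VisionConfig.CLIP.Layers, c.VisionConfig.SAM.Layers)
}
```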
Implementation Notes
The conversion is implemented in convert/convert_deepseekocr.go via the deepseekocr struct. The struct uses nested Go structs matching the HuggingFace config hierarchy. The upstream misspelling of "view_seperator" is intentionally preserved to maintain model compatibility.