Principle:Ollama Ollama GGUF Model Conversion DeepSeekOcr

From Leeroopedia
Knowledge Sources
Domains Model Conversion, OCR
Last Updated 2025-02-15 00:00 GMT

Overview

The DeepSeek OCR converter handles a multimodal architecture that pairs a DeepSeek-style mixture-of-experts (MoE) language model with two vision encoders (CLIP and SAM) for optical character recognition. It converts the complete model (language model, both vision encoders, and the projector) from HuggingFace SafeTensors to GGUF format.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

Language model:

  • model.embed_tokens -> token_embd
  • model.layers -> blk
  • model.norm -> output_norm
  • lm_head -> output
  • self_attn.q_proj -> attn_q
  • self_attn.k_proj -> attn_k
  • self_attn.v_proj -> attn_v
  • self_attn.o_proj -> attn_output
  • mlp.shared_experts.{gate,up,down}_proj -> ffn_{gate,up,down}_shexp
  • mlp.gate -> ffn_gate_inp

Vision encoder:

  • model.vision_model -> v
  • embeddings.patch_embedding -> patch_embd
  • embeddings.class_embedding -> class_embd
  • embeddings.position_embedding -> position_embd
  • transformer.layers -> blk

SAM encoder:

  • model.sam_model.patch_embed.proj -> s.patch_embd
  • model.sam_model.pos_embed -> s.position_embd
  • model.sam_model.blocks -> s.blk
  • model.sam_model.neck -> s.neck

Projector:

  • model.projector -> mm
  • model.image_newline -> mm.image_newline
  • model.view_seperator -> mm.view_seperator (upstream misspelling preserved)
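
The replacement table above can be sketched with Go's strings.NewReplacer, which rewrites every matching substring in one left-to-right pass. This is a minimal illustration, not the converter's actual code: the names tensorRenamer and renameTensor are hypothetical, and the sample tensor names in main are illustrative rather than taken from a real checkpoint.

```go
package main

import (
	"fmt"
	"strings"
)

// tensorRenamer holds the old/new pairs from the lists above.
// strings.NewReplacer scans each name once, left to right, so text that
// has already been replaced is never rewritten a second time.
var tensorRenamer = strings.NewReplacer(
	// Language model
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"lm_head", "output",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.gate", "ffn_gate_inp",
	// CLIP vision encoder
	"model.vision_model", "v",
	"embeddings.patch_embedding", "patch_embd",
	"embeddings.class_embedding", "class_embd",
	"embeddings.position_embedding", "position_embd",
	"transformer.layers", "blk",
	// SAM encoder
	"model.sam_model.patch_embed.proj", "s.patch_embd",
	"model.sam_model.pos_embed", "s.position_embd",
	"model.sam_model.blocks", "s.blk",
	"model.sam_model.neck", "s.neck",
	// Projector (the "seperator" spelling matches the upstream tensor name)
	"model.projector", "mm",
	"model.image_newline", "mm.image_newline",
	"model.view_seperator", "mm.view_seperator",
)

// renameTensor (hypothetical helper) maps a HuggingFace tensor name to its
// GGUF equivalent.
func renameTensor(name string) string {
	return tensorRenamer.Replace(name)
}

func main() {
	// Illustrative tensor names only.
	for _, name := range []string{
		"model.layers.0.self_attn.q_proj.weight",
		"model.vision_model.embeddings.patch_embedding.weight",
		"model.sam_model.neck.0.weight",
	} {
		fmt.Println(name, "->", renameTensor(name))
	}
}
```

Because the replacer works on substrings, the layer index and the trailing .weight or .bias suffix pass through unchanged.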

Architecture-Specific Hyperparameters

The GGUF metadata is written under the deepseekocr architecture in three groups: language-model keys without a sub-prefix, CLIP keys under vision., and SAM keys under sam.:

Language:

  • block_count, context_length, embedding_length, feed_forward_length
  • attention.head_count, attention.head_count_kv
  • expert_count, expert_used_count, leading_dense_block_count

CLIP vision:

  • vision.block_count, vision.embedding_length, vision.head_count
  • vision.image_size, vision.patch_size

SAM:

  • sam.block_count, sam.embedding_length, sam.head_count
  • sam.global_attention_indexes -- indices of layers using global (non-windowed) attention
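
As a rough sketch of how these names become full metadata keys, the snippet below joins each listed suffix to the architecture name. It assumes the usual GGUF convention of prefixing architecture-specific keys with the architecture name and a dot; metadataKeys is a hypothetical helper, not a function from the converter.

```go
package main

import "fmt"

// metadataKeys (hypothetical helper) returns the fully qualified GGUF keys
// for the hyperparameters listed above, assuming the "<arch>.<key>" layout.
func metadataKeys(arch string) []string {
	suffixes := []string{
		// Language model
		"block_count", "context_length", "embedding_length", "feed_forward_length",
		"attention.head_count", "attention.head_count_kv",
		"expert_count", "expert_used_count", "leading_dense_block_count",
		// CLIP vision encoder
		"vision.block_count", "vision.embedding_length", "vision.head_count",
		"vision.image_size", "vision.patch_size",
		// SAM encoder
		"sam.block_count", "sam.embedding_length", "sam.head_count",
		"sam.global_attention_indexes",
	}
	keys := make([]string, 0, len(suffixes))
	for _, s := range suffixes {
		keys = append(keys, arch+"."+s)
	}
	return keys
}

func main() {
	for _, k := range metadataKeys("deepseekocr") {
		fmt.Println(k)
	}
}
```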

Special Handling

Dual Vision Encoder

The model has two separate vision encoders: a CLIP-L-14-224 encoder for semantic features and a SAM ViT-B encoder for spatial/structural features. Both are mapped to distinct GGUF namespace prefixes (v. and s.).

Expert Tensor Merging

As in the DeepSeek2 converter, the individual per-expert tensors in each layer are merged into a single stacked tensor for each of the gate, up, and down projections.
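
The merging step can be pictured as concatenating same-shaped expert matrices along a new leading dimension. The sketch below is an assumption about the general technique, not the converter's code: stackExperts is a hypothetical helper that works on in-memory float32 slices, and it assumes each expert is stored row-major with identical dimensions.

```go
package main

import (
	"errors"
	"fmt"
)

// stackExperts (hypothetical helper) concatenates per-expert weight
// matrices, each rows x cols in row-major order, into one tensor with
// shape [numExperts, rows, cols].
func stackExperts(experts [][]float32, rows, cols int) ([]float32, []uint64, error) {
	if len(experts) == 0 {
		return nil, nil, errors.New("no expert tensors to merge")
	}
	size := rows * cols
	merged := make([]float32, 0, len(experts)*size)
	for i, e := range experts {
		// Every expert must have the same shape, or the stack is ill-defined.
		if len(e) != size {
			return nil, nil, fmt.Errorf("expert %d has %d elements, want %d", i, len(e), size)
		}
		merged = append(merged, e...)
	}
	shape := []uint64{uint64(len(experts)), uint64(rows), uint64(cols)}
	return merged, shape, nil
}

func main() {
	// Two toy 2x2 experts standing in for one layer's gate projections.
	experts := [][]float32{
		{1, 2, 3, 4},
		{5, 6, 7, 8},
	}
	data, shape, err := stackExperts(experts, 2, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(shape, data)
}
```

Running the same routine once per projection (gate, up, down) and per layer would yield three stacked tensors per MoE layer instead of three tensors per expert.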

Nested Configuration

The HuggingFace config uses nested language_config and vision_config structures, with the vision config further nesting per-encoder parameters under named keys.
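
A nested config of this shape can be decoded into nested Go structs with encoding/json. The sketch below only illustrates the pattern: language_config and vision_config come from the description above, while the encoders key, the per-encoder key names, every field name inside the sub-objects, and all values in sampleConfig are hypothetical placeholders rather than the real DeepSeek OCR config.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encoderConfig holds per-encoder parameters (field names are hypothetical).
type encoderConfig struct {
	Layers uint32 `json:"layers"`
	Width  uint32 `json:"width"`
	Heads  uint32 `json:"heads"`
}

// ocrConfig mirrors the nested layout: one sub-object for the language
// model and one for vision, which nests one entry per named encoder.
type ocrConfig struct {
	LanguageConfig struct {
		HiddenSize      uint32 `json:"hidden_size"`       // assumed field name
		NumHiddenLayers uint32 `json:"num_hidden_layers"` // assumed field name
	} `json:"language_config"`
	VisionConfig struct {
		// "encoders" is a hypothetical key used only for this sketch.
		Encoders map[string]encoderConfig `json:"encoders"`
	} `json:"vision_config"`
}

// sampleConfig contains made-up values chosen only to exercise the decoder.
const sampleConfig = `{
  "language_config": {"hidden_size": 8, "num_hidden_layers": 2},
  "vision_config": {
    "encoders": {
      "clip": {"layers": 3, "width": 16, "heads": 4},
      "sam":  {"layers": 5, "width": 32, "heads": 2}
    }
  }
}`

// parseConfig (hypothetical helper) decodes the nested JSON config.
func parseConfig(data []byte) (ocrConfig, error) {
	var cfg ocrConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.LanguageConfig.HiddenSize, cfg.VisionConfig.Encoders["sam"].Layers)
}
```

Keeping the Go types in the same shape as the JSON means each struct tag names a single key, so no custom unmarshalling is needed to reach the per-encoder values.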

Implementation Notes

The conversion is implemented in convert/convert_deepseekocr.go via the deepseekocr struct. The struct uses nested Go structs matching the HuggingFace config hierarchy. The upstream misspelling of "view_seperator" is intentionally preserved to maintain model compatibility.
