
Principle: Ollama GGUF Model Conversion (GlmOcr)

From Leeroopedia
Domains Model Conversion, OCR
Last Updated 2025-02-15 00:00 GMT

Overview

GLM OCR conversion handles the ChatGLM-based multimodal OCR model. To make this vision-language architecture compatible with the GGML runtime, the converter carries over M-RoPE (Multi-dimensional Rotary Position Embedding) parameters, splits fused gate-up projections, splits temporal patch embeddings, and permutes Q/K weights into NeoX-style rotary ordering.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

Vision encoder:

  • model.visual.patch_embed.proj -> v.patch_embd
  • model.visual.patch_embed.proj_1 -> v.patch_embd_1
  • model.visual.blocks -> v.blk
  • model.visual.post_layernorm -> v.post_ln
  • model.visual.downsample -> mm.patch_merger
  • attn.qkv -> attn_qkv
  • attn.proj -> attn_out

Merger (multimodal projector):

  • model.visual.merger.proj -> mm.model.fc
  • model.visual.merger.post_projection_norm -> mm.post_norm
  • model.visual.merger.gate_proj -> mm.gate

Language model:

  • model.language_model.embed_tokens -> token_embd
  • model.language_model.layers -> blk
  • self_attn.o_proj -> attn_out
  • mlp.gate_up_proj -> ffn_gate_up (then split)
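The mapping tables above can be sketched as an ordered string substitution over HuggingFace tensor names. The helper below is illustrative, not the converter's actual code; note that the longer `proj_1` pattern must run before `proj` to avoid a partial match:

```go
package main

import (
	"fmt"
	"strings"
)

// replacements lists HuggingFace-to-GGUF substitutions in application order.
var replacements = []struct{ from, to string }{
	{"model.visual.patch_embed.proj_1", "v.patch_embd_1"},
	{"model.visual.patch_embed.proj", "v.patch_embd"},
	{"model.visual.blocks", "v.blk"},
	{"model.visual.post_layernorm", "v.post_ln"},
	{"model.visual.downsample", "mm.patch_merger"},
	{"model.language_model.embed_tokens", "token_embd"},
	{"model.language_model.layers", "blk"},
	{"attn.qkv", "attn_qkv"},
	{"attn.proj", "attn_out"},
	{"self_attn.o_proj", "attn_out"},
	{"mlp.gate_up_proj", "ffn_gate_up"},
}

// renameTensor applies every substitution to a source tensor name.
func renameTensor(name string) string {
	for _, r := range replacements {
		name = strings.ReplaceAll(name, r.from, r.to)
	}
	return name
}

func main() {
	fmt.Println(renameTensor("model.visual.blocks.0.attn.qkv.weight"))              // v.blk.0.attn_qkv.weight
	fmt.Println(renameTensor("model.language_model.layers.3.self_attn.o_proj.weight")) // blk.3.attn_out.weight
}
```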

Architecture-Specific Hyperparameters

The GGUF metadata is written under the glmocr.* namespace:

Text:

  • glmocr.block_count, embedding_length, feed_forward_length, context_length
  • glmocr.attention.head_count, head_count_kv, key_length, value_length
  • glmocr.rope.freq_base, partial_rotary_factor, mrope_section

Vision:

  • glmocr.vision.block_count, embedding_length, out_hidden_size, intermediate_size
  • glmocr.vision.image_size, patch_size, spatial_merge_size, temporal_patch_size
  • glmocr.vision.min_pixels, max_pixels, image_mean, image_std

Special tokens:

  • glmocr.image_token_id, image_start_token_id, image_end_token_id
  • glmocr.video_token_id, video_start_token_id, video_end_token_id
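A GGUF writer consumes this metadata as a flat key/value map. The sketch below shows a few of the `glmocr.*` keys listed above being assembled; the function name and all values are placeholders, not real GLM OCR hyperparameters:

```go
package main

import "fmt"

// buildKV assembles a handful of glmocr.* metadata keys into a flat
// key/value map, as a GGUF writer would consume them. This is a sketch;
// the real converter writes many more keys, including the vision.* and
// token-id entries.
func buildKV(blockCount, headCount uint32, ropeFreqBase float32) map[string]any {
	return map[string]any{
		"glmocr.block_count":          blockCount,
		"glmocr.attention.head_count": headCount,
		"glmocr.rope.freq_base":       ropeFreqBase,
	}
}

func main() {
	kv := buildKV(40, 32, 10000) // placeholder values
	fmt.Println(kv["glmocr.block_count"], kv["glmocr.rope.freq_base"])
}
```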

Special Handling

Fused Gate-Up Splitting

The ffn_gate_up tensor is split along dimension 0 into separate ffn_gate and ffn_up tensors using the splitDim utility.
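For a row-major weight of shape [2*ffn, hidden], splitting along dimension 0 is a contiguous halving of the buffer. The helper below mirrors what the splitDim utility does in this case; the function name is illustrative:

```go
package main

import "fmt"

// splitGateUp splits a fused gate_up weight of shape [2*ffn, hidden]
// (row-major, flattened) into its gate and up halves along dimension 0.
// Because the layout is row-major, each half is a contiguous slice.
func splitGateUp(fused []float32, ffn, hidden int) (gate, up []float32) {
	half := ffn * hidden
	return fused[:half], fused[half:]
}

func main() {
	// ffn=2, hidden=3: two gate rows of 1s/2s, two up rows of 10s/20s.
	fused := []float32{1, 1, 1, 2, 2, 2, 10, 10, 10, 20, 20, 20}
	gate, up := splitGateUp(fused, 2, 3)
	fmt.Println(gate, up) // [1 1 1 2 2 2] [10 10 10 20 20 20]
}
```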

Temporal Patch Embedding Splitting

5D patch embedding weights with shape [out, in, 2, H, W] are split along the temporal dimension into two separate 4D tensors (patch_embd_0 and patch_embd_1). Pre-split variants with .0. and .1. suffixes are also handled.
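Unlike the gate-up case, the temporal axis sits in the middle of the shape, so the split gathers strided H*W planes. A minimal sketch over a flattened row-major buffer (function name illustrative; the real converter works on tensor descriptors):

```go
package main

import "fmt"

// splitTemporalPatchEmbd splits a 5D patch-embedding weight of shape
// [out, in, 2, H, W] (row-major, flattened) along the temporal dimension
// into two 4D tensors of shape [out, in, H, W].
func splitTemporalPatchEmbd(w []float32, out, in, h, wd int) (t0, t1 []float32) {
	plane := h * wd
	t0 = make([]float32, 0, out*in*plane)
	t1 = make([]float32, 0, out*in*plane)
	for oi := 0; oi < out*in; oi++ {
		base := oi * 2 * plane // each (out, in) pair holds two temporal planes
		t0 = append(t0, w[base:base+plane]...)
		t1 = append(t1, w[base+plane:base+2*plane]...)
	}
	return t0, t1
}

func main() {
	// out=1, in=1, H=2, W=2: temporal slice 0 holds 1s, slice 1 holds 2s.
	w := []float32{1, 1, 1, 1, 2, 2, 2, 2}
	t0, t1 := splitTemporalPatchEmbd(w, 1, 1, 2, 2)
	fmt.Println(t0, t1) // [1 1 1 1] [2 2 2 2]
}
```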

Q/K Weight Permutation for M-RoPE

When M-RoPE sections are present, Q and K weight tensors are permuted from interleaved (LLaMA-style) to NeoX ordering using the normalToNeoXRepacker. This reorders rotary dimensions from [0,1,2,3,4,5...] to [0,2,4...,1,3,5...] so that GGML's NeoX-style M-RoPE kernel operates correctly.
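The reordering operates on weight rows, per attention head: within each head's rotary dimensions, even rows are gathered first, then odd rows, while any non-rotary rows (from a partial rotary factor) pass through unchanged. A sketch of the idea behind normalToNeoXRepacker, not its actual code:

```go
package main

import "fmt"

// neoxPermuteRows permutes the rows of a Q or K weight from interleaved
// (LLaMA-style) to NeoX ordering, per attention head. The buffer holds
// heads*headDim rows of cols elements each, row-major; rot is the number
// of rotary dimensions per head (rot < headDim under a partial rotary
// factor, in which case rows rot..headDim-1 pass through unchanged).
func neoxPermuteRows(w []float32, heads, headDim, rot, cols int) []float32 {
	out := make([]float32, len(w))
	for h := 0; h < heads; h++ {
		base := h * headDim
		for i := 0; i < headDim; i++ {
			dst := i // non-rotary rows keep their position
			if i < rot {
				if i%2 == 0 {
					dst = i / 2 // even rotary dims come first: 0,2,4,...
				} else {
					dst = rot/2 + i/2 // then the odd dims: 1,3,5,...
				}
			}
			copy(out[(base+dst)*cols:(base+dst+1)*cols],
				w[(base+i)*cols:(base+i+1)*cols])
		}
	}
	return out
}

func main() {
	// One head, headDim=4, fully rotary, one column per row.
	w := []float32{10, 11, 12, 13}                // interleaved dims 0,1,2,3
	fmt.Println(neoxPermuteRows(w, 1, 4, 4, 1)) // [10 12 11 13]: evens, then odds
}
```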

Preprocessor Config

The converter reads preprocessor_config.json to extract image normalization parameters (mean, std) and size constraints (shortest_edge, longest_edge).
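The relevant fields can be unmarshalled into a small struct. The struct and field layout below follow the common HuggingFace `preprocessor_config.json` shape, and the JSON values in the example are sample placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preprocessorConfig captures the fields the converter reads from
// preprocessor_config.json: normalization parameters and size limits.
type preprocessorConfig struct {
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
	Size      struct {
		ShortestEdge int `json:"shortest_edge"`
		LongestEdge  int `json:"longest_edge"`
	} `json:"size"`
}

// parsePreprocessor decodes the raw JSON bytes of a preprocessor config.
func parsePreprocessor(raw []byte) (preprocessorConfig, error) {
	var cfg preprocessorConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	// Sample values only, not GLM OCR's actual configuration.
	raw := []byte(`{
		"image_mean": [0.5, 0.5, 0.5],
		"image_std":  [0.5, 0.5, 0.5],
		"size": {"shortest_edge": 112, "longest_edge": 1024}
	}`)
	cfg, err := parsePreprocessor(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Size.ShortestEdge, cfg.Size.LongestEdge) // 112 1024
}
```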

Multi-Token Prediction Layer Skipping

Layers beyond num_hidden_layers are skipped during conversion.
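The skip test amounts to parsing the layer index out of the tensor name and comparing it against num_hidden_layers. A simplified sketch (the name parsing here is illustrative):

```go
package main

import "fmt"

// skipMTPLayer reports whether a tensor name refers to a layer index at or
// beyond num_hidden_layers (i.e. a multi-token-prediction layer), which
// the converter drops. Names that do not match the layer pattern are kept.
func skipMTPLayer(name string, numHiddenLayers int) bool {
	var idx int
	if _, err := fmt.Sscanf(name, "model.language_model.layers.%d.", &idx); err != nil {
		return false
	}
	return idx >= numHiddenLayers
}

func main() {
	fmt.Println(skipMTPLayer("model.language_model.layers.40.mlp.gate_up_proj.weight", 40)) // true
	fmt.Println(skipMTPLayer("model.language_model.layers.12.self_attn.o_proj.weight", 40)) // false
}
```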

Implementation Notes

The conversion is implemented in convert/convert_glmocr.go via the glmOcrModel struct which satisfies both ModelConverter and moreParser interfaces. The normalToNeoXRepacker is a standalone function that performs per-head rotary dimension reordering, handling both weight (2D) and bias (1D) tensors with support for partial rotary factors.
