

Principle:Ollama Ollama GGUF Model Conversion NomicBert

From Leeroopedia
Knowledge Sources
Domains Model Conversion, Embeddings
Last Updated 2025-02-15 00:00 GMT

Overview

Nomic BERT conversion handles the Nomic AI variant of BERT. This variant extends context length with rotary position embeddings (RoPE), can optionally use Mixture-of-Experts (MoE) feed-forward layers, and fuses the query, key, and value attention projections into a single QKV tensor. The converter transforms this embedding model from HuggingFace SafeTensors to GGUF format, carries over the pooling configuration, and rewrites the vocabulary for phantom-space tokenization.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • encoder.layer / encoder.layers -> blk
  • embeddings.word_embeddings -> token_embd
  • embeddings.token_type_embeddings -> token_types
  • embeddings.LayerNorm -> token_embd_norm
  • attention.self.qkv -> attn_qkv (fused QKV)
  • attention.output.dense -> attn_output
  • attention.output.LayerNorm -> attn_output_norm
  • mlp.up -> ffn_up
  • mlp.down -> ffn_down
  • mlp.router -> ffn_gate_inp (MoE router)
  • mlp.experts.up -> ffn_up_exps (MoE expert up projections)
  • mlp.experts.down -> ffn_down_exps (MoE expert down projections)
  • intermediate.dense -> ffn_up (fallback)
  • output.dense -> ffn_down (fallback)
  • output.LayerNorm -> layer_output_norm
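The table above can be sketched as a single substring replacer. This is a minimal sketch that assumes the mapping is applied with Go's strings.NewReplacer. The pair order is chosen here for correctness and is not copied from convert_nomicbert.go. Two ordering details matter. When several old strings match at the same position, strings.Replacer prefers the pair listed first, so "encoder.layers" must precede "encoder.layer". Matches are also found left to right, so "attention.output.dense" is consumed before the shorter "output.dense" fallback can match inside it.

```go
package main

import (
	"fmt"
	"strings"
)

// nomicBertNames mirrors the HuggingFace-to-GGUF table above (illustrative sketch).
// "encoder.layers" is listed before "encoder.layer" because strings.Replacer
// prefers earlier pairs when both match at the same position.
var nomicBertNames = strings.NewReplacer(
	"encoder.layers", "blk",
	"encoder.layer", "blk",
	"embeddings.word_embeddings", "token_embd",
	"embeddings.token_type_embeddings", "token_types",
	"embeddings.LayerNorm", "token_embd_norm",
	"attention.self.qkv", "attn_qkv", // fused QKV
	"attention.output.dense", "attn_output",
	"attention.output.LayerNorm", "attn_output_norm",
	"mlp.router", "ffn_gate_inp", // MoE router
	"mlp.experts.up", "ffn_up_exps", // MoE expert up projections
	"mlp.experts.down", "ffn_down_exps", // MoE expert down projections
	"mlp.up", "ffn_up",
	"mlp.down", "ffn_down",
	"intermediate.dense", "ffn_up", // fallback
	"output.dense", "ffn_down", // fallback
	"output.LayerNorm", "layer_output_norm",
)

func main() {
	for _, name := range []string{
		"encoder.layers.0.attention.self.qkv.weight",
		"encoder.layers.1.mlp.experts.up",
		"encoder.layer.2.output.LayerNorm.bias",
	} {
		fmt.Println(name, "->", nomicBertNames.Replace(name))
	}
}
```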

Architecture-Specific Hyperparameters

The GGUF metadata uses architecture-prefixed keys (either nomic-bert or nomic-bert-moe):

  • attention.causal -- set to false (bidirectional)
  • pooling_type -- 0 (none), 1 (mean), or 2 (CLS)
  • normalize_embeddings -- L2 normalization flag
  • block_count -- from n_layers or num_hidden_layers
  • context_length -- max position embeddings (extended via RoPE)
  • embedding_length, feed_forward_length -- hidden size and feed-forward intermediate size
  • attention.head_count, attention.head_count_kv -- query and key/value head counts (GQA support)
  • attention.layer_norm_epsilon -- LayerNorm epsilon
  • rope.freq_base -- RoPE theta

MoE parameters (when present):

  • expert_count -- number of local experts
  • expert_used_count -- experts per token
  • moe_every_n_layers -- MoE layer frequency

Special Handling

Dynamic Architecture Selection

The GGUF architecture identifier is dynamically set based on whether MoE parameters are present. If moe_every_n_layers > 0, the architecture is nomic-bert-moe; otherwise it is nomic-bert.
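The selection rule, together with the MoE keys from the hyperparameter section, fits in a few lines. This is a minimal sketch. The moeParams struct and the choice to emit the MoE keys only for nomic-bert-moe are assumptions made for illustration, consistent with "MoE parameters (when present)" but not confirmed against the converter source.

```go
package main

import "fmt"

// moeParams is a hypothetical container for the MoE hyperparameters.
type moeParams struct {
	NumExperts      uint32 // expert_count
	NumExpertsUsed  uint32 // expert_used_count (experts per token)
	MoeEveryNLayers uint32 // moe_every_n_layers; 0 means a dense (v1) model
}

// architecture picks the GGUF architecture identifier.
func architecture(p moeParams) string {
	if p.MoeEveryNLayers > 0 {
		return "nomic-bert-moe"
	}
	return "nomic-bert"
}

// moeKV returns the architecture key plus, for MoE models only, the
// architecture-prefixed MoE keys.
func moeKV(p moeParams) map[string]any {
	arch := architecture(p)
	kv := map[string]any{"general.architecture": arch}
	if arch == "nomic-bert-moe" {
		kv[arch+".expert_count"] = p.NumExperts
		kv[arch+".expert_used_count"] = p.NumExpertsUsed
		kv[arch+".moe_every_n_layers"] = p.MoeEveryNLayers
	}
	return kv
}

func main() {
	fmt.Println(moeKV(moeParams{}))
	fmt.Println(moeKV(moeParams{NumExperts: 8, NumExpertsUsed: 2, MoeEveryNLayers: 2}))
}
```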

RoPE-Based Extended Context

Unlike standard BERT, which uses absolute position embeddings (limited to 512 tokens), Nomic BERT uses rotary position embeddings, enabling context lengths of 2048 or 8192 tokens. The rope_theta frequency base is stored in GGUF metadata as rope.freq_base.

Fused QKV Attention

Nomic BERT uses a single fused attention.self.qkv projection instead of separate Q, K, and V projections. This projection maps to the attn_qkv GGUF tensor name.
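The sketch below illustrates what a fused tensor contains. It assumes that the three projections are stacked along the output dimension in Q, K, V order, which is a common layout that this page does not confirm. The helper computes the row range of each projection. The converter keeps the tensor fused, so these ranges are only illustrative of its shape.

```go
package main

import "fmt"

// qkvRowRanges returns [start, end) output-row ranges for Q, K and V in a
// fused QKV weight, assuming Q, K, V are stacked along the output axis.
func qkvRowRanges(hidden, nHead, nHeadKV int) [3][2]int {
	headDim := hidden / nHead
	q := nHead * headDim    // rows for the query projection
	kv := nHeadKV * headDim // rows for each of the key and value projections
	return [3][2]int{{0, q}, {q, q + kv}, {q + kv, q + 2*kv}}
}

func main() {
	r := qkvRowRanges(768, 12, 12)
	fmt.Println("Q", r[0], "K", r[1], "V", r[2], "total rows", r[2][1])
}
```

With equal query and key/value head counts, the fused tensor has 3 × hidden output rows. With fewer KV heads (GQA), the K and V slices shrink accordingly.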

Pooling Configuration

As with standard BERT, the converter reads modules.json to determine the Sentence Transformers pooling mode and whether embeddings are L2-normalized.

Phantom Space Tokenization

Vocabulary entries go through the same WordPiece-to-phantom-space conversion as standard BERT: special tokens are kept as-is, the ## continuation prefix is stripped, and all other tokens get a U+2581 (▁) prefix.
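The three rules map directly onto a small function. This is a minimal sketch that assumes special tokens are recognized by the bracketed [CLS]/[SEP]/[PAD] convention. The page does not say how the converter identifies them.

```go
package main

import (
	"fmt"
	"strings"
)

// toPhantomSpace converts one WordPiece vocabulary entry.
// Assumption: special tokens look like "[CLS]", "[SEP]", "[PAD]", ...
func toPhantomSpace(tok string) string {
	switch {
	case strings.HasPrefix(tok, "[") && strings.HasSuffix(tok, "]"):
		return tok // special token: unchanged
	case strings.HasPrefix(tok, "##"):
		return strings.TrimPrefix(tok, "##") // continuation piece: drop the marker
	default:
		return "\u2581" + tok // word-initial piece: prepend the phantom space
	}
}

func main() {
	for _, t := range []string{"[CLS]", "play", "##ing"} {
		fmt.Printf("%q -> %q\n", t, toPhantomSpace(t))
	}
}
```

After conversion, word boundaries are marked on the word-initial piece rather than on continuations, which matches the SentencePiece-style convention that U+2581 denotes.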

Skipped Tensors

As with BERT, the tensors embeddings.position_ids, pooler.dense.weight, and pooler.dense.bias are excluded from the output.

Implementation Notes

The conversion is implemented in convert/convert_nomicbert.go via the nomicbertModel struct, which satisfies both the ModelConverter and moreParser interfaces. The struct supports both the v1 (dense FFN) and v2 (MoE FFN) Nomic BERT variants through conditional parameter handling.
