Principle:Ollama Ollama GGUF Model Conversion NomicBert

Knowledge Sources	Ollama
Domains	Model Conversion, Embeddings
Last Updated	2025-02-15 00:00 GMT

Overview

Nomic BERT conversion handles Nomic AI's BERT variant, which adds extended context length via rotary position embeddings (RoPE), optional Mixture-of-Experts (MoE) layers, and fused QKV attention projections. It transforms the embedding model from HuggingFace SafeTensors to GGUF format, carrying over the pooling configuration and applying phantom-space tokenization.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

encoder.layer / encoder.layers -> blk
embeddings.word_embeddings -> token_embd
embeddings.token_type_embeddings -> token_types
embeddings.LayerNorm -> token_embd_norm
attention.self.qkv -> attn_qkv (fused QKV)
attention.output.dense -> attn_output
attention.output.LayerNorm -> attn_output_norm
mlp.up -> ffn_up
mlp.down -> ffn_down
mlp.router -> ffn_gate_inp (MoE router)
mlp.experts.up -> ffn_up_exps (MoE expert up projections)
mlp.experts.down -> ffn_down_exps (MoE expert down projections)
intermediate.dense -> ffn_up (fallback)
output.dense -> ffn_down (fallback)
output.LayerNorm -> layer_output_norm
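
As a rough illustration of how such a mapping can be applied, the sketch below feeds the pairs listed above to Go's strings.NewReplacer. The function name newTensorReplacer and the sample tensor names are hypothetical and are not taken from the Ollama source. Pair order matters here: at each position the replacer tries old strings in argument order, so encoder.layers is listed before its prefix encoder.layer.

```go
package main

import (
	"fmt"
	"strings"
)

// newTensorReplacer (hypothetical name) builds a replacer from the
// HuggingFace-to-GGUF pairs listed in this section.
// "encoder.layers" precedes "encoder.layer" so the trailing "s" is consumed.
func newTensorReplacer() *strings.Replacer {
	return strings.NewReplacer(
		"encoder.layers", "blk",
		"encoder.layer", "blk",
		"embeddings.word_embeddings", "token_embd",
		"embeddings.token_type_embeddings", "token_types",
		"embeddings.LayerNorm", "token_embd_norm",
		"attention.self.qkv", "attn_qkv",
		"attention.output.dense", "attn_output",
		"attention.output.LayerNorm", "attn_output_norm",
		"mlp.up", "ffn_up",
		"mlp.down", "ffn_down",
		"mlp.router", "ffn_gate_inp",
		"mlp.experts.up", "ffn_up_exps",
		"mlp.experts.down", "ffn_down_exps",
		"intermediate.dense", "ffn_up",
		"output.dense", "ffn_down",
		"output.LayerNorm", "layer_output_norm",
	)
}

func main() {
	r := newTensorReplacer()
	// Sample names are illustrative; real checkpoints may differ.
	for _, name := range []string{
		"embeddings.word_embeddings.weight",
		"encoder.layers.0.attention.self.qkv.weight",
		"encoder.layer.3.attention.output.dense.weight",
		"encoder.layers.1.mlp.experts.up.weight",
	} {
		fmt.Println(name, "->", r.Replace(name))
	}
}
```

Because attention.output.dense starts earlier in a tensor name than output.dense, the longer pattern is matched first and the fallback rule only fires for names without the attention prefix.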

Architecture-Specific Hyperparameters

The GGUF metadata uses architecture-prefixed keys (either nomic-bert or nomic-bert-moe):

attention.causal -- set to false (bidirectional)
pooling_type -- 0 (none), 1 (mean), or 2 (CLS)
normalize_embeddings -- L2 normalization flag
block_count -- from n_layers or num_hidden_layers
context_length -- max position embeddings (extended via RoPE)
embedding_length, feed_forward_length
attention.head_count, head_count_kv (GQA support)
attention.layer_norm_epsilon -- LayerNorm epsilon
rope.freq_base -- RoPE theta

MoE parameters (when present):

expert_count -- number of local experts
expert_used_count -- experts per token
moe_every_n_layers -- MoE layer frequency
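
The key layout described above can be sketched as a small map builder. This is a minimal illustration, not the converter's actual code: the hparams struct, its field names, and the helpers firstNonZero and metadata are hypothetical, and treating moe_every_n_layers > 0 as the test for "MoE parameters present" is an assumption.

```go
package main

import "fmt"

// hparams is a hypothetical subset of the model configuration.
type hparams struct {
	NLayers         uint32 // n_layers
	NumHiddenLayers uint32 // num_hidden_layers
	ExpertCount     uint32
	ExpertUsedCount uint32
	MoEEveryNLayers uint32
}

// firstNonZero returns the first non-zero value, or zero if none is set.
func firstNonZero(vals ...uint32) uint32 {
	for _, v := range vals {
		if v != 0 {
			return v
		}
	}
	return 0
}

// metadata builds architecture-prefixed keys for a few of the listed fields.
func metadata(arch string, p hparams) map[string]any {
	kv := map[string]any{
		arch + ".attention.causal": false, // bidirectional attention
		arch + ".block_count":      firstNonZero(p.NLayers, p.NumHiddenLayers),
	}
	// Assumption: MoE keys are written only when a MoE frequency is set.
	if p.MoEEveryNLayers > 0 {
		kv[arch+".expert_count"] = p.ExpertCount
		kv[arch+".expert_used_count"] = p.ExpertUsedCount
		kv[arch+".moe_every_n_layers"] = p.MoEEveryNLayers
	}
	return kv
}

func main() {
	dense := metadata("nomic-bert", hparams{NumHiddenLayers: 12})
	moe := metadata("nomic-bert-moe", hparams{
		NLayers: 12, ExpertCount: 8, ExpertUsedCount: 2, MoEEveryNLayers: 2,
	})
	fmt.Println(dense)
	fmt.Println(moe)
}
```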

Special Handling

Dynamic Architecture Selection

The GGUF architecture identifier is dynamically set based on whether MoE parameters are present. If moe_every_n_layers > 0, the architecture is nomic-bert-moe; otherwise it is nomic-bert.
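
The selection rule is small enough to state directly in code. The function name architecture is hypothetical; only the two identifiers and the moe_every_n_layers > 0 condition come from the description above.

```go
package main

import "fmt"

// architecture (hypothetical name) picks the GGUF architecture identifier
// from the MoE layer frequency.
func architecture(moeEveryNLayers uint32) string {
	if moeEveryNLayers > 0 {
		return "nomic-bert-moe"
	}
	return "nomic-bert"
}

func main() {
	fmt.Println(architecture(0))
	fmt.Println(architecture(2))
}
```

The returned string also serves as the prefix for every architecture-specific metadata key, so the choice has to be made before any hyperparameters are written.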

RoPE-Based Extended Context

Unlike standard BERT, which uses absolute position embeddings (limited to 512 tokens), Nomic BERT uses rotary position embeddings, enabling context lengths of 2048 or 8192 tokens. The rope_theta frequency base is stored in GGUF metadata as rope.freq_base.
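
To show what the stored frequency base is used for, the sketch below computes the per-pair rotation frequencies in the standard RoPE formulation, base^(-2i/d) for i in [0, d/2). The converter itself only records the base; the function name ropeFrequencies and the example values (base 10000, head dimension 64) are illustrative assumptions rather than Nomic BERT's actual settings.

```go
package main

import (
	"fmt"
	"math"
)

// ropeFrequencies returns base^(-2i/headDim) for each of the headDim/2
// rotated dimension pairs (standard RoPE formulation).
func ropeFrequencies(base float64, headDim int) []float64 {
	freqs := make([]float64, headDim/2)
	for i := range freqs {
		freqs[i] = math.Pow(base, -2*float64(i)/float64(headDim))
	}
	return freqs
}

func main() {
	freqs := ropeFrequencies(10000, 64) // illustrative values
	fmt.Println(len(freqs), freqs[0], freqs[len(freqs)-1])
}
```

A token at position p rotates pair i by the angle p times freqs[i], which is why no learned position table (and no fixed 512-token limit) is needed.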

Fused QKV Attention

Nomic BERT uses a fused attention.self.qkv projection instead of separate Q, K, V projections, mapping to the attn_qkv GGUF tensor name.

Pooling Configuration

Same as standard BERT: reads modules.json for Sentence Transformers pooling mode and normalization settings.
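
A minimal sketch of this lookup follows, assuming the usual Sentence Transformers layout: modules.json lists modules by type and path, and the pooling module keeps its flags in config.json under that path. The function poolingSettings and the in-memory files map are hypothetical stand-ins for reading from the model directory; the numeric pooling codes follow the list in the hyperparameter section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type module struct {
	Type string `json:"type"`
	Path string `json:"path"`
}

type poolingConfig struct {
	MeanTokens bool `json:"pooling_mode_mean_tokens"`
	CLSToken   bool `json:"pooling_mode_cls_token"`
}

// poolingSettings (hypothetical) derives the pooling type and the
// normalization flag from files keyed by relative path.
func poolingSettings(files map[string][]byte) (uint32, bool, error) {
	var modules []module
	if err := json.Unmarshal(files["modules.json"], &modules); err != nil {
		return 0, false, err
	}
	var poolingType uint32 // 0 = none
	var normalize bool
	for _, m := range modules {
		switch m.Type {
		case "sentence_transformers.models.Pooling":
			var pc poolingConfig
			if err := json.Unmarshal(files[m.Path+"/config.json"], &pc); err != nil {
				return 0, false, err
			}
			switch {
			case pc.MeanTokens:
				poolingType = 1 // mean
			case pc.CLSToken:
				poolingType = 2 // CLS
			}
		case "sentence_transformers.models.Normalize":
			normalize = true
		}
	}
	return poolingType, normalize, nil
}

func main() {
	files := map[string][]byte{
		"modules.json": []byte(`[
			{"type": "sentence_transformers.models.Transformer", "path": ""},
			{"type": "sentence_transformers.models.Pooling", "path": "1_Pooling"},
			{"type": "sentence_transformers.models.Normalize", "path": "2_Normalize"}
		]`),
		"1_Pooling/config.json": []byte(`{"pooling_mode_mean_tokens": true}`),
	}
	fmt.Println(poolingSettings(files))
}
```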

Phantom Space Tokenization

The same WordPiece-to-phantom-space conversion as standard BERT applies: special tokens are kept as-is, the ## continuation prefix is stripped, and all other tokens get a U+2581 (▁) prefix.
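
The three rules can be sketched as a single pass over the vocabulary. The function name phantomSpace is hypothetical, and recognizing special tokens by their surrounding square brackets (as in [CLS] or [SEP]) is an assumption made for this illustration; the text above only says that special tokens are left unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// phantomSpace (hypothetical name) rewrites WordPiece tokens so that
// word-initial pieces carry a U+2581 marker instead of continuations
// carrying a "##" marker.
func phantomSpace(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, t := range tokens {
		switch {
		case strings.HasPrefix(t, "[") && strings.HasSuffix(t, "]"):
			// Assumption: bracketed tokens are special and stay as-is.
			out[i] = t
		case strings.HasPrefix(t, "##"):
			// Continuation piece: drop the WordPiece marker.
			out[i] = strings.TrimPrefix(t, "##")
		default:
			// Word-initial piece: add the phantom-space marker.
			out[i] = "\u2581" + t
		}
	}
	return out
}

func main() {
	fmt.Println(phantomSpace([]string{"[CLS]", "embed", "##ding", "[SEP]"}))
}
```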

Skipped Tensors

Same as BERT: embeddings.position_ids, pooler.dense.weight, and pooler.dense.bias are excluded.
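
The exclusion can be pictured as a simple filter over tensor names. This sketch assumes exact-name matching against the three tensors named above; the identifiers skipped and filterTensors are hypothetical, and real checkpoints may carry name prefixes that would require a different match.

```go
package main

import "fmt"

// skipped holds the tensor names that are excluded from the GGUF output.
var skipped = map[string]struct{}{
	"embeddings.position_ids": {},
	"pooler.dense.weight":     {},
	"pooler.dense.bias":       {},
}

// filterTensors (hypothetical name) returns the names that should be kept.
func filterTensors(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if _, skip := skipped[n]; skip {
			continue
		}
		kept = append(kept, n)
	}
	return kept
}

func main() {
	fmt.Println(filterTensors([]string{
		"embeddings.position_ids",
		"embeddings.word_embeddings.weight",
		"pooler.dense.weight",
	}))
}
```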

Implementation Notes

The conversion is implemented in convert/convert_nomicbert.go via the nomicbertModel struct, which satisfies both the ModelConverter and moreParser interfaces. The struct supports both the v1 (dense FFN) and v2 (MoE FFN) Nomic BERT variants through conditional parameter handling.

Related Pages

Implementation:Ollama_Ollama_Convert_NomicBert
