Principle:Ollama Ollama GGUF Model Conversion Lfm2

Knowledge Sources	Ollama
Domains	Model Conversion, LFM
Last Updated	2025-02-15 00:00 GMT

Overview

LFM-2 (Liquid Foundation Model 2) uses a hybrid architecture that interleaves short convolution layers with full attention layers. The converter transforms the model from HuggingFace SafeTensors to GGUF format, encodes each layer's type in a per-layer KV head count array, and squeezes the 3D convolution weights to 2D.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

model.embed_tokens -> token_embd
model.embedding_norm -> output_norm
model.layers -> blk
operator_norm -> attn_norm
self_attn.q_proj -> attn_q
self_attn.k_proj -> attn_k
self_attn.v_proj -> attn_v
self_attn.out_proj -> attn_output
self_attn.q_layernorm -> attn_q_norm
self_attn.k_layernorm -> attn_k_norm
conv.conv -> shortconv.conv
conv.in_proj -> shortconv.in_proj
conv.out_proj -> shortconv.out_proj
feed_forward.w1 -> ffn_gate
feed_forward.w2 -> ffn_down
feed_forward.w3 -> ffn_up
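
The replacements above can be sketched with Go's standard strings.NewReplacer, which takes old/new pairs and applies them in a single pass. This is a minimal illustration, not the code from convert/convert_lfm2.go; the names lfm2Replacer and ggufName are hypothetical, and the example tensor names in the comments are assumed to follow the usual model.layers.N.<module>.weight layout.

```go
package main

import "strings"

// lfm2Replacer holds the HuggingFace -> GGUF substring pairs from the list
// above. The variable name is hypothetical.
var lfm2Replacer = strings.NewReplacer(
	"model.embed_tokens", "token_embd",
	"model.embedding_norm", "output_norm",
	"model.layers", "blk",
	"operator_norm", "attn_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.out_proj", "attn_output",
	"self_attn.q_layernorm", "attn_q_norm",
	"self_attn.k_layernorm", "attn_k_norm",
	"conv.conv", "shortconv.conv",
	"conv.in_proj", "shortconv.in_proj",
	"conv.out_proj", "shortconv.out_proj",
	"feed_forward.w1", "ffn_gate",
	"feed_forward.w2", "ffn_down",
	"feed_forward.w3", "ffn_up",
)

// ggufName rewrites one HuggingFace tensor name to its GGUF equivalent.
// Substrings that match no pair (layer index, ".weight") pass through as-is.
func ggufName(hfName string) string {
	return lfm2Replacer.Replace(hfName)
}
```

Under these assumptions, ggufName("model.layers.0.self_attn.q_proj.weight") yields "blk.0.attn_q.weight", and ggufName("model.layers.2.conv.conv.weight") yields "blk.2.shortconv.conv.weight".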

Architecture-Specific Hyperparameters

The GGUF metadata is written under the lfm2.* namespace:

lfm2.vocab_size -- vocabulary size
lfm2.block_count -- number of hidden layers
lfm2.embedding_length -- hidden size
lfm2.feed_forward_length -- intermediate size
lfm2.context_length -- maximum position embeddings
lfm2.attention.head_count -- number of attention heads
lfm2.attention.head_count_kv -- per-layer array (0 for conv layers, num_kv_heads for attention layers)
lfm2.attention.key_length, lfm2.attention.value_length -- per-head dimension, computed as hidden_size divided by num_attention_heads
lfm2.attention.layer_norm_rms_epsilon -- normalization epsilon
lfm2.rope.freq_base -- RoPE theta
lfm2.shortconv.l_cache -- convolution cache length
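
As a rough sketch of how the scalar keys above could be assembled, the following builds a key/value map from a parsed config. The lfm2Config struct, its field names, and the lfm2Metadata function are hypothetical stand-ins rather than the actual lfm2Model definition; the per-layer head_count_kv array is left out here because it is derived separately.

```go
package main

// lfm2Config is a hypothetical stand-in for the parsed HuggingFace config.
// Field names are illustrative, not the actual Ollama struct.
type lfm2Config struct {
	VocabSize             uint32
	NumHiddenLayers       uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	MaxPositionEmbeddings uint32
	NumAttentionHeads     uint32
	NormEps               float32
	RopeTheta             float32
	ConvLCache            uint32
}

// lfm2Metadata returns the scalar lfm2.* entries listed above.
func lfm2Metadata(c lfm2Config) map[string]any {
	// Key and value length share the per-head dimension.
	var headDim uint32
	if c.NumAttentionHeads > 0 { // guard against division by zero
		headDim = c.HiddenSize / c.NumAttentionHeads
	}
	return map[string]any{
		"lfm2.vocab_size":                       c.VocabSize,
		"lfm2.block_count":                      c.NumHiddenLayers,
		"lfm2.embedding_length":                 c.HiddenSize,
		"lfm2.feed_forward_length":              c.IntermediateSize,
		"lfm2.context_length":                   c.MaxPositionEmbeddings,
		"lfm2.attention.head_count":             c.NumAttentionHeads,
		"lfm2.attention.key_length":             headDim,
		"lfm2.attention.value_length":           headDim,
		"lfm2.attention.layer_norm_rms_epsilon": c.NormEps,
		"lfm2.rope.freq_base":                   c.RopeTheta,
		"lfm2.shortconv.l_cache":                c.ConvLCache,
	}
}
```

For example, a config with a hidden size of 2048 and 32 attention heads gives a key and value length of 64.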

Special Handling

Per-Layer KV Head Count Array

The layer_types string array from the config (containing "full_attention" or other types) is converted into a per-layer uint32 array for attention.head_count_kv. Attention layers get the actual num_key_value_heads value while short convolution layers get 0, allowing the runtime to dispatch the correct operator per layer.
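
A minimal sketch of that conversion follows. The function name kvHeadsPerLayer is hypothetical, and the "conv" label used in the usage note is only an assumed example of a non-attention layer type; the logic treats any value other than "full_attention" as a convolution layer.

```go
package main

// kvHeadsPerLayer maps each entry of layer_types to a KV head count:
// numKVHeads for "full_attention" layers and 0 for every other layer type.
func kvHeadsPerLayer(layerTypes []string, numKVHeads uint32) []uint32 {
	// make zero-initializes the slice, so non-attention layers stay at 0.
	out := make([]uint32, len(layerTypes))
	for i, t := range layerTypes {
		if t == "full_attention" {
			out[i] = numKVHeads
		}
	}
	return out
}
```

With layer types ["conv", "conv", "full_attention", "conv"] and 8 KV heads, the result is [0, 0, 8, 0], so a runtime can treat a zero entry as "no attention in this layer".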

Convolution Weight Squeezing

Short convolution weights with shape [D, 1, K] (3D with a singleton middle dimension) are squeezed to [D, K] (2D) for GGUF compatibility.
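
The shape rewrite can be sketched as below. The helper name squeezeConvShape is hypothetical, and the sketch covers only the shape metadata: dropping a size-1 dimension does not change the order of elements in a contiguous buffer, so the weight data itself needs no rearrangement.

```go
package main

// squeezeConvShape drops the singleton middle dimension of a [D, 1, K]
// shape, returning [D, K]. Any other shape is returned unchanged.
func squeezeConvShape(shape []uint64) []uint64 {
	if len(shape) == 3 && shape[1] == 1 {
		return []uint64{shape[0], shape[2]}
	}
	return shape
}
```

For instance, a shape of [2048, 1, 3] becomes [2048, 3], while an already 2D shape such as [2048, 3] passes through untouched.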

Unique Normalization Naming

LFM-2 uses embedding_norm for the output normalization (instead of the typical model.norm) and operator_norm for the pre-attention normalization.

Implementation Notes

The conversion is implemented in convert/convert_lfm2.go via the lfm2Model struct. The architecture uses SwiGLU-style feed-forward networks with w1/w2/w3 naming (gate/down/up). The hybrid layer design allows the model to use cheap convolution operations for most layers while reserving expensive attention for periodic global context aggregation.

Related Pages

Implementation:Ollama_Ollama_Convert_Lfm2
