

Principle:Ollama Ollama GGUF Model Conversion Lfm2

From Leeroopedia
Knowledge Sources
Domains Model Conversion, LFM
Last Updated 2025-02-15 00:00 GMT

Overview

LFM-2 (Liquid Foundation Model 2) conversion handles a hybrid architecture that alternates between short convolution layers and full attention layers. The converter transforms the model from HuggingFace SafeTensors to GGUF format, encodes each layer's type in a per-layer KV head count array, and squeezes the short convolution weights from 3D to 2D.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • model.embed_tokens -> token_embd
  • model.embedding_norm -> output_norm
  • model.layers -> blk
  • operator_norm -> attn_norm
  • self_attn.q_proj -> attn_q
  • self_attn.k_proj -> attn_k
  • self_attn.v_proj -> attn_v
  • self_attn.out_proj -> attn_output
  • self_attn.q_layernorm -> attn_q_norm
  • self_attn.k_layernorm -> attn_k_norm
  • conv.conv -> shortconv.conv
  • conv.in_proj -> shortconv.in_proj
  • conv.out_proj -> shortconv.out_proj
  • feed_forward.w1 -> ffn_gate
  • feed_forward.w2 -> ffn_down
  • feed_forward.w3 -> ffn_up
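The mapping above can be sketched in Go as an ordered set of substring replacements. This is a minimal illustration, not the converter's actual code: the use of strings.NewReplacer, the helper name mapTensorName, and the sample tensor names are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// lfm2Replacer holds the HuggingFace-to-GGUF name fragments from the list above.
// Building it with strings.NewReplacer is an assumption of this sketch.
var lfm2Replacer = strings.NewReplacer(
	"model.embed_tokens", "token_embd",
	"model.embedding_norm", "output_norm",
	"model.layers", "blk",
	"operator_norm", "attn_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.out_proj", "attn_output",
	"self_attn.q_layernorm", "attn_q_norm",
	"self_attn.k_layernorm", "attn_k_norm",
	"conv.conv", "shortconv.conv",
	"conv.in_proj", "shortconv.in_proj",
	"conv.out_proj", "shortconv.out_proj",
	"feed_forward.w1", "ffn_gate",
	"feed_forward.w2", "ffn_down",
	"feed_forward.w3", "ffn_up",
)

// mapTensorName (hypothetical helper) rewrites one tensor name.
// The replacer scans the input once, so replaced text is never re-matched.
func mapTensorName(name string) string {
	return lfm2Replacer.Replace(name)
}

func main() {
	// Illustrative tensor names; real checkpoints may differ.
	names := []string{
		"model.embed_tokens.weight",
		"model.layers.0.conv.conv.weight",
		"model.layers.2.self_attn.q_layernorm.weight",
		"model.layers.2.feed_forward.w1.weight",
	}
	for _, n := range names {
		fmt.Printf("%s -> %s\n", n, mapTensorName(n))
	}
}
```

Because the replacements are plain substring substitutions, the layer index and the trailing .weight suffix pass through unchanged.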

Architecture-Specific Hyperparameters

The GGUF metadata is written under the lfm2.* namespace:

  • lfm2.vocab_size -- vocabulary size
  • lfm2.block_count -- number of hidden layers
  • lfm2.embedding_length -- hidden size
  • lfm2.feed_forward_length -- intermediate size
  • lfm2.context_length -- maximum position embeddings
  • lfm2.attention.head_count -- number of attention heads
  • lfm2.attention.head_count_kv -- per-layer array (0 for conv layers, num_kv_heads for attention layers)
  • lfm2.attention.key_length and lfm2.attention.value_length -- both set to the head dimension, computed as hidden_size divided by num_attention_heads
  • lfm2.attention.layer_norm_rms_epsilon -- normalization epsilon
  • lfm2.rope.freq_base -- RoPE theta
  • lfm2.shortconv.l_cache -- convolution cache length
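As a rough sketch of how these keys might be assembled, the Go snippet below fills a metadata map from a simplified config struct. The struct lfm2Config, its field names, the helper buildKV, and the numeric values in main are all hypothetical; only the key names and the head-dimension derivation come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// lfm2Config is a hypothetical, simplified view of the model config.
// Field names are assumptions for this sketch.
type lfm2Config struct {
	VocabSize             uint32
	NumHiddenLayers       uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	MaxPositionEmbeddings uint32
	NumAttentionHeads     uint32
	NormEps               float32
	RopeTheta             float32
	ConvLCache            uint32
}

// buildKV (hypothetical helper) maps config fields to lfm2.* metadata keys.
// lfm2.attention.head_count_kv is a per-layer array and is omitted here.
func buildKV(c lfm2Config) (map[string]any, error) {
	if c.NumAttentionHeads == 0 {
		return nil, errors.New("num_attention_heads must be non-zero")
	}
	// Key and value lengths share the same per-head dimension.
	headDim := c.HiddenSize / c.NumAttentionHeads
	return map[string]any{
		"lfm2.vocab_size":                       c.VocabSize,
		"lfm2.block_count":                      c.NumHiddenLayers,
		"lfm2.embedding_length":                 c.HiddenSize,
		"lfm2.feed_forward_length":              c.IntermediateSize,
		"lfm2.context_length":                   c.MaxPositionEmbeddings,
		"lfm2.attention.head_count":             c.NumAttentionHeads,
		"lfm2.attention.key_length":             headDim,
		"lfm2.attention.value_length":           headDim,
		"lfm2.attention.layer_norm_rms_epsilon": c.NormEps,
		"lfm2.rope.freq_base":                   c.RopeTheta,
		"lfm2.shortconv.l_cache":                c.ConvLCache,
	}, nil
}

func main() {
	// Illustrative values only, not taken from a real checkpoint.
	kv, err := buildKV(lfm2Config{HiddenSize: 2048, NumAttentionHeads: 32, NumHiddenLayers: 16})
	if err != nil {
		panic(err)
	}
	fmt.Println("key_length:", kv["lfm2.attention.key_length"])
}
```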

Special Handling

Per-Layer KV Head Count Array

The layer_types string array from the config (containing "full_attention" or other types) is converted into a per-layer uint32 array for attention.head_count_kv. Attention layers get the actual num_key_value_heads value while short convolution layers get 0, allowing the runtime to dispatch the correct operator per layer.
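A minimal Go sketch of this translation follows. The function name kvHeadsPerLayer and the "conv" label used for non-attention layers in main are assumptions; the source only states that "full_attention" marks attention layers and that every other layer receives 0.

```go
package main

import "fmt"

// kvHeadsPerLayer (hypothetical helper) turns the layer_types strings into
// the per-layer head_count_kv array: numKVHeads for "full_attention" layers,
// 0 for every other layer type.
func kvHeadsPerLayer(layerTypes []string, numKVHeads uint32) []uint32 {
	out := make([]uint32, len(layerTypes)) // zero-initialised, so conv layers stay 0
	for i, t := range layerTypes {
		if t == "full_attention" {
			out[i] = numKVHeads
		}
	}
	return out
}

func main() {
	// "conv" is an illustrative label for a short convolution layer.
	layers := []string{"conv", "conv", "full_attention", "conv"}
	fmt.Println(kvHeadsPerLayer(layers, 8))
}
```

A runtime reading this array can then treat a zero entry as a short convolution layer and a non-zero entry as an attention layer with that many KV heads.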

Convolution Weight Squeezing

Short convolution weights with shape [D, 1, K] (3D with a singleton middle dimension) are squeezed to [D, K] (2D) for GGUF compatibility.
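The shape change can be sketched as below. The helper name squeezeConvShape and the sample dimensions are hypothetical, and how the real converter selects which tensors to squeeze is not shown here. Because the removed dimension has size 1, the element count and the row-major order of the data are unchanged; only the shape metadata differs.

```go
package main

import "fmt"

// squeezeConvShape (hypothetical helper) drops a singleton middle dimension:
// [D, 1, K] becomes [D, K]. Any other shape is returned unchanged.
func squeezeConvShape(shape []uint64) []uint64 {
	if len(shape) == 3 && shape[1] == 1 {
		return []uint64{shape[0], shape[2]}
	}
	return shape
}

func main() {
	// Illustrative dimensions only.
	fmt.Println(squeezeConvShape([]uint64{2048, 1, 3}))
}
```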

Unique Normalization Naming

LFM-2 uses embedding_norm for the output normalization (instead of the typical model.norm) and operator_norm for the pre-attention normalization.

Implementation Notes

The conversion is implemented in convert/convert_lfm2.go via the lfm2Model struct. The architecture uses SwiGLU-style feed-forward networks with w1/w2/w3 naming (gate/down/up). The hybrid layer design allows the model to use cheap convolution operations for most layers while reserving expensive attention for periodic global context aggregation.
