Implementation:Ollama Ollama Imagegen GLM4 MoE Lite
| Knowledge Sources | |
|---|---|
| Domains | Image Generation, LLM Inference |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the GLM4-MoE-Lite model with Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) for MLX inference.
Description
The glm4_moe_lite.go file implements the GLM4-MoE-Lite architecture, which combines absorbed Multi-head Latent Attention (MLA), used to reduce KV cache memory, with a Mixture of Experts (MoE) that includes shared experts for efficient inference. The MLAAttention struct performs low-rank query and KV projections, runs attention in the latent space using absorbed embeddings (EmbedQ and UnembedOut, both derived from kv_b_proj), and applies RoPE only to the rope-specific dimensions. The MoE router selects the top-k experts per token, using group-based top-k selection and normalized probabilities. Config carries quantization parameters (NVFP4, INT4, INT8) and RoPE scaling with an mscale adjustment. The first k layers use a dense MLP instead of MoE for stability.
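The per-token routing step described above can be illustrated with a minimal, framework-free sketch. It is not the code from glm4_moe_lite.go: the function name routeTopK is hypothetical, the scores are assumed to be non-negative router probabilities, and the group-based pre-filter is left out so that only the top-k selection and renormalization are shown.

```go
package main

import (
	"fmt"
	"sort"
)

// routeTopK (hypothetical helper) picks the k highest-scoring experts for one
// token and renormalizes their scores so the returned weights sum to 1.
// Assumption: scores are non-negative router probabilities.
func routeTopK(scores []float32, k int) (idx []int, weights []float32) {
	if k > len(scores) {
		k = len(scores)
	}
	order := make([]int, len(scores))
	for i := range order {
		order[i] = i
	}
	// Sort expert indices by descending score; the stable sort keeps the
	// lower expert index first when two scores are equal.
	sort.SliceStable(order, func(a, b int) bool {
		return scores[order[a]] > scores[order[b]]
	})
	idx = order[:k]

	var sum float32
	for _, e := range idx {
		sum += scores[e]
	}
	weights = make([]float32, k)
	for i, e := range idx {
		if sum > 0 {
			weights[i] = scores[e] / sum
		}
	}
	return idx, weights
}

func main() {
	// Five experts, two selected per token (illustrative numbers only).
	scores := []float32{0.05, 0.40, 0.10, 0.25, 0.20}
	idx, w := routeTopK(scores, 2)
	fmt.Println("experts:", idx, "weights:", w)
}
```

In a group-based variant, experts would first be partitioned into groups and only experts from the best-scoring groups would reach this step; how groups are scored is not stated here, so that part is omitted.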
Usage
Used for text generation with GLM4-MoE-Lite models in the MLX engine, supporting thinking mode, tool calling, and efficient MoE routing.
Code Reference
Source Location
- Repository: Ollama
- File: x/imagegen/models/glm4_moe_lite/glm4_moe_lite.go
- Lines: 1-840
Signature
type Config struct {
	HiddenSize       int32 `json:"hidden_size"`
	NumHiddenLayers  int32 `json:"num_hidden_layers"`
	QLoraRank        int32 `json:"q_lora_rank"`
	KVLoraRank       int32 `json:"kv_lora_rank"`
	QKRopeHeadDim    int32 `json:"qk_rope_head_dim"`
	QKNopeHeadDim    int32 `json:"qk_nope_head_dim"`
	NRoutedExperts   int32 `json:"n_routed_experts"`
	NumExpertsPerTok int32 `json:"num_experts_per_tok"`
	NGroup           int32 `json:"n_group"`
}

type MLAAttention struct {
	QAProj     nn.LinearLayer  `weight:"self_attn.q_a_proj"`
	QBProj     nn.LinearLayer  `weight:"self_attn.q_b_proj"`
	EmbedQ     *nn.MultiLinear `weight:"-"`
	UnembedOut *nn.MultiLinear `weight:"-"`
}

func (a *MLAAttention) Forward(x *mlx.Array, c cache.Cache, B, L int32, cfg *Config) *mlx.Array
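The JSON tags on Config suggest that the fields are populated from a JSON configuration document. The sketch below shows that decoding step in isolation with the standard library. The type mlaMoEConfig and the function parseConfig are hypothetical stand-ins that mirror only the fields listed above, so the example runs without the MLX dependencies; the actual loading code in the repository may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mlaMoEConfig is a hypothetical local mirror of the Config fields shown in
// the signature; it is not the type exported by glm4_moe_lite.
type mlaMoEConfig struct {
	HiddenSize       int32 `json:"hidden_size"`
	NumHiddenLayers  int32 `json:"num_hidden_layers"`
	QLoraRank        int32 `json:"q_lora_rank"`
	KVLoraRank       int32 `json:"kv_lora_rank"`
	QKRopeHeadDim    int32 `json:"qk_rope_head_dim"`
	QKNopeHeadDim    int32 `json:"qk_nope_head_dim"`
	NRoutedExperts   int32 `json:"n_routed_experts"`
	NumExpertsPerTok int32 `json:"num_experts_per_tok"`
	NGroup           int32 `json:"n_group"`
}

// parseConfig decodes a JSON document into the mirror struct. Keys that are
// absent from the input leave the corresponding field at zero.
func parseConfig(data []byte) (*mlaMoEConfig, error) {
	var cfg mlaMoEConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	return &cfg, nil
}

func main() {
	// Illustrative values only, matching the usage example on this page.
	raw := []byte(`{
		"q_lora_rank": 1536,
		"kv_lora_rank": 512,
		"qk_rope_head_dim": 64,
		"n_routed_experts": 16,
		"num_experts_per_tok": 4
	}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *cfg)
}
```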
Import
import "github.com/ollama/ollama/x/imagegen/models/glm4_moe_lite"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
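| x | *mlx.Array | Yes | Hidden states [B, L, hidden_size] |
| c | cache.Cache | Yes | KV cache (stores latent representation) |
| B | int32 | Yes | Batch size (first dimension of x) |
| L | int32 | Yes | Sequence length (second dimension of x) |
| cfg | *Config | Yes | Model configuration with MLA and MoE parameters |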
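To make the memory effect of caching a latent representation concrete, the following back-of-the-envelope sketch compares cached elements per token per layer. It assumes the usual MLA formulation, in which the cache holds the compressed KV latent (kv_lora_rank) plus one shared RoPE key (qk_rope_head_dim); whether this file lays out its cache exactly that way is an assumption. The head count and value head dimension are hypothetical, since they do not appear on this page, and both function names are illustrative.

```go
package main

import "fmt"

// latentCacheElems returns the elements cached per token per layer under the
// assumed MLA layout: the compressed KV latent plus one shared RoPE key.
func latentCacheElems(kvLoraRank, qkRopeHeadDim int) int {
	return kvLoraRank + qkRopeHeadDim
}

// fullCacheElems returns the elements cached per token per layer by standard
// multi-head attention, which stores a full key and value for every head.
func fullCacheElems(numHeads, qkHeadDim, vHeadDim int) int {
	return numHeads * (qkHeadDim + vHeadDim)
}

func main() {
	// kv_lora_rank and qk_rope_head_dim come from the usage example.
	latent := latentCacheElems(512, 64)

	// Hypothetical: 16 heads, 128 non-RoPE + 64 RoPE key dims, 128 value dims.
	full := fullCacheElems(16, 128+64, 128)

	fmt.Printf("latent: %d, full: %d, ratio: %.2f\n",
		latent, full, float64(full)/float64(latent))
}
```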
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | *mlx.Array | Attention output [B, L, hidden_size] |
Usage Examples
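cfg := &glm4_moe_lite.Config{
	QLoraRank:        1536, // rank of the low-rank query projection
	KVLoraRank:       512,  // rank of the latent KV projection
	QKRopeHeadDim:    64,   // per-head dimensions that receive RoPE
	NRoutedExperts:   16,   // number of routed experts
	NumExpertsPerTok: 4,    // experts selected per token (top-k)
}
// attention is an *MLAAttention with loaded weights, hiddenStates has shape
// [batchSize, seqLen, hidden_size], and kvCache is the cache.Cache for this layer.
output := attention.Forward(hiddenStates, kvCache, batchSize, seqLen, cfg)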