Implementation:Ollama Ollama Imagegen GLM4 MoE Lite

From Leeroopedia
Knowledge Sources
Domains Image Generation, LLM Inference
Last Updated 2025-02-15 00:00 GMT

Overview

Implements the GLM4-MoE-Lite model with Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) for MLX inference.

Description

The glm4_moe_lite.go file implements the GLM4-MoE-Lite architecture. It combines absorbed Multi-head Latent Attention (MLA), which reduces KV cache memory, with a Mixture of Experts (MoE) that includes shared experts for efficient inference. The MLAAttention struct performs low-rank query and KV projections, computes attention in the latent space using absorbed projections (EmbedQ and UnembedOut, both derived from kv_b_proj), and applies RoPE only to the rope-specific dimensions. The MoE router selects the top-k experts per token using group-based top-k selection and normalizes the resulting probabilities. Config carries quantization parameters (NVFP4, INT4, INT8) and RoPE scaling with an mscale adjustment. The first k layers use a dense MLP instead of MoE for stability.
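The group-based top-k routing described above can be sketched in plain Go for a single token. This is a minimal illustration and not the code in glm4_moe_lite.go: the function names, the topKGroup parameter, and the choice to score each group by its best expert are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// topKIndices returns the indices of the k largest scores, considering only
// positions where allowed[i] is true (or all positions when allowed is nil).
func topKIndices(scores []float64, allowed []bool, k int) []int {
	idx := make([]int, 0, len(scores))
	for i := range scores {
		if allowed == nil || allowed[i] {
			idx = append(idx, i)
		}
	}
	sort.SliceStable(idx, func(a, b int) bool { return scores[idx[a]] > scores[idx[b]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

// routeToken is a hypothetical router for one token. It keeps the topKGroup
// best groups (assumption: a group is scored by its highest expert score),
// picks topK experts inside those groups, and normalizes their weights.
func routeToken(scores []float64, nGroup, topKGroup, topK int) ([]int, []float64) {
	perGroup := len(scores) / nGroup
	groupScore := make([]float64, nGroup)
	for g := 0; g < nGroup; g++ {
		best := scores[g*perGroup]
		for _, s := range scores[g*perGroup : (g+1)*perGroup] {
			if s > best {
				best = s
			}
		}
		groupScore[g] = best
	}

	// Mark every expert that belongs to a surviving group.
	allowed := make([]bool, len(scores))
	for _, g := range topKIndices(groupScore, nil, topKGroup) {
		for e := g * perGroup; e < (g+1)*perGroup; e++ {
			allowed[e] = true
		}
	}

	experts := topKIndices(scores, allowed, topK)
	var sum float64
	for _, e := range experts {
		sum += scores[e]
	}
	weights := make([]float64, len(experts))
	for i, e := range experts {
		weights[i] = scores[e] / sum // selected weights sum to 1
	}
	return experts, weights
}

func main() {
	// 8 experts in 4 groups of 2; scores are made-up router probabilities.
	scores := []float64{0.10, 0.40, 0.05, 0.20, 0.35, 0.38, 0.60, 0.15}
	experts, weights := routeToken(scores, 4, 2, 3)
	fmt.Println(experts, weights)
}
```

In this example expert 5 has the third-highest score overall, but its group is not among the two kept groups, so the third slot goes to expert 7 instead. That is the practical effect of restricting selection by group before taking the per-token top-k.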

Usage

Used for text generation with GLM4-MoE-Lite models in the MLX engine, supporting thinking mode, tool calling, and efficient MoE routing.

Code Reference

Source Location

  • Repository: Ollama
  • File: x/imagegen/models/glm4_moe_lite/glm4_moe_lite.go
  • Lines: 1-840

Signature

type Config struct {
	HiddenSize        int32   `json:"hidden_size"`
	NumHiddenLayers   int32   `json:"num_hidden_layers"`
	QLoraRank         int32   `json:"q_lora_rank"`
	KVLoraRank        int32   `json:"kv_lora_rank"`
	QKRopeHeadDim     int32   `json:"qk_rope_head_dim"`
	QKNopeHeadDim     int32   `json:"qk_nope_head_dim"`
	NRoutedExperts    int32   `json:"n_routed_experts"`
	NumExpertsPerTok  int32   `json:"num_experts_per_tok"`
	NGroup            int32   `json:"n_group"`
}

type MLAAttention struct {
	QAProj     nn.LinearLayer  `weight:"self_attn.q_a_proj"`
	QBProj     nn.LinearLayer  `weight:"self_attn.q_b_proj"`
	EmbedQ     *nn.MultiLinear `weight:"-"`
	UnembedOut *nn.MultiLinear `weight:"-"`
}

func (a *MLAAttention) Forward(x *mlx.Array, c cache.Cache, B, L int32, cfg *Config) *mlx.Array
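Because the Config fields carry json tags, a configuration of this shape can be decoded with the standard encoding/json package. The sketch below uses a local stand-in struct (moeLiteConfig, a hypothetical name) that mirrors only the fields listed in the signature, so it runs without the Ollama module. The JSON values are placeholders and do not describe any released model, and parseConfig is an illustrative helper rather than a function from the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// moeLiteConfig is a hypothetical local mirror of the Config fields shown
// in the signature; it is not the type exported by glm4_moe_lite.
type moeLiteConfig struct {
	HiddenSize       int32 `json:"hidden_size"`
	NumHiddenLayers  int32 `json:"num_hidden_layers"`
	QLoraRank        int32 `json:"q_lora_rank"`
	KVLoraRank       int32 `json:"kv_lora_rank"`
	QKRopeHeadDim    int32 `json:"qk_rope_head_dim"`
	QKNopeHeadDim    int32 `json:"qk_nope_head_dim"`
	NRoutedExperts   int32 `json:"n_routed_experts"`
	NumExpertsPerTok int32 `json:"num_experts_per_tok"`
	NGroup           int32 `json:"n_group"`
}

// parseConfig decodes raw JSON into the stand-in config struct.
func parseConfig(data []byte) (*moeLiteConfig, error) {
	var cfg moeLiteConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	return &cfg, nil
}

func main() {
	// Placeholder values for illustration only.
	raw := []byte(`{
		"hidden_size": 2048,
		"q_lora_rank": 1536,
		"kv_lora_rank": 512,
		"qk_rope_head_dim": 64,
		"n_routed_experts": 16,
		"num_experts_per_tok": 4,
		"n_group": 1
	}`)

	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *cfg)
}
```

Keys missing from the JSON leave the corresponding field at its zero value, so a real loader would likely validate required fields after decoding.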

Import

import "github.com/ollama/ollama/x/imagegen/models/glm4_moe_lite"

I/O Contract

Inputs

Name Type Required Description
x *mlx.Array Yes Hidden states [B, L, hidden_size]
c cache.Cache Yes KV cache (stores latent representation)
B int32 Yes Batch size
L int32 Yes Sequence length
cfg *Config Yes Model configuration with MLA and MoE parameters
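The reason a cache of latent vectors is enough for attention can be shown with a small numeric check. For one head, if the non-rope key is an up-projection of the cached latent, then the query can be projected once into latent space and scored against the latent directly. The sketch below verifies that identity with tiny hand-written matrices. The names wUK, naiveScore, and absorbedScore are hypothetical, the RoPE part of the score is omitted, and the actual layout of EmbedQ in glm4_moe_lite.go may differ.

```go
package main

import "fmt"

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// matVec returns W·v for a row-major matrix w with shape [rows][cols].
func matVec(w [][]float64, v []float64) []float64 {
	out := make([]float64, len(w))
	for i, row := range w {
		out[i] = dot(row, v)
	}
	return out
}

// matTVec returns Wᵀ·v for a row-major matrix w with shape [rows][cols].
func matTVec(w [][]float64, v []float64) []float64 {
	out := make([]float64, len(w[0]))
	for i, row := range w {
		for j, x := range row {
			out[j] += x * v[i]
		}
	}
	return out
}

// naiveScore expands the latent into a full key, then scores it: q · (W c).
func naiveScore(wUK [][]float64, q, latent []float64) float64 {
	return dot(q, matVec(wUK, latent))
}

// absorbedScore folds the up-projection into the query: (Wᵀ q) · c.
// Only the latent vector is needed from the cache.
func absorbedScore(wUK [][]float64, q, latent []float64) float64 {
	return dot(matTVec(wUK, q), latent)
}

func main() {
	wUK := [][]float64{{1, 2, 0}, {0, 1, -1}} // [nopeDim=2][latentRank=3]
	q := []float64{0.5, -1}                   // non-rope query for one head
	latent := []float64{2, 1, 3}              // cached latent for one token

	fmt.Println(naiveScore(wUK, q, latent), absorbedScore(wUK, q, latent)) // prints "4 4"
}
```

Both forms give the same score, but the absorbed form never materializes per-head keys, which is what allows the cache to hold only the latent representation.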

Outputs

Name Type Description
(unnamed return) *mlx.Array Attention output [B, L, hidden_size]

Usage Examples

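// Example values only.
cfg := &glm4_moe_lite.Config{
	QLoraRank:        1536,
	KVLoraRank:       512,
	QKRopeHeadDim:    64,
	NRoutedExperts:   16,
	NumExpertsPerTok: 4,
}

// attention (*glm4_moe_lite.MLAAttention), hiddenStates (*mlx.Array), and
// kvCache (cache.Cache) are assumed to be initialized elsewhere.
output := attention.Forward(hiddenStates, kvCache, batchSize, seqLen, cfg)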
