Implementation:Ollama Ollama Imagegen ZImage Transformer

From Leeroopedia
Knowledge Sources
Domains: Image Generation, Diffusion Models
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements the Z-Image diffusion transformer with timestep conditioning, caption embedding, QK normalization, and optional fused QKV projection.

Description

The transformer.go file defines the Z-Image transformer architecture. Its building blocks are TimestepEmbedder (a sinusoidal embedding followed by an MLP), XEmbedder for patch embedding, CapEmbedder for caption features (with RMSNorm), and multi-head attention with QK normalization. The Attention struct supports an optional fused QKV projection: FuseQKV concatenates the Q, K, and V weights so that all three projections run as a single matmul, for a 5-10% speedup on non-quantized models. FeedForward uses SwiGLU (W1 gate, W3 up, W2 down). The numbers of main layers and refiner layers are configurable. Blocks use AdaLN (Adaptive Layer Normalization) modulated by the timestep embedding: each block derives shift/scale/gate parameters from the timestep embedding for both its attention and MLP sub-layers.
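The shift/scale/gate modulation described above can be sketched in plain Go on float32 slices. This is a minimal illustration, not the code in transformer.go. It assumes the DiT-style form norm(x) * (1 + scale) + shift for the modulated input and x + gate * y for the gated residual. The helper names (rmsNorm, modulate, gatedResidual) are hypothetical, and the real file operates on mlx arrays and may differ in detail, for example in how the gate is activated.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm scales x by the inverse of its root mean square.
// The learned weight is omitted to keep the sketch short.
func rmsNorm(x []float32, eps float64) []float32 {
	var sumSq float64
	for _, v := range x {
		sumSq += float64(v) * float64(v)
	}
	inv := float32(1 / math.Sqrt(sumSq/float64(len(x))+eps))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = v * inv
	}
	return out
}

// modulate applies the assumed AdaLN form: norm(x)*(1+scale) + shift.
func modulate(x, shift, scale []float32) []float32 {
	n := rmsNorm(x, 1e-6)
	out := make([]float32, len(x))
	for i := range n {
		out[i] = n[i]*(1+scale[i]) + shift[i]
	}
	return out
}

// gatedResidual returns x + gate*y: the sub-layer output y is scaled by the
// gate before being added back to the residual stream.
func gatedResidual(x, gate, y []float32) []float32 {
	out := make([]float32, len(x))
	for i := range x {
		out[i] = x[i] + gate[i]*y[i]
	}
	return out
}

func main() {
	x := []float32{1, -2, 3, -4}
	shift := []float32{0.1, 0.1, 0.1, 0.1}
	scale := []float32{0.5, 0.5, 0.5, 0.5}
	gate := []float32{0.2, 0.2, 0.2, 0.2}

	h := modulate(x, shift, scale) // input to the attention or MLP sub-layer
	y := h                         // stand-in for the sub-layer's output
	fmt.Println(gatedResidual(x, gate, y))
}
```

With zero shift and scale the modulation reduces to plain normalization, and with a zero gate the block leaves its input unchanged.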

Usage

Used as the core denoising network in the Z-Image pipeline, processing patchified latents with timestep and caption conditioning.

Code Reference

Source Location

  • Repository: Ollama
  • File: x/imagegen/models/zimage/transformer.go
  • Lines: 1-761

Signature

type TransformerConfig struct {
	Dim            int32   `json:"dim"`
	NHeads         int32   `json:"n_heads"`
	NKVHeads       int32   `json:"n_kv_heads"`
	NLayers        int32   `json:"n_layers"`
	NRefinerLayers int32   `json:"n_refiner_layers"`
	QKNorm         bool    `json:"qk_norm"`
	AxesDims       []int32 `json:"axes_dims"`
}

type Attention struct {
	ToQ   nn.LinearLayer `weight:"to_q"`
	ToK   nn.LinearLayer `weight:"to_k"`
	ToV   nn.LinearLayer `weight:"to_v"`
	ToOut nn.LinearLayer `weight:"to_out.0"`
	NormQ *mlx.Array     `weight:"norm_q.weight"`
	NormK *mlx.Array     `weight:"norm_k.weight"`
	ToQKV nn.LinearLayer `weight:"-"` // Fused (optional)
}

func (a *Attention) FuseQKV()
func (ff *FeedForward) Forward(x *mlx.Array) *mlx.Array
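As a rough illustration of what a SwiGLU feed-forward computes, the sketch below implements W2(silu(W1 x) * (W3 x)) on plain float32 slices, following the W1 gate / W3 up / W2 down naming from the description. The function names (matVec, silu, swiGLU) and the row-major [out][in] weight layout are assumptions made for this example. The real Forward method takes and returns *mlx.Array.

```go
package main

import (
	"fmt"
	"math"
)

// matVec computes W*x for a row-major weight matrix of shape [out][in].
func matVec(w [][]float32, x []float32) []float32 {
	out := make([]float32, len(w))
	for i, row := range w {
		var sum float32
		for j, wij := range row {
			sum += wij * x[j]
		}
		out[i] = sum
	}
	return out
}

// silu is the SiLU (swish) activation: v * sigmoid(v).
func silu(v float32) float32 {
	return v / (1 + float32(math.Exp(float64(-v))))
}

// swiGLU computes W2 * (silu(W1*x) elementwise-times (W3*x)).
func swiGLU(w1, w2, w3 [][]float32, x []float32) []float32 {
	gate := matVec(w1, x) // gate projection
	up := matVec(w3, x)   // up projection
	hidden := make([]float32, len(gate))
	for i := range gate {
		hidden[i] = silu(gate[i]) * up[i]
	}
	return matVec(w2, hidden) // down projection
}

func main() {
	// Toy sizes: dim = 2, hidden = 3.
	w1 := [][]float32{{1, 0}, {0, 1}, {1, 1}}
	w3 := [][]float32{{0.5, 0}, {0, 0.5}, {1, -1}}
	w2 := [][]float32{{1, 0, 0}, {0, 1, 1}}
	x := []float32{1, 2}
	fmt.Println(swiGLU(w1, w2, w3, x))
}
```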

Import

import "github.com/ollama/ollama/x/imagegen/models/zimage"

I/O Contract

Inputs

Name | Type | Required | Description
x | *mlx.Array | Yes | Patchified latents [B, L, dim]
timestep | *mlx.Array | Yes | Timestep values [B] for conditioning
capFeats | *mlx.Array | Yes | Caption features from text encoder
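To make the timestep input more concrete, here is a minimal sketch of a sinusoidal timestep embedding for a single scalar timestep. It follows the formulation commonly used in diffusion models (geometrically spaced frequencies, cosine half followed by sine half). The function name timestepEmbedding, the maxPeriod parameter, the cos/sin ordering, and any scaling of t are assumptions, not details confirmed by transformer.go. The MLP that TimestepEmbedder applies afterwards is omitted.

```go
package main

import (
	"fmt"
	"math"
)

// timestepEmbedding maps a scalar timestep to a dim-sized vector of cosines
// and sines at geometrically spaced frequencies. If dim is odd, the last
// element is left at zero.
func timestepEmbedding(t float32, dim int, maxPeriod float64) []float32 {
	half := dim / 2
	out := make([]float32, dim)
	for i := 0; i < half; i++ {
		// Frequencies fall from 1 towards 1/maxPeriod as i grows.
		freq := math.Exp(-math.Log(maxPeriod) * float64(i) / float64(half))
		angle := float64(t) * freq
		out[i] = float32(math.Cos(angle))
		out[half+i] = float32(math.Sin(angle))
	}
	return out
}

func main() {
	emb := timestepEmbedding(0.5, 8, 10000)
	fmt.Println(len(emb), emb)
}
```

For a batch of timesteps with shape [B], the same function would be applied per element, giving a [B, dim] embedding.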

Outputs

Name | Type | Description
(return value) | *mlx.Array | Predicted velocity [B, L, dim]

Usage Examples

transformer := &zimage.Transformer{}
if err := transformer.Load(manifest); err != nil {
    return err
}

// Optional: fuse QKV for faster attention (non-quantized models only)
transformer.FuseQKV()

velocity := transformer.Forward(latents, timestep, captionFeats)
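The idea behind the optional FuseQKV call can be shown without mlx: stacking the Q, K, and V weight matrices along the output dimension lets one matmul produce all three projections, which are then sliced apart. The sketch below is only an illustration under that assumption. The names matVec, fuseRows, and projectFused are hypothetical, and the row-major [out][in] layout is chosen for the example rather than taken from the source. K and V are allowed a different output size from Q, as a separate n_kv_heads setting would imply.

```go
package main

import "fmt"

// matVec computes W*x for a row-major weight matrix of shape [out][in].
func matVec(w [][]float32, x []float32) []float32 {
	out := make([]float32, len(w))
	for i, row := range w {
		var sum float32
		for j, wij := range row {
			sum += wij * x[j]
		}
		out[i] = sum
	}
	return out
}

// fuseRows stacks weight matrices along the output dimension.
func fuseRows(ws ...[][]float32) [][]float32 {
	var fused [][]float32
	for _, w := range ws {
		fused = append(fused, w...)
	}
	return fused
}

// projectFused runs one matmul against the fused weight and slices the
// result back into q, k, and v.
func projectFused(wqkv [][]float32, x []float32, qDim, kDim int) (q, k, v []float32) {
	out := matVec(wqkv, x)
	return out[:qDim], out[qDim : qDim+kDim], out[qDim+kDim:]
}

func main() {
	wq := [][]float32{{1, 0}, {0, 1}}
	wk := [][]float32{{2, 0}}
	wv := [][]float32{{0, 3}}
	x := []float32{0.5, -1}

	wqkv := fuseRows(wq, wk, wv) // done once, after loading the weights
	q, k, v := projectFused(wqkv, x, len(wq), len(wk))

	fmt.Println(q, k, v)
	fmt.Println(matVec(wq, x), matVec(wk, x), matVec(wv, x)) // same values
}
```

The fused path returns the same values as three separate projections. Any speedup comes from launching one larger matmul instead of three smaller ones.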
