Implementation:Ollama Ollama Imagegen ZImage Transformer

Knowledge Sources	Ollama
Domains	Image Generation, Diffusion Models
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the Z-Image diffusion transformer with timestep conditioning, caption embedding, QK normalization, and optional fused QKV projection.

Description

The transformer.go file defines the Z-Image transformer architecture. Its input embedders are TimestepEmbedder (a sinusoidal embedding followed by an MLP), XEmbedder (patch embedding), and CapEmbedder (caption features, normalized with RMSNorm). Attention is multi-head with QK normalization. The Attention struct optionally supports a fused QKV projection: FuseQKV concatenates the Q, K, and V weights so that one matmul replaces three, for a 5-10% speedup on non-quantized models. FeedForward uses SwiGLU (W1 gate, W3 up, W2 down). The numbers of main layers and refiner layers are configurable, and each block is modulated by AdaLN (Adaptive Layer Normalization): it derives shift, scale, and gate parameters from the timestep embedding for both its attention and MLP sub-layers.
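The shift/scale/gate modulation described above can be illustrated with a minimal, dependency-free sketch. It assumes the common AdaLN form x + gate * f(norm(x) * (1 + scale) + shift) and a weightless RMS normalization; the function names (rmsNormalize, modulate, gatedResidual, adaLNSubLayer) are hypothetical, and the exact formulation in transformer.go may differ.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNormalize scales x to unit root-mean-square (no learned weight, for brevity).
func rmsNormalize(x []float64, eps float64) []float64 {
	var sumSq float64
	for _, v := range x {
		sumSq += v * v
	}
	inv := 1 / math.Sqrt(sumSq/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * inv
	}
	return out
}

// modulate applies a timestep-derived shift and scale to a normalized vector.
// Assumed form: xNorm*(1+scale) + shift.
func modulate(xNorm, shift, scale []float64) []float64 {
	out := make([]float64, len(xNorm))
	for i := range xNorm {
		out[i] = xNorm[i]*(1+scale[i]) + shift[i]
	}
	return out
}

// gatedResidual adds the sub-layer output back onto x, weighted by gate.
func gatedResidual(x, branch, gate []float64) []float64 {
	out := make([]float64, len(x))
	for i := range x {
		out[i] = x[i] + gate[i]*branch[i]
	}
	return out
}

// adaLNSubLayer wraps a sub-layer f (attention or MLP) with AdaLN modulation.
func adaLNSubLayer(x, shift, scale, gate []float64, f func([]float64) []float64) []float64 {
	h := modulate(rmsNormalize(x, 1e-6), shift, scale)
	return gatedResidual(x, f(h), gate)
}

func main() {
	x := []float64{1, -1}
	zero := []float64{0, 0}
	identity := func(h []float64) []float64 { return h }

	// With a zero gate the sub-layer contributes nothing and x passes through.
	fmt.Println(adaLNSubLayer(x, zero, zero, zero, identity))
}
```

In this form a zero gate turns the sub-layer into an identity mapping, which is one reason gated modulation is a popular way to inject timestep conditioning.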

Usage

Used as the core denoising network in the Z-Image pipeline, processing patchified latents with timestep and caption conditioning.

Code Reference

Source Location

Repository: Ollama
File: x/imagegen/models/zimage/transformer.go
Lines: 1-761

Signature

type TransformerConfig struct {
	Dim            int32   `json:"dim"`
	NHeads         int32   `json:"n_heads"`
	NKVHeads       int32   `json:"n_kv_heads"`
	NLayers        int32   `json:"n_layers"`
	NRefinerLayers int32   `json:"n_refiner_layers"`
	QKNorm         bool    `json:"qk_norm"`
	AxesDims       []int32 `json:"axes_dims"`
}

type Attention struct {
	ToQ   nn.LinearLayer `weight:"to_q"`
	ToK   nn.LinearLayer `weight:"to_k"`
	ToV   nn.LinearLayer `weight:"to_v"`
	ToOut nn.LinearLayer `weight:"to_out.0"`
	NormQ *mlx.Array     `weight:"norm_q.weight"`
	NormK *mlx.Array     `weight:"norm_k.weight"`
	ToQKV nn.LinearLayer `weight:"-"` // Fused (optional)
}

func (a *Attention) FuseQKV()
func (ff *FeedForward) Forward(x *mlx.Array) *mlx.Array
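
As a rough illustration of what a SwiGLU feed-forward with the W1 gate / W3 up / W2 down layout computes, the sketch below uses plain float64 slices instead of mlx arrays. The helper names (silu, linear, swiGLU) are hypothetical and biases are omitted as an assumption; it is not the code in transformer.go.

```go
package main

import (
	"fmt"
	"math"
)

// silu is the SiLU (swish) activation: x * sigmoid(x).
func silu(x float64) float64 {
	return x / (1 + math.Exp(-x))
}

// linear computes y = W·x for a weight matrix W of shape [out][in] (no bias).
func linear(w [][]float64, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// swiGLU computes W2 · (silu(W1·x) * (W3·x)), with an elementwise product.
func swiGLU(w1, w3, w2 [][]float64, x []float64) []float64 {
	gate := linear(w1, x) // W1: gate projection
	up := linear(w3, x)   // W3: up projection
	hidden := make([]float64, len(gate))
	for i := range gate {
		hidden[i] = silu(gate[i]) * up[i]
	}
	return linear(w2, hidden) // W2: down projection
}

func main() {
	// Toy weights: input dim 2, hidden dim 3, output dim 2.
	w1 := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	w3 := [][]float64{{0.5, 0}, {0, 0.5}, {1, -1}}
	w2 := [][]float64{{1, 0, 1}, {0, 1, 1}}
	fmt.Println(swiGLU(w1, w3, w2, []float64{1, 2}))
}
```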

Import

import "github.com/ollama/ollama/x/imagegen/models/zimage"

I/O Contract

Inputs

Name	Type	Required	Description
x	*mlx.Array	Yes	Patchified latents [B, L, dim]
timestep	*mlx.Array	Yes	Timestep values [B] for conditioning
capFeats	*mlx.Array	Yes	Caption features from text encoder

Outputs

Name	Type	Description
(return value)	*mlx.Array	Predicted velocity [B, L, dim]
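
To make the timestep input more concrete, here is a minimal sketch of a sinusoidal embedding of the kind that typically precedes the MLP in a timestep embedder. The cos-then-sin ordering, the maxPeriod value of 10000, and the function name sinusoidalEmbedding are assumptions borrowed from common diffusion transformer implementations, not details confirmed from transformer.go.

```go
package main

import (
	"fmt"
	"math"
)

// sinusoidalEmbedding maps a scalar timestep t to a vector of 2*(dim/2) features.
// Frequencies decay geometrically from 1 down towards 1/maxPeriod.
func sinusoidalEmbedding(t float64, dim int, maxPeriod float64) []float64 {
	half := dim / 2
	emb := make([]float64, 2*half)
	for i := 0; i < half; i++ {
		freq := math.Exp(-math.Log(maxPeriod) * float64(i) / float64(half))
		emb[i] = math.Cos(t * freq)      // first half: cosines
		emb[half+i] = math.Sin(t * freq) // second half: sines
	}
	return emb
}

func main() {
	// At t = 0 every cosine is 1 and every sine is 0.
	fmt.Println(sinusoidalEmbedding(0, 4, 10000)) // prints [1 1 0 0]
}
```

The resulting vector would then be passed through an MLP to produce the conditioning signal used by each block.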

Usage Examples

transformer := &zimage.Transformer{}
if err := transformer.Load(manifest); err != nil {
    return err
}

// Optional: fuse QKV for faster attention (non-quantized models only)
transformer.FuseQKV()

velocity := transformer.Forward(latents, timestep, captionFeats)
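
The idea behind the FuseQKV call above can be checked with a small, dependency-free sketch: stacking the Q, K, and V weight matrices along the output dimension and running a single matrix-vector product yields the same values as three separate projections, after slicing the result. The Matrix type and the helpers matVec, fuseRows, and approxEqual are hypothetical names for illustration; the real implementation operates on mlx arrays.

```go
package main

import "fmt"

// Matrix is a dense row-major weight matrix of shape [out][in].
type Matrix [][]float64

// matVec computes y = W·x.
func matVec(w Matrix, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// fuseRows stacks weight matrices along the output dimension.
// The inputs may have different row counts (e.g. fewer KV heads than Q heads).
func fuseRows(ws ...Matrix) Matrix {
	var fused Matrix
	for _, w := range ws {
		fused = append(fused, w...)
	}
	return fused
}

// approxEqual reports whether two vectors match within a small tolerance.
func approxEqual(a, b []float64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		d := a[i] - b[i]
		if d < 0 {
			d = -d
		}
		if d > 1e-12 {
			return false
		}
	}
	return true
}

func main() {
	wq := Matrix{{1, 0}, {0, 1}}
	wk := Matrix{{2, 0}, {0, 2}}
	wv := Matrix{{0, 1}, {1, 0}}
	x := []float64{3, 4}

	// One fused product, then slice the output back into Q, K, V.
	qkv := matVec(fuseRows(wq, wk, wv), x)
	q, k, v := qkv[0:2], qkv[2:4], qkv[4:6]

	fmt.Println(q, k, v)                      // prints [3 4] [6 8] [4 3]
	fmt.Println(approxEqual(q, matVec(wq, x))) // prints true
}
```

The arithmetic is identical either way; any speedup comes from launching one larger matmul instead of three smaller ones.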

Related Pages

Principle:Ollama_Ollama_ImageGeneration
