Implementation:Ollama Ollama Imagegen ZImage Transformer
| Knowledge Sources | |
|---|---|
| Domains | Image Generation, Diffusion Models |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the Z-Image diffusion transformer with timestep conditioning, caption embedding, QK normalization, and optional fused QKV projection.
Description
The transformer.go file defines the Z-Image transformer architecture. Its building blocks are TimestepEmbedder (a sinusoidal embedding followed by an MLP), XEmbedder (patch embedding), CapEmbedder (caption features, with RMSNorm), and multi-head attention with QK normalization. The Attention struct supports an optional fused QKV projection: FuseQKV concatenates the Q, K, and V weights so that a single matmul replaces three, for a 5-10% speedup on non-quantized models. FeedForward uses SwiGLU, with W1 as the gate projection, W3 as the up projection, and W2 as the down projection. The transformer has a configurable number of main layers and refiner layers, with AdaLN (Adaptive Layer Normalization) modulated by the timestep embedding: each block derives shift, scale, and gate parameters from the timestep embedding for both its attention and MLP sub-layers.
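QK normalization is commonly implemented as an RMSNorm over each head's query and key vectors before the attention scores are computed. The following is a minimal pure-Go sketch under that assumption. It does not use MLX, and the helper names (`rmsNorm`, `dot`), the `eps` value, and the sample vectors are illustrative rather than taken from transformer.go.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm divides x by its root mean square and then applies a learned
// per-channel weight (the role played by NormQ / NormK in the struct).
func rmsNorm(x, weight []float64, eps float64) []float64 {
	var sumSq float64
	for _, v := range x {
		sumSq += v * v
	}
	inv := 1 / math.Sqrt(sumSq/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * inv * weight[i]
	}
	return out
}

// dot returns the inner product of a and b (equal lengths assumed).
func dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	ones := []float64{1, 1, 1, 1}
	// Two vectors with very different magnitudes.
	q := rmsNorm([]float64{10, -20, 30, -40}, ones, 1e-6)
	k := rmsNorm([]float64{0.1, 0.2, 0.3, 0.4}, ones, 1e-6)
	// After normalization both have unit RMS, so the attention logit
	// reflects direction rather than raw magnitude.
	fmt.Println(dot(q, k))
}
```

Normalizing queries and keys this way bounds the size of the attention logits, which is the usual motivation for QK normalization in large transformers.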
Usage
Used as the core denoising network in the Z-Image pipeline, processing patchified latents with timestep and caption conditioning.
Code Reference
Source Location
- Repository: Ollama
- File: x/imagegen/models/zimage/transformer.go
- Lines: 1-761
Signature
type TransformerConfig struct {
	Dim            int32   `json:"dim"`
	NHeads         int32   `json:"n_heads"`
	NKVHeads       int32   `json:"n_kv_heads"`
	NLayers        int32   `json:"n_layers"`
	NRefinerLayers int32   `json:"n_refiner_layers"`
	QKNorm         bool    `json:"qk_norm"`
	AxesDims       []int32 `json:"axes_dims"`
}

type Attention struct {
	ToQ   nn.LinearLayer `weight:"to_q"`
	ToK   nn.LinearLayer `weight:"to_k"`
	ToV   nn.LinearLayer `weight:"to_v"`
	ToOut nn.LinearLayer `weight:"to_out.0"`
	NormQ *mlx.Array     `weight:"norm_q.weight"`
	NormK *mlx.Array     `weight:"norm_k.weight"`
	ToQKV nn.LinearLayer `weight:"-"` // Fused (optional)
}

func (a *Attention) FuseQKV()

func (ff *FeedForward) Forward(x *mlx.Array) *mlx.Array
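To make the FeedForward.Forward signature more concrete, here is a minimal pure-Go sketch of a SwiGLU block in its standard form, `W2(silu(W1 x) * W3 x)`. It assumes bias-free projections and row-major `[out][in]` weights. The helpers `silu`, `linear`, and `swiGLU` and the toy weights are hypothetical and do not reproduce the MLX code in transformer.go.

```go
package main

import (
	"fmt"
	"math"
)

// silu is the SiLU (swish) activation: x * sigmoid(x).
func silu(x float64) float64 {
	return x / (1 + math.Exp(-x))
}

// linear computes y = W x for a weight matrix W of shape [out][in], no bias.
func linear(w [][]float64, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// swiGLU applies the gate (w1), up (w3), and down (w2) projections:
// down(silu(gate(x)) * up(x)), with an element-wise product in between.
func swiGLU(w1, w3, w2 [][]float64, x []float64) []float64 {
	gate := linear(w1, x)
	up := linear(w3, x)
	hidden := make([]float64, len(gate))
	for i := range gate {
		hidden[i] = silu(gate[i]) * up[i]
	}
	return linear(w2, hidden)
}

func main() {
	w1 := [][]float64{{1, 0}, {0, 1}, {1, 1}}  // gate: 2 -> 3
	w3 := [][]float64{{1, 0}, {0, 1}, {1, -1}} // up:   2 -> 3
	w2 := [][]float64{{1, 1, 1}, {1, -1, 0}}   // down: 3 -> 2
	fmt.Println(swiGLU(w1, w3, w2, []float64{0.5, -0.25}))
}
```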
Import
import "github.com/ollama/ollama/x/imagegen/models/zimage"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | *mlx.Array | Yes | Patchified latents [B, L, dim] |
| timestep | *mlx.Array | Yes | Timestep values [B] for conditioning |
| capFeats | *mlx.Array | Yes | Caption features from text encoder |
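The timestep input is first turned into a sinusoidal embedding before the MLP in TimestepEmbedder. The sketch below shows one conventional way to compute such an embedding in plain Go. The cosine-then-sine ordering, the `maxPeriod` of 10000, and the function name `timestepEmbedding` are assumptions borrowed from common diffusion-transformer practice, not details confirmed from transformer.go.

```go
package main

import (
	"fmt"
	"math"
)

// timestepEmbedding returns a sinusoidal embedding of length dim (dim must
// be even). The first half holds cosines and the second half sines, with
// frequencies decaying geometrically from 1 toward 1/maxPeriod.
func timestepEmbedding(t float64, dim int, maxPeriod float64) []float64 {
	half := dim / 2
	emb := make([]float64, dim)
	for i := 0; i < half; i++ {
		freq := math.Exp(-math.Log(maxPeriod) * float64(i) / float64(half))
		emb[i] = math.Cos(t * freq)
		emb[half+i] = math.Sin(t * freq)
	}
	return emb
}

func main() {
	emb := timestepEmbedding(0.5, 8, 10000)
	fmt.Println(len(emb), emb)
}
```

In a batched setting the same computation would be applied to each of the B timestep values, giving one embedding row per sample.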
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | *mlx.Array | Predicted velocity [B, L, dim] |
Usage Examples
transformer := &zimage.Transformer{}
if err := transformer.Load(manifest); err != nil {
	return err
}

// Optional: fuse QKV for faster attention (non-quantized models only)
transformer.FuseQKV()

// Predict the velocity for the current latents, timestep, and caption features
velocity := transformer.Forward(latents, timestep, captionFeats)
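Why fusing is safe can be shown without MLX. The sketch below is a plain-Go illustration, not the code in transformer.go: it assumes row-major `[out][in]` weights without biases and uses hypothetical helpers `matVec` and `fuseRows`. Stacking the Q, K, and V weights along the output dimension and running one product yields the same values as three separate products, which can then be sliced apart.

```go
package main

import "fmt"

// matVec computes y = W x for a weight matrix W of shape [out][in].
func matVec(w [][]float32, x []float32) []float32 {
	y := make([]float32, len(w))
	for i, row := range w {
		var sum float32
		for j, v := range row {
			sum += v * x[j]
		}
		y[i] = sum
	}
	return y
}

// fuseRows stacks weight matrices along the output dimension,
// producing one matrix whose rows are the rows of each input in order.
func fuseRows(ws ...[][]float32) [][]float32 {
	var fused [][]float32
	for _, w := range ws {
		fused = append(fused, w...)
	}
	return fused
}

func main() {
	wq := [][]float32{{1, 0}, {0, 1}}
	wk := [][]float32{{2, 0}, {0, 2}}
	wv := [][]float32{{1, 1}, {1, -1}}
	x := []float32{3, 4}

	// One product over the fused weights, then slice into Q, K, V.
	qkv := matVec(fuseRows(wq, wk, wv), x)
	q, k, v := qkv[0:2], qkv[2:4], qkv[4:6]
	fmt.Println(q, k, v) // prints "[3 4] [6 8] [7 -1]"
}
```

The arithmetic is unchanged; any gain comes from launching one larger matmul instead of three smaller ones.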