Principle: Alibaba MNN Diffusion ONNX Export
| Field | Value |
|---|---|
| principle_name | Diffusion_ONNX_Export |
| schema_version | 0.3.0 |
| principle_type | Workflow Step |
| domain | Stable Diffusion Deployment |
| stage | Model Export |
| scope | Converting PyTorch Stable Diffusion pipeline components to ONNX intermediate representation |
| last_updated | 2026-02-10 14:00 GMT |
Overview
Diffusion ONNX Export is the second step in the Stable Diffusion deployment workflow with MNN. Before models can be converted to the MNN format, they must first be exported from their native PyTorch representation into ONNX (Open Neural Network Exchange), a vendor-neutral intermediate format. This step bridges the gap between the HuggingFace/diffusers ecosystem and the MNN inference engine.
Theory
ONNX serves as a standardized intermediate representation (IR) for neural networks. Exporting to ONNX involves:
- Tracing the computation graph: PyTorch models are traced with representative dummy inputs using `torch.onnx.export`. This records all operations into a static graph that ONNX can represent.
- Operator set (opset) selection: The ONNX opset version determines which operators are available. The default opset is 14, which provides sufficient coverage for all Stable Diffusion operations. Higher opsets (e.g., 18) may be specified for newer operator support.
- Component-by-component export: Each pipeline component is exported as a separate ONNX model because the components have different input/output signatures and are invoked at different stages of inference:
  - text_encoder -- Accepts `input_ids` (int32 token tensor); produces `last_hidden_state` and `pooler_output`.
  - unet -- Accepts `sample` (latent noise tensor), `timestep` (int32 diffusion step), and `encoder_hidden_states` (text embeddings); produces `out_sample` (denoised latent).
  - vae_encoder -- Accepts `sample` (pixel-space image); produces `latent_sample`.
  - vae_decoder -- Accepts `latent_sample` (latent tensor); produces `sample` (pixel-space image).
- External data format for large models: The UNet exceeds 2 GB, the protobuf size limit for a single ONNX file, so its weights are stored externally using ONNX's external data format. All external tensor files are collated into a single `weights.pb` for cleaner organization.
- Optional float16 export: When `--fp16` is specified, models are loaded and traced in float16 precision, reducing file size and enabling faster inference on hardware with native FP16 support. This requires a CUDA-capable GPU.
Static Shape Export
The MNN export script uses static shapes (no dynamic axes) for all components. This means:
- The text encoder always expects a fixed-length token sequence (typically 77 tokens, padded).
- The UNet always expects a fixed spatial resolution (e.g., 64x64 latent for 512x512 output).
- The VAE encoder/decoder expect the corresponding fixed spatial dimensions.
Static shapes enable more aggressive optimization during MNN conversion.
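The fixed shapes for a 512x512 export can be derived arithmetically. The sketch below assumes typical Stable Diffusion 1.x values (VAE downsampling factor of 8, 4 latent channels, 768-dimensional text embeddings); only the 77-token length and the 64x64 latent resolution are stated in this document.

```python
# Deriving the static shapes baked into a 512x512 export.
# Assumptions: SD 1.x constants (VAE factor 8, 4 latent channels,
# 768-dim embeddings); not all are stated in the source text.
OUTPUT_HW = 512      # pixel-space output resolution
VAE_FACTOR = 8       # VAE spatial downsampling factor
LATENT_CHANNELS = 4  # latent channels
TOKENS = 77          # fixed (padded) token sequence length
EMBED_DIM = 768      # text embedding dimension

latent_hw = OUTPUT_HW // VAE_FACTOR  # 512 // 8 = 64

shapes = {
    "text_encoder/input_ids": (1, TOKENS),
    "unet/sample": (1, LATENT_CHANNELS, latent_hw, latent_hw),
    "unet/encoder_hidden_states": (1, TOKENS, EMBED_DIM),
    "vae_encoder/sample": (1, 3, OUTPUT_HW, OUTPUT_HW),
    "vae_decoder/latent_sample": (1, LATENT_CHANNELS, latent_hw, latent_hw),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

Changing the target output resolution requires re-exporting with new dummy-input shapes, since every shape above is frozen into the graph.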