Principle: Alibaba MNN Diffusion ONNX Export
| Field | Value |
|---|---|
| principle_name | Diffusion_ONNX_Export |
| schema_version | 0.3.0 |
| principle_type | Workflow Step |
| domain | Stable Diffusion Deployment |
| stage | Model Export |
| scope | Converting PyTorch Stable Diffusion pipeline components to ONNX intermediate representation |
| last_updated | 2026-02-10 14:00 GMT |
Overview
Diffusion ONNX Export is the second step in the Stable Diffusion deployment workflow with MNN. Before models can be converted to the MNN format, they must first be exported from their native PyTorch representation into ONNX (Open Neural Network Exchange), a vendor-neutral intermediate format. This step bridges the gap between the HuggingFace/diffusers ecosystem and the MNN inference engine.
Theory
ONNX serves as a standardized intermediate representation (IR) for neural networks. Exporting to ONNX involves:
- Tracing the computation graph: PyTorch models are traced with representative dummy inputs using `torch.onnx.export`. This records all operations into a static graph that ONNX can represent.
- Operator set (opset) selection: The ONNX opset version determines which operators are available. The default opset is 14, which provides sufficient coverage for all Stable Diffusion operations. Higher opsets (e.g., 18) may be specified for newer operator support.
- Component-by-component export: Each pipeline component is exported as a separate ONNX model because the components have different input/output signatures and are invoked at different stages of inference:
  - text_encoder -- Accepts `input_ids` (int32 token tensor); produces `last_hidden_state` and `pooler_output`.
  - unet -- Accepts `sample` (latent noise tensor), `timestep` (int32 diffusion step), and `encoder_hidden_states` (text embeddings); produces `out_sample` (denoised latent).
  - vae_encoder -- Accepts `sample` (pixel-space image); produces `latent_sample`.
  - vae_decoder -- Accepts `latent_sample` (latent tensor); produces `sample` (pixel-space image).
- External data format for large models: The UNet exceeds 2 GB, the protobuf size limit for a single ONNX file, so its weights are stored externally using ONNX's external data format. All external tensor files are collated into a single `weights.pb` for cleaner organization.
- Optional float16 export: When `--fp16` is specified, models are loaded and traced in float16 precision, reducing file size and enabling faster inference on hardware with native FP16 support. This requires a CUDA-capable GPU.
Static Shape Export
The MNN export script uses static shapes (no dynamic axes) for all components. This means:
- The text encoder always expects a fixed-length token sequence (typically 77 tokens, padded).
- The UNet always expects a fixed spatial resolution (e.g., 64x64 latent for 512x512 output).
- The VAE encoder/decoder expect the corresponding fixed spatial dimensions.
Static shapes enable more aggressive optimization during MNN conversion.
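The fixed shapes for a 512x512 export can be derived arithmetically. The sketch below assumes typical Stable Diffusion 1.x values (VAE downsampling factor of 8, 4 latent channels, 768-dimensional text embeddings); only the 77-token length and the 64x64 latent resolution are stated in this document.

```python
# Deriving the static shapes baked into a 512x512 export.
# Assumptions: SD 1.x constants (VAE factor 8, 4 latent channels,
# 768-dim embeddings); not all are stated in the source text.
OUTPUT_HW = 512      # pixel-space output resolution
VAE_FACTOR = 8       # VAE spatial downsampling factor
LATENT_CHANNELS = 4  # latent channels
TOKENS = 77          # fixed (padded) token sequence length
EMBED_DIM = 768      # text embedding dimension

latent_hw = OUTPUT_HW // VAE_FACTOR  # 512 // 8 = 64

shapes = {
    "text_encoder/input_ids": (1, TOKENS),
    "unet/sample": (1, LATENT_CHANNELS, latent_hw, latent_hw),
    "unet/encoder_hidden_states": (1, TOKENS, EMBED_DIM),
    "vae_encoder/sample": (1, 3, OUTPUT_HW, OUTPUT_HW),
    "vae_decoder/latent_sample": (1, LATENT_CHANNELS, latent_hw, latent_hw),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

Changing the target output resolution requires re-exporting with new dummy-input shapes, since every shape above is frozen into the graph.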