
Principle:Alibaba MNN Diffusion ONNX Export

From Leeroopedia


Field Value
principle_name Diffusion_ONNX_Export
schema_version 0.3.0
principle_type Workflow Step
domain Stable Diffusion Deployment
stage Model Export
scope Converting PyTorch Stable Diffusion pipeline components to ONNX intermediate representation
last_updated 2026-02-10 14:00 GMT

Overview

Diffusion ONNX Export is the second step in the Stable Diffusion deployment workflow with MNN. Before models can be converted to the MNN format, they must first be exported from their native PyTorch representation into ONNX (Open Neural Network Exchange), a vendor-neutral intermediate format. This step bridges the gap between the HuggingFace/diffusers ecosystem and the MNN inference engine.

Theory

ONNX serves as a standardized intermediate representation (IR) for neural networks. Exporting to ONNX involves:

  • Tracing the computation graph: PyTorch models are traced with representative dummy inputs using torch.onnx.export. This records all operations into a static graph that ONNX can represent.
  • Operator set (opset) selection: The ONNX opset version determines which operators are available. The default opset is 14, which provides sufficient coverage for all Stable Diffusion operations. Higher opsets (e.g., 18) may be specified for newer operator support.
  • Component-by-component export: Each pipeline component is exported as a separate ONNX model because they have different input/output signatures and are invoked at different stages of inference:
    • text_encoder -- Accepts input_ids (int32 token tensor), produces last_hidden_state and pooler_output.
    • unet -- Accepts sample (latent noise tensor), timestep (int32 diffusion step), and encoder_hidden_states (text embeddings); produces out_sample (denoised latent).
    • vae_encoder -- Accepts sample (pixel-space image), produces latent_sample.
    • vae_decoder -- Accepts latent_sample (latent tensor), produces sample (pixel-space image).
  • External data format for large models: The UNet exceeds 2 GB, the per-message size limit of the protobuf serialization that ONNX files use. Its weights are therefore stored outside the .onnx graph file using ONNX's external data format, with all external tensors collated into a single weights.pb for cleaner organization.
  • Optional float16 export: When --fp16 is specified, models are loaded and traced in float16 precision, reducing file size and enabling faster inference on hardware with native FP16 support. This requires a CUDA-capable GPU.

Static Shape Export

The MNN export script uses static shapes (no dynamic axes) for all components. This means:

  • The text encoder always expects a fixed-length token sequence (typically 77 tokens, padded).
  • The UNet always expects a fixed spatial resolution (e.g., 64x64 latent for 512x512 output).
  • The VAE encoder/decoder expect the corresponding fixed spatial dimensions.

Static shapes enable more aggressive optimization during MNN conversion.
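As a concrete instance of the fixed-shape convention, the latent resolution is the output resolution divided by the VAE's 8x spatial downsampling factor. The helper below is illustrative, not part of the export script:

```python
# Illustrative helper (not part of the MNN export script): compute the
# static latent shape the UNet is traced with for a given output size.
# Stable Diffusion's VAE downsamples spatially by 8x and the latent
# space has 4 channels.
def latent_shape(height, width, latent_channels=4, vae_scale=8):
    return (1, latent_channels, height // vae_scale, width // vae_scale)

print(latent_shape(512, 512))  # -> (1, 4, 64, 64): the UNet's fixed input
```

Changing the target output resolution therefore requires re-exporting every spatially sized component with the new fixed dimensions.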
