Implementation: Alibaba MNN Diffusion Demo CLI
| Field | Value |
|---|---|
| implementation_name | Diffusion_Demo_CLI |
| schema_version | 0.3.0 |
| impl_type | API Doc |
| domain | Stable Diffusion Deployment |
| stage | Inference Execution |
| source_file | transformers/diffusion/engine/diffusion_demo.cpp (L8-74) |
| external_deps | libMNN (compiled with diffusion support) |
| last_updated | 2026-02-10 14:00 GMT |
Summary
This implementation documents the diffusion_demo CLI binary, which executes the full Stable Diffusion text-to-image pipeline using converted MNN models. The binary accepts 8 positional arguments specifying the model path, model type, memory mode, backend type, iteration count, random seed, output image path, and prompt text.
API
```shell
./diffusion_demo <resource_path> <model_type> <memory_mode> <backend_type> <iteration_num> <random_seed> <output_image> <prompt_text>
```
Key Parameters
| Parameter | Position | Type | Description | Valid Values |
|---|---|---|---|---|
| resource_path | argv[1] | string | Path to directory containing MNN model files and tokenizer | Directory path |
| model_type | argv[2] | int | Diffusion model variant (cast to DiffusionModelType) | 0 = STABLE_DIFFUSION_1_5, 1 = STABLE_DIFFUSION_TAIYI_CHINESE, 2 = SANA_DIFFUSION |
| memory_mode | argv[3] | int | Memory management strategy | 0 = memory saving (2GB+), 1 = memory enough (fast), 2 = balance |
| backend_type | argv[4] | int | Hardware backend (cast to MNNForwardType) | 0 = CPU, 3 = OpenCL, 4 = Metal |
| iteration_num | argv[5] | int | Number of denoising iterations | Typically 10-20 |
| random_seed | argv[6] | int | Random seed for noise initialization | -1 for random, or any positive integer for reproducibility |
| output_image | argv[7] | string | Output image file path | e.g., output.jpg, result.png |
| prompt_text | argv[8+] | string(s) | Text prompt for image generation (multiple words are joined with spaces) | Any text string |
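The prompt-joining behavior in the last row can be mimicked in plain shell. The `join_prompt` helper below is hypothetical (not part of the MNN tooling); it drops the first seven positional arguments, then joins the rest with spaces, just as the demo joins argv[8] onward:

```shell
# Hypothetical helper mimicking the CLI's prompt handling:
# skip the first seven positional arguments, join the remainder with spaces.
join_prompt() {
  shift 7
  echo "$*"
}

# Unquoted words land in argv[8], argv[9], ... and are re-joined:
join_prompt ./mnn_sd15 0 0 0 20 42 output.jpg a beautiful sunset
# -> a beautiful sunset
```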
Inputs
An MNN model directory (the `resource_path`) containing:
- `text_encoder.mnn` -- converted CLIP text encoder
- `unet.mnn` -- converted UNet denoising network
- `vae_encoder.mnn` -- converted VAE encoder (for img2img)
- `vae_decoder.mnn` -- converted VAE decoder
- Tokenizer vocabulary files (e.g., `vocab.json`, `merges.txt`)
Outputs
- A generated image file (JPEG or PNG) at the specified `output_image` path
- Progress percentage printed to stdout via the progress callback
Core Code Flow
```cpp
int main(int argc, const char* argv[]) {
    if (argc < 9) {
        MNN_PRINT("Usage: ./diffusion_demo <resource_path> <model_type> <memory_mode> "
                  "<backend_type> <iteration_num> <random_seed> <output_image_name> <prompt_text>\n");
        return 0;
    }
    auto resource_path = argv[1];
    auto model_type    = (DiffusionModelType)atoi(argv[2]);
    auto memory_mode   = atoi(argv[3]);
    auto backend_type  = (MNNForwardType)atoi(argv[4]);
    auto iteration_num = atoi(argv[5]);
    auto random_seed   = atoi(argv[6]);
    auto img_name      = argv[7];

    // Join remaining arguments as prompt text
    std::string input_text;
    for (int i = 8; i < argc; ++i) {
        input_text += argv[i];
        if (i < argc - 1) input_text += " ";
    }

    // Create diffusion pipeline via factory method
    std::unique_ptr<Diffusion> diffusion(
        Diffusion::createDiffusion(resource_path, model_type, backend_type, memory_mode));

    // Load model components
    diffusion->load();

    // Run inference with progress callback
    auto progressDisplay = [](int progress) {
        std::cout << "Progress: " << progress << "%" << std::endl;
    };
    diffusion->run(input_text, img_name, iteration_num, random_seed, progressDisplay);
    return 0;
}
```
Factory Method and Class Hierarchy
The Diffusion::createDiffusion static factory method selects the appropriate implementation based on model_type:
```cpp
// From diffusion.hpp
static Diffusion* createDiffusion(std::string modelPath, DiffusionModelType modelType,
                                  MNNForwardType backendType, int memoryMode);

// From diffusion.cpp -- returns StableDiffusion or SanaDiffusion
Diffusion* Diffusion::createDiffusion(std::string modelPath, DiffusionModelType modelType,
                                      MNNForwardType backendType, int memoryMode) {
    if (modelType == SANA_DIFFUSION) {
        return new SanaDiffusion(modelPath, modelType, backendType, memoryMode);
    } else {
        return new StableDiffusion(modelPath, modelType, backendType, memoryMode);
    }
}
```
The DiffusionModelType enum is defined as:
```cpp
typedef enum {
    STABLE_DIFFUSION_1_5           = 0,
    STABLE_DIFFUSION_TAIYI_CHINESE = 1,
    SANA_DIFFUSION                 = 2,
    DIFFUSION_MODEL_USER
} DiffusionModelType;
```
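When scripting invocations, the enum values above can be mapped from readable names. The `model_type_id` helper below is illustrative only (an assumption, not part of the MNN distribution):

```shell
# Hypothetical mapping of human-readable model names to DiffusionModelType values
model_type_id() {
  case "$1" in
    sd15)  echo 0 ;;  # STABLE_DIFFUSION_1_5
    taiyi) echo 1 ;;  # STABLE_DIFFUSION_TAIYI_CHINESE
    sana)  echo 2 ;;  # SANA_DIFFUSION
    *)     echo "unknown model: $1" >&2; return 1 ;;
  esac
}

model_type_id taiyi
# -> 1
```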
Usage Examples
Generate an image with SD v1.5 on CPU:

```shell
./diffusion_demo ./mnn_sd15 0 0 0 20 42 output.jpg "a beautiful sunset over the ocean"
```

Generate with the Taiyi Chinese model on an OpenCL GPU (the prompt means "a cute kitten playing in a garden"):

```shell
./diffusion_demo ./mnn_taiyi 1 1 3 15 -1 result.png "一只可爱的猫咪在花园里玩耍"
```

Generate with SD v1.5 on Metal GPU (macOS) in balanced memory mode:

```shell
./diffusion_demo ./mnn_sd15 0 2 4 20 123 photo.jpg "a photorealistic portrait of a woman"
```
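To sweep several seeds, the demo can be wrapped in a plain shell loop. The sketch below is a dry run that only echoes each command (remove the leading `echo` to actually generate images) and assumes the `./mnn_sd15` model directory from the examples above:

```shell
# Dry run: print the command for each seed instead of executing it.
for seed in 1 2 3; do
  echo ./diffusion_demo ./mnn_sd15 0 1 0 20 "$seed" "out_${seed}.jpg" "a beautiful sunset"
done
```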
Multi-Generation Pattern
The source code includes a disabled loop (guarded by `while (0)`) demonstrating how to generate multiple images in sequence:

```cpp
// For multiple generations:
// - Memory saving mode (0): call diffusion->load() before each run
// - Memory enough mode (1): only load once, then call run() repeatedly
while (0) {
    if (memory_mode != 1) {
        diffusion->load();
    }
    diffusion->run("a big horse", "demo_2.jpg", 20, 42, progressDisplay);
}
```
Notes
- The prompt text is constructed by joining all arguments from position 8 onward with spaces, so the prompt does not need to be quoted on the command line (though quoting is recommended for special characters).
- The `progressDisplay` lambda callback prints percentage progress to stdout during the denoising loop.
- The binary requires at least 9 arguments (the `argc < 9` check); if fewer are provided, it prints a usage message and exits.
- Memory mode 0 is recommended for devices with limited RAM (as little as 2 GB); mode 1 is recommended for desktop/server environments with ample memory.
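The memory-mode choice above can be scripted with a small heuristic. The 4 GB threshold below is an assumption for illustration, not an MNN recommendation:

```shell
# Hypothetical heuristic: pick a memory mode from free RAM in MB.
# Under 4096 MB -> mode 0 (memory saving); otherwise -> mode 1 (memory enough).
pick_memory_mode() {
  if [ "$1" -lt 4096 ]; then echo 0; else echo 1; fi
}

pick_memory_mode 2048   # -> 0
pick_memory_mode 16384  # -> 1
```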