Workflow: Alibaba MNN LLM Deployment Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Deployment, On_Device_AI |
| Last Updated | 2026-02-10 08:00 GMT |
Overview
End-to-end process for deploying large language models (LLMs) for on-device inference using MNN-LLM, covering model export from PyTorch/Hugging Face checkpoints, engine compilation, runtime configuration, and interactive text generation.
Description
This workflow covers the complete pipeline for taking a Hugging Face or ModelScope LLM (such as Qwen, Llama, Baichuan, DeepSeek, or similar transformer models) and deploying it for efficient on-device inference using MNN's LLM runtime. The process involves exporting the model to MNN format using the llmexport tool (which handles tokenizer extraction, embedding export, and weight quantization), compiling the MNN engine with LLM support macros, configuring runtime parameters (hardware backend, sampling strategy, KV cache, memory management), and running inference through the llm_demo CLI or programmatic API. The pipeline supports text-only, visual (VL), and audio (Omni) multimodal models.
Key outputs:
- MNN-format LLM model files (llm.mnn, llm.mnn.weight, tokenizer.txt, embeddings, config files)
- Compiled MNN engine with LLM inference support (libMNN, libllm)
- Interactive chat or batch text generation capability on CPU, GPU (OpenCL/Metal/CUDA), or NPU
Usage
Execute this workflow when you need to deploy an open-source LLM (Qwen, Llama, Baichuan, DeepSeek, etc.) for on-device inference on mobile phones, PCs, embedded systems, or web browsers. This is the primary workflow for building local, privacy-preserving LLM applications without cloud API dependencies.
Execution Steps
Step 1: Prepare the source LLM
Clone the target LLM from Hugging Face or ModelScope to your local filesystem. Ensure git-lfs is installed and functional so that the full model weights are downloaded, not just pointer files. Verify the download by checking file sizes of the model weight files.
Key considerations:
- Install git-lfs before cloning: git lfs install
- Supported model families include Qwen, Llama, Baichuan, DeepSeek, ChatGLM, and others
- Verify model weight file sizes after cloning to confirm complete download
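The prepare-and-verify step above can be sketched as follows. The model repository is an example placeholder, and is_lfs_pointer/check_weights are hypothetical helper names for the download check described above: a git-lfs pointer file is a small text stub, while real weight shards are large binaries.

```shell
#!/bin/sh
# Enable git-lfs once per machine so clones pull real weights
# (guarded so the line is a no-op where git-lfs is absent).
command -v git-lfs >/dev/null 2>&1 && git lfs install || true

# Example model; substitute any supported Qwen/Llama/Baichuan/DeepSeek repo:
# git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct

# A git-lfs pointer file is a tiny text stub beginning with a version URL;
# a fully downloaded weight shard is a large binary. This heuristic flags
# incomplete downloads left by a missing git-lfs.
is_lfs_pointer() {
    head -c 40 "$1" 2>/dev/null | grep -q '^version https://git-lfs'
}

check_weights() {
    dir="$1"
    for f in "$dir"/*.safetensors "$dir"/*.bin; do
        [ -e "$f" ] || continue
        if is_lfs_pointer "$f"; then
            echo "INCOMPLETE: $f is still an LFS pointer"
        else
            echo "OK: $f ($(wc -c < "$f") bytes)"
        fi
    done
}
```

Running check_weights over the cloned directory and seeing any INCOMPLETE line means the clone must be repeated after installing git-lfs.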
Step 2: Export model to MNN format
Run the llmexport.py script from the transformers/llm/export directory. This tool exports the PyTorch model to ONNX, converts it to MNN format, and extracts the tokenizer, embeddings, and configuration files. Configure quantization parameters (--quant_bit for 4 or 8 bit, --quant_block for block size, --hqq for improved quantization accuracy) and optionally merge LoRA weights (--lora_path) or apply GPTQ quantization (--gptq_path).
What happens:
- Model architecture is analyzed and the computation graph is exported to ONNX
- ONNX model is converted to MNN format via MNNConvert with transformer-specific fusion
- Tokenizer vocabulary is extracted and saved as tokenizer.txt
- Embedding weights are extracted as embeddings_bf16.bin (or reused from model weights for Tie-Embedding models)
- Runtime configuration files (config.json, llm_config.json) are generated
Key considerations:
- The --hqq flag is recommended for improved quantization precision
- For models under 8B parameters with Tie-Embedding, embeddings are reused from weights by default
- If direct MNN export fails, export to ONNX first (--export onnx) then convert manually with MNNConvert
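An export invocation using the flags listed above might look like the sketch below, written to a reviewable script. The model path, output directory (--dst_path), and block size of 128 are assumed example values; check them against llmexport.py --help in your checkout before running.

```shell
#!/bin/sh
# Sketch of the export command, saved as a script to review before running
# it from the transformers/llm/export directory of the MNN checkout.
cat > export_qwen.sh <<'EOF'
#!/bin/sh
# 4-bit block-quantized export with HQQ for better quantization accuracy.
python llmexport.py \
    --path /path/to/Qwen2.5-1.5B-Instruct \
    --export mnn \
    --quant_bit 4 \
    --quant_block 128 \
    --hqq \
    --dst_path ./qwen2.5-1.5b-mnn

# Fallback if direct MNN export fails: export to ONNX first, then convert
# manually with MNNConvert from the compiled engine.
# python llmexport.py --path /path/to/Qwen2.5-1.5B-Instruct --export onnx
EOF
chmod +x export_qwen.sh
```

Add --lora_path or --gptq_path to the same command when merging LoRA weights or applying GPTQ quantization.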
Step 3: Compile MNN engine with LLM support
Build the MNN C++ engine from source with the required LLM compilation flags. The minimum required flag is -DMNN_BUILD_LLM=true. Add platform-specific optimizations: -DMNN_AVX512=true for x86, -DMNN_OPENCL=true for Android GPU, -DMNN_METAL=true for iOS/macOS GPU. For multimodal (Omni) models supporting image and audio input, add -DMNN_BUILD_LLM_OMNI=true.
Key considerations:
- Minimum build flags: -DMNN_BUILD_LLM=true
- x86 platforms benefit significantly from -DMNN_AVX512=true
- Android builds use the project/android/build_64.sh script with additional flags
- iOS builds use package_scripts/ios/buildiOS.sh
- Web (WASM) builds require emcmake with -DMNN_FORBID_MULTI_THREAD=ON
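For a Linux x86 desktop, the build step above can be sketched as the script below. Only the flags named in this step are used; the repository URL and Release build type are standard but worth confirming against the MNN build documentation for your platform.

```shell
#!/bin/sh
# Sketch of a desktop x86 build; saved as a script for review.
cat > build_mnn_llm.sh <<'EOF'
#!/bin/sh
set -e
git clone https://github.com/alibaba/MNN.git
mkdir -p MNN/build && cd MNN/build

# Minimum LLM flag plus the x86 optimization noted above.
# On Android GPU use -DMNN_OPENCL=true; on iOS/macOS use -DMNN_METAL=true;
# for image/audio (Omni) models add -DMNN_BUILD_LLM_OMNI=true.
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DMNN_BUILD_LLM=true \
    -DMNN_AVX512=true
make -j"$(nproc)"
EOF
chmod +x build_mnn_llm.sh
```

Android and iOS builds go through the project scripts noted above (project/android/build_64.sh, package_scripts/ios/buildiOS.sh) rather than this direct cmake invocation.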
Step 4: Configure runtime parameters
Create or edit the config.json file in the model directory to control inference behavior. Configure hardware backend (backend_type: cpu/opencl/metal), thread count, precision strategy (low for FP16), memory strategy (low for runtime quantization), KV cache reuse for multi-turn dialogue, mmap settings for memory-constrained devices, and sampling parameters (sampler_type, temperature, topK, topP, penalty).
Key considerations:
- Set backend_type to "opencl" for Android GPU or "metal" for macOS/iOS GPU
- Set use_mmap=true on mobile devices to avoid memory overflow
- For diverse output, use sampler_type "mixed" with temperature 0.7
- The penalty sampler prevents repetitive output via n-gram penalties
- Set reuse_kv=true for efficient multi-turn conversations
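A config.json combining the settings above might look like the sketch below. The key names follow the parameters listed in this step; the topK/topP/penalty numbers are illustrative defaults, not tuned values. Since llmexport already emits a config.json, merge these keys into the generated file rather than replacing it wholesale.

```shell
#!/bin/sh
# Write an example runtime configuration for a mobile CPU deployment:
# FP16 precision, runtime-quantized memory, mmap, KV reuse, mixed sampling.
cat > config.json <<'EOF'
{
    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "reuse_kv": true,
    "use_mmap": true,
    "sampler_type": "mixed",
    "temperature": 0.7,
    "topK": 40,
    "topP": 0.9,
    "penalty": 1.1
}
EOF

# Sanity-check that the file is valid JSON before pointing llm_demo at it.
command -v python3 >/dev/null 2>&1 && python3 -m json.tool config.json > /dev/null
```

For GPU inference, switch backend_type to "opencl" (Android) or "metal" (macOS/iOS) and leave the rest unchanged.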
Step 5: Run LLM inference
Execute the compiled llm_demo binary with the path to config.json for interactive chat mode, or provide an additional prompt.txt file for batch processing. For multimodal models, embed image references with <img> tags or audio references with <audio> tags in the prompt text. Optionally use llm_bench for performance benchmarking across different configurations.
Key considerations:
- Interactive mode: ./llm_demo model_dir/config.json
- Batch mode: ./llm_demo model_dir/config.json prompt.txt
- Visual model prompts use <img>URL</img> syntax with optional <hw>height, width</hw> for size hints
- Audio model prompts use <audio>URL</audio> syntax
- Performance testing: ./llm_bench -m config.json -a cpu,opencl -t 4,8 -p 32,64
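The invocations above can be put together as follows, assuming batch mode reads one prompt per line. The image and audio URLs are placeholders showing the multimodal tag syntax, not working assets.

```shell
#!/bin/sh
# Example prompt file for batch mode: one prompt per line, using the
# <img> and <audio> tag syntax for multimodal (VL / Omni) models.
cat > prompt.txt <<'EOF'
Hello, introduce yourself briefly.
<img>https://example.com/cat.jpg</img>What is in this picture?
<audio>https://example.com/clip.wav</audio>Transcribe this audio clip.
EOF

# Interactive chat:
#   ./llm_demo model_dir/config.json
# Batch generation over the prompt file:
#   ./llm_demo model_dir/config.json prompt.txt
# Benchmark CPU vs OpenCL with 4/8 threads and 32/64-token prompts:
#   ./llm_bench -m model_dir/config.json -a cpu,opencl -t 4,8 -p 32,64
```

For visual models, an optional <hw>height, width</hw> hint can precede the image tag to control the resolution the image is resized to.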