Workflow: Alibaba MNN LLM Deployment Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Deployment, On_Device_AI |
| Last Updated | 2026-02-10 08:00 GMT |
Overview
End-to-end process for deploying large language models (LLMs) for on-device inference using MNN-LLM, covering model export from PyTorch/Hugging Face checkpoints, engine compilation, runtime configuration, and interactive text generation.
Description
This workflow covers the complete pipeline for taking a Hugging Face or ModelScope LLM (such as Qwen, Llama, Baichuan, DeepSeek, or similar transformer models) and deploying it for efficient on-device inference using MNN's LLM runtime. The process involves exporting the model to MNN format using the llmexport tool (which handles tokenizer extraction, embedding export, and weight quantization), compiling the MNN engine with LLM support macros, configuring runtime parameters (hardware backend, sampling strategy, KV cache, memory management), and running inference through the llm_demo CLI or programmatic API. The pipeline supports text-only, visual (VL), and audio (Omni) multimodal models.
Key outputs:
- MNN-format LLM model files (llm.mnn, llm.mnn.weight, tokenizer.txt, embeddings, config files)
- Compiled MNN engine with LLM inference support (libMNN, libllm)
- Interactive chat or batch text generation capability on CPU, GPU (OpenCL/Metal/CUDA), or NPU
Usage
Execute this workflow when you need to deploy an open-source LLM (Qwen, Llama, Baichuan, DeepSeek, etc.) for on-device inference on mobile phones, PCs, embedded systems, or web browsers. This is the primary workflow for building local, privacy-preserving LLM applications without cloud API dependencies.
Execution Steps
Step 1: Prepare the source LLM
Clone the target LLM from Hugging Face or ModelScope to your local filesystem. Ensure git-lfs is installed and functional so that the full model weights are downloaded, not just pointer files. Verify the download by checking file sizes of the model weight files.
Key considerations:
- Install git-lfs before cloning: git lfs install
- Supported model families include Qwen, Llama, Baichuan, DeepSeek, ChatGLM, and others
- Verify model weight file sizes after cloning to confirm complete download
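The prepare-and-verify step above can be sketched as follows. The model repository is an example placeholder, and is_lfs_pointer/check_weights are hypothetical helper names for the download check described above: a git-lfs pointer file is a small text stub, while real weight shards are large binaries.

```shell
#!/bin/sh
# Enable git-lfs once per machine so clones pull real weights
# (guarded so the line is a no-op where git-lfs is absent).
command -v git-lfs >/dev/null 2>&1 && git lfs install || true

# Example model; substitute any supported Qwen/Llama/Baichuan/DeepSeek repo:
# git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct

# A git-lfs pointer file is a tiny text stub beginning with a version URL;
# a fully downloaded weight shard is a large binary. This heuristic flags
# incomplete downloads left by a missing git-lfs.
is_lfs_pointer() {
    head -c 40 "$1" 2>/dev/null | grep -q '^version https://git-lfs'
}

check_weights() {
    dir="$1"
    for f in "$dir"/*.safetensors "$dir"/*.bin; do
        [ -e "$f" ] || continue
        if is_lfs_pointer "$f"; then
            echo "INCOMPLETE: $f is still an LFS pointer"
        else
            echo "OK: $f ($(wc -c < "$f") bytes)"
        fi
    done
}
```

Running check_weights over the cloned directory and seeing any INCOMPLETE line means the clone must be repeated after installing git-lfs.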
Step 2: Export model to MNN format
Run the llmexport.py script from the transformers/llm/export directory. This tool exports the PyTorch model to ONNX, converts it to MNN format, and extracts the tokenizer, embeddings, and configuration files. Configure quantization parameters (--quant_bit for 4 or 8 bit, --quant_block for block size, --hqq for improved quantization accuracy) and optionally merge LoRA weights (--lora_path) or apply GPTQ quantization (--gptq_path).
What happens:
- Model architecture is analyzed and the computation graph is exported to ONNX
- ONNX model is converted to MNN format via MNNConvert with transformer-specific fusion
- Tokenizer vocabulary is extracted and saved as tokenizer.txt
- Embedding weights are extracted as embeddings_bf16.bin (or reused from model weights for Tie-Embedding models)
- Runtime configuration files (config.json, llm_config.json) are generated
Key considerations:
- The --hqq flag is recommended for improved quantization precision
- For models under 8B parameters with Tie-Embedding, embeddings are reused from weights by default
- If direct MNN export fails, export to ONNX first (--export onnx) then convert manually with MNNConvert
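An export invocation using the flags listed above might look like the sketch below, written to a reviewable script. The model path, output directory (--dst_path), and block size of 128 are assumed example values; check them against llmexport.py --help in your checkout before running.

```shell
#!/bin/sh
# Sketch of the export command, saved as a script to review before running
# it from the transformers/llm/export directory of the MNN checkout.
cat > export_qwen.sh <<'EOF'
#!/bin/sh
# 4-bit block-quantized export with HQQ for better quantization accuracy.
python llmexport.py \
    --path /path/to/Qwen2.5-1.5B-Instruct \
    --export mnn \
    --quant_bit 4 \
    --quant_block 128 \
    --hqq \
    --dst_path ./qwen2.5-1.5b-mnn

# Fallback if direct MNN export fails: export to ONNX first, then convert
# manually with MNNConvert from the compiled engine.
# python llmexport.py --path /path/to/Qwen2.5-1.5B-Instruct --export onnx
EOF
chmod +x export_qwen.sh
```

Add --lora_path or --gptq_path to the same command when merging LoRA weights or applying GPTQ quantization.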
Step 3: Compile MNN engine with LLM support
Build the MNN C++ engine from source with the required LLM compilation flags. The minimum required flag is -DMNN_BUILD_LLM=true. Add platform-specific optimizations: -DMNN_AVX512=true for x86, -DMNN_OPENCL=true for Android GPU, -DMNN_METAL=true for iOS/macOS GPU. For multimodal (Omni) models supporting image and audio input, add -DMNN_BUILD_LLM_OMNI=true.
Key considerations:
- Minimum build flags: -DMNN_BUILD_LLM=true
- x86 platforms benefit significantly from -DMNN_AVX512=true
- Android builds use the project/android/build_64.sh script with additional flags
- iOS builds use package_scripts/ios/buildiOS.sh
- Web (WASM) builds require emcmake with -DMNN_FORBID_MULTI_THREAD=ON
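For a Linux x86 desktop, the build step above can be sketched as the script below. Only the flags named in this step are used; the repository URL and Release build type are standard but worth confirming against the MNN build documentation for your platform.

```shell
#!/bin/sh
# Sketch of a desktop x86 build; saved as a script for review.
cat > build_mnn_llm.sh <<'EOF'
#!/bin/sh
set -e
git clone https://github.com/alibaba/MNN.git
mkdir -p MNN/build && cd MNN/build

# Minimum LLM flag plus the x86 optimization noted above.
# On Android GPU use -DMNN_OPENCL=true; on iOS/macOS use -DMNN_METAL=true;
# for image/audio (Omni) models add -DMNN_BUILD_LLM_OMNI=true.
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DMNN_BUILD_LLM=true \
    -DMNN_AVX512=true
make -j"$(nproc)"
EOF
chmod +x build_mnn_llm.sh
```

Android and iOS builds go through the project scripts noted above (project/android/build_64.sh, package_scripts/ios/buildiOS.sh) rather than this direct cmake invocation.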
Step 4: Configure runtime parameters
Create or edit the config.json file in the model directory to control inference behavior. Configure hardware backend (backend_type: cpu/opencl/metal), thread count, precision strategy (low for FP16), memory strategy (low for runtime quantization), KV cache reuse for multi-turn dialogue, mmap settings for memory-constrained devices, and sampling parameters (sampler_type, temperature, topK, topP, penalty).
Key considerations:
- Set backend_type to "opencl" for Android GPU or "metal" for macOS/iOS GPU
- Set use_mmap=true on mobile devices to avoid memory overflow
- For diverse output, use sampler_type "mixed" with temperature 0.7
- The penalty sampler prevents repetitive output via n-gram penalties
- Set reuse_kv=true for efficient multi-turn conversations
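A config.json combining the settings above might look like the sketch below. The key names follow the parameters listed in this step; the topK/topP/penalty numbers are illustrative defaults, not tuned values. Since llmexport already emits a config.json, merge these keys into the generated file rather than replacing it wholesale.

```shell
#!/bin/sh
# Write an example runtime configuration for a mobile CPU deployment:
# FP16 precision, runtime-quantized memory, mmap, KV reuse, mixed sampling.
cat > config.json <<'EOF'
{
    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "reuse_kv": true,
    "use_mmap": true,
    "sampler_type": "mixed",
    "temperature": 0.7,
    "topK": 40,
    "topP": 0.9,
    "penalty": 1.1
}
EOF

# Sanity-check that the file is valid JSON before pointing llm_demo at it.
command -v python3 >/dev/null 2>&1 && python3 -m json.tool config.json > /dev/null
```

For GPU inference, switch backend_type to "opencl" (Android) or "metal" (macOS/iOS) and leave the rest unchanged.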
Step 5: Run LLM inference
Execute the compiled llm_demo binary with the path to config.json for interactive chat mode, or provide an additional prompt.txt file for batch processing. For multimodal models, embed image references with <img> tags or audio references with <audio> tags in the prompt text. Optionally use llm_bench for performance benchmarking across different configurations.
Key considerations:
- Interactive mode: ./llm_demo model_dir/config.json
- Batch mode: ./llm_demo model_dir/config.json prompt.txt
- Visual model prompts use <img>URL</img> syntax with optional <hw>height, width</hw> for size hints
- Audio model prompts use <audio>URL</audio> syntax
- Performance testing: ./llm_bench -m config.json -a cpu,opencl -t 4,8 -p 32,64
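The invocations above can be put together as follows, assuming batch mode reads one prompt per line. The image and audio URLs are placeholders showing the multimodal tag syntax, not working assets.

```shell
#!/bin/sh
# Example prompt file for batch mode: one prompt per line, using the
# <img> and <audio> tag syntax for multimodal (VL / Omni) models.
cat > prompt.txt <<'EOF'
Hello, introduce yourself briefly.
<img>https://example.com/cat.jpg</img>What is in this picture?
<audio>https://example.com/clip.wav</audio>Transcribe this audio clip.
EOF

# Interactive chat:
#   ./llm_demo model_dir/config.json
# Batch generation over the prompt file:
#   ./llm_demo model_dir/config.json prompt.txt
# Benchmark CPU vs OpenCL with 4/8 threads and 32/64-token prompts:
#   ./llm_bench -m model_dir/config.json -a cpu,opencl -t 4,8 -p 32,64
```

For visual models, an optional <hw>height, width</hw> hint can precede the image tag to control the resolution the image is resized to.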