Implementation:Alibaba MNN Llmexport Script
| Field | Value |
|---|---|
| implementation_name | Llmexport_Script |
| implementation_type | API Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Model Export |
| source_file | transformers/llm/export/llmexport.py (L676-758) |
| last_updated | 2026-02-10 14:00 GMT |
Summary
The llmexport.py script is the primary tool for converting HuggingFace-format LLM models into MNN inference format. It handles model loading, ONNX export, MNN conversion, weight quantization, tokenizer extraction, and configuration generation. The script is built around the LlmExporter class (inheriting from torch.nn.Module) and supports a wide range of model architectures through the ModelMapper system.
API Signature
python llmexport.py --path <model_dir> --export mnn [options]
Full CLI Signature
usage: llmexport.py [-h] --path PATH [--type TYPE] [--tokenizer_path TOKENIZER_PATH]
[--eagle_path EAGLE_PATH] [--lora_path LORA_PATH] [--gptq_path GPTQ_PATH]
[--dst_path DST_PATH] [--verbose] [--test TEST] [--export EXPORT]
[--onnx_slim] [--quant_bit QUANT_BIT] [--quant_block QUANT_BLOCK]
[--visual_quant_bit VISUAL_QUANT_BIT] [--visual_quant_block VISUAL_QUANT_BLOCK]
[--lm_quant_bit LM_QUANT_BIT] [--lm_quant_block LM_QUANT_BLOCK]
[--mnnconvert MNNCONVERT] [--ppl] [--awq] [--hqq] [--omni]
[--transformer_fuse] [--group_conv_native] [--smooth] [--sym]
[--visual_sym] [--seperate_embed] [--lora_split]
[--calib_data CALIB_DATA] [--act_bit ACT_BIT] [--embed_bit EMBED_BIT]
[--act_sym] [--quant_config QUANT_CONFIG] [--generate_for_npu]
[--skip_weight] [--omni_epochs OMNI_EPOCHS] [--omni_lr OMNI_LR]
[--omni_wd OMNI_WD]
Source Reference
The argument parser is defined in the build_args() function at line 676, the programmatic export() entry point is at line 724, and the CLI main() entry point is at line 737 of:
transformers/llm/export/llmexport.py
Key Imports
import onnx
import torch
from utils.model import LlmModel, EmbeddingModel
from utils.tokenizer import LlmTokenizer
from utils.spinner import spinner_run
from utils.custom_op import FakeLinear
from utils.onnx_rebuilder import OnnxRebuilder
from utils.mnn_converter import MNNConverter
from utils.awq_quantizer import AwqQuantizer
from utils.smooth_quantizer import SmoothQuantizer
from utils.omni_quantizer import OmniQuantizer
from utils.torch_utils import onnx_export
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --path | str | required | Path to the HuggingFace model directory or model ID |
| --export | str | None | Export format: 'onnx' or 'mnn' |
| --quant_bit | int | 4 | Quantization bit-width for model weights (4 or 8) |
| --quant_block | int | 64 | Quantization block size (0 = channel-wise) |
| --lm_quant_bit | int | None | Separate quantization bit-width for the LM head (defaults to --quant_bit) |
| --lm_quant_block | int | None | Separate quantization block size for the LM head (defaults to --quant_block) |
| --hqq | flag | False | Enable Half-Quadratic Quantization for improved accuracy |
| --awq | flag | False | Enable Activation-Aware Weight Quantization |
| --omni | flag | False | Enable OmniQuant learned quantization |
| --smooth | flag | False | Enable Smooth Quantization |
| --sym | flag | False | Use symmetric quantization (no zero-point) |
| --dst_path | str | './model' | Output directory for exported model files |
| --tokenizer_path | str | None | Custom tokenizer path (defaults to --path) |
| --lora_path | str | None | Path to LoRA weights for merging before export |
| --gptq_path | str | None | Path to pre-quantized GPTQ weights |
| --eagle_path | str | None | Path to an EAGLE speculative-decoding model |
| --mnnconvert | str | '../../../build/MNNConvert' | Path to a local MNNConvert binary (falls back to pymnn) |
| --seperate_embed | flag | False | Write embedding weights to embeddings_bf16.bin to keep them unquantized |
| --embed_bit | int | 16 | Embedding export bit precision (choices: 16, 8, 4) |
| --visual_quant_bit | int | None | Quantization bit-width for the visual encoder |
| --visual_sym | flag | False | Symmetric quantization for the visual model |
| --transformer_fuse | flag | False | Fuse vision transformer operations |
| --calib_data | str | None | Calibration data path for quantization methods that require it |
| --act_bit | int | 16 | Smooth quantization activation bit-width (8 or 16) |
| --test | str | None | Test model inference with the given query string |
| --skip_weight | flag | False | Skip loading model weights (for testing the export flow) |
| --generate_for_npu | flag | False | Generate a model for NPU deployment |
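As an illustration of the block-size semantics in the table, here is a hedged sketch (not code from llmexport.py) of the grouping that a --quant_block setting implies for block-wise weight quantization, where 0 means channel-wise:

```python
# Hedged sketch (not from llmexport.py): grouping implied by --quant_block.
# A block size of 0 means channel-wise quantization: one scale per output
# channel, i.e. a single group spanning the whole row.
def num_quant_groups(in_features: int, quant_block: int) -> int:
    if quant_block == 0:
        return 1  # channel-wise: the entire row shares one scale
    if in_features % quant_block != 0:
        raise ValueError("in_features must be divisible by quant_block")
    return in_features // quant_block
```

For example, a layer with 4096 input features and the default block size of 64 gets 64 quantization groups (scales) per output row.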
Inputs
- HuggingFace model directory containing config.json, tokenizer files, and weight files (.safetensors or .bin)
Outputs
The exported model directory (default ./model) contains:
| File | Description |
|---|---|
| llm.mnn | The MNN model graph file |
| llm.mnn.weight | The quantized weight data file |
| llm.mnn.json | MNN model JSON (for LoRA/GPTQ post-processing) |
| llm_config.json | Model configuration for the MNN runtime (hidden_size, layer_nums, key_value_shape, prompt_template) |
| config.json | Runtime inference configuration (modifiable by the user) |
| tokenizer.txt | Extracted tokenizer in MNN-compatible text format |
| embeddings_bf16.bin | Embedding weights in bf16 (only with --seperate_embed or for non-tied-embedding models) |
| onnx/llm.onnx | Intermediate ONNX model (not needed for inference) |
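After an export, the presence of the core files from this table can be checked with a small helper. This is a hedged convenience sketch, not part of llmexport.py; the artifact names come from the table above:

```python
from pathlib import Path

# Hedged helper (not part of llmexport.py): check that the core export
# artifacts listed in the Outputs table exist in the destination directory.
CORE_ARTIFACTS = ["llm.mnn", "llm.mnn.weight", "llm_config.json",
                  "config.json", "tokenizer.txt"]

def missing_artifacts(dst_path: str = "./model") -> list:
    root = Path(dst_path)
    return [name for name in CORE_ARTIFACTS if not (root / name).exists()]
```

An empty return value means all core files are present; otherwise the returned names identify what the export failed to produce.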
Usage Examples
Basic 4-bit Export
cd transformers/llm/export
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export mnn
Export with HQQ Quantization
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export mnn \
--hqq \
--quant_bit 4 \
--quant_block 64 \
--dst_path ./qwen2_mnn
Export with LoRA Merging
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--lora_path /path/to/lora_weights \
--export mnn \
--quant_bit 8
Two-Step Export via ONNX
# Step 1: Export to ONNX
python llmexport.py \
--path /path/to/Qwen2-0.5B-Instruct \
--export onnx
# Step 2: Manual MNN conversion with custom quantization
./MNNConvert --modelFile ./model/onnx/llm.onnx \
--MNNModel llm.mnn \
--keepInputFormat \
--weightQuantBits=4 \
--weightQuantBlock=128 \
-f ONNX \
--transformerFuse=1 \
--allowCustomOp \
--saveExternalData
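For automation, the two commands above can be assembled as argument lists and driven from Python (e.g. with subprocess.run). This sketch only mirrors the flags shown in the example; the paths and parameter defaults are illustrative:

```python
# Hedged sketch: build the two command lines from the two-step example
# above as argument lists suitable for subprocess.run. Flags mirror the
# example exactly; model_path and dst are illustrative placeholders.
def two_step_cmds(model_path, dst="./model", quant_bits=4, quant_block=128):
    onnx_cmd = ["python", "llmexport.py",
                "--path", model_path,
                "--export", "onnx"]
    mnn_cmd = ["./MNNConvert",
               "--modelFile", f"{dst}/onnx/llm.onnx",
               "--MNNModel", "llm.mnn",
               "--keepInputFormat",
               f"--weightQuantBits={quant_bits}",
               f"--weightQuantBlock={quant_block}",
               "-f", "ONNX",
               "--transformerFuse=1",
               "--allowCustomOp",
               "--saveExternalData"]
    return onnx_cmd, mnn_cmd
```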
Programmatic API
from llmexport import export
export('/path/to/Qwen2-0.5B-Instruct', export='mnn', quant_bit=4, hqq=True)
The export() function (line 724) provides a programmatic interface:
def export(path, **kwargs):
parser = argparse.ArgumentParser()
build_args(parser)
args = parser.parse_args(['--path', path])
for k, v in kwargs.items():
setattr(args, k, v)
if 'bge' in path:
llm_exporter = EmbeddingExporter(args)
else:
llm_exporter = LlmExporter(args)
llm_exporter.export(args.export)
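The parse-then-override pattern used by export() can be reproduced standalone. The sketch below is a tiny stand-in for the real build_args() parser (only three of its options), showing how keyword arguments overlay the parsed defaults via setattr():

```python
import argparse

# Standalone sketch of the kwargs-override pattern in export(). The parser
# here is a minimal stand-in for build_args(), not the real option set.
def make_args(path, **kwargs):
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", required=True)
    parser.add_argument("--export", default=None)
    parser.add_argument("--quant_bit", type=int, default=4)
    args = parser.parse_args(["--path", path])
    for k, v in kwargs.items():
        setattr(args, k, v)  # keyword arguments win over parser defaults
    return args
```

Note a caveat of this pattern: setattr() performs no validation, so a misspelled keyword silently creates a new attribute instead of raising an error.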
Notes
- The LlmExporter.__init__ method sets max_new_tokens = 1024 and dst_name = 'llm' as defaults.
- If --lm_quant_bit or --lm_quant_block is not specified, they default to the values of --quant_bit and --quant_block respectively.
- Embedding models (BGE, GTE, Qwen3-Embedding) are handled by the EmbeddingExporter class instead of LlmExporter.
- The script requires either pymnn installed or a valid MNNConvert binary path for the MNN conversion step.
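The LM-head fallback described in the notes can be expressed as a one-line resolution rule; this is a hedged sketch of the behavior, not code from llmexport.py:

```python
# Hedged sketch of the fallback noted above: LM-head quantization settings
# default to the global --quant_bit / --quant_block when left unspecified.
def resolve_lm_quant(quant_bit, quant_block, lm_quant_bit=None, lm_quant_block=None):
    return (lm_quant_bit if lm_quant_bit is not None else quant_bit,
            lm_quant_block if lm_quant_block is not None else quant_block)
```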
Related Pages
- Principle:Alibaba_MNN_LLM_Model_Export
- Environment:Alibaba_MNN_Python_Export_Environment
- Environment:Alibaba_MNN_HuggingFace_Ecosystem_Environment
- Implementation:Alibaba_MNN_HuggingFace_Model_Download - Previous step: downloading the model
- Implementation:Alibaba_MNN_CMake_Build_LLM - Next step: compiling the inference engine